Usage

We will demonstrate an example of the sainsc workflow using the Stereo-seq Mouse Embryo E16.5 (E1S3) from the original Stereo-seq publication.

The Stereo-seq dataset used in this analysis is available at https://db.cngb.org/stomics/mosta.

To follow along make sure you installed sainsc with the ‘data’ extra (pip install sainsc[data]) and the scanpy, squidpy, and colorcet package.

First we need to load some essential packages and define the input file we are going to use.

from pathlib import Path

import numpy as np
import pandas as pd

from sainsc import read_StereoSeq

# location of the Stereo-seq data
stereo_seq_file = Path("path/to/stereoseq/data") / "E16.5_E1S3_GEM_bin1.tsv.gz"

Load and prepare data

First we are going to load the Stereo-seq data from the GEM file.

embryo = read_StereoSeq(stereo_seq_file)

print(embryo)

LazyKDE (32 threads)
    genes: 28633
    shape: (44100, 26460)
    resolution: 500.0 nm / px

First, we want to get a quick overview of what our sample looks like.

embryo.calculate_total_mRNA()

_ = embryo.plot_genecount(im_kwargs={"vmax": 5})

_images/3b2d779d34b3a77226dbe34eceee0ff662b987899b39392301ce8bc671666779.png

Selecting the region of interest

We can see that the region of interest in our case is much smaller than the total area of our sample. Cropping and masking helps to focus only on the relevant parts of our tissue, and on top will reduce memory usage and processing time.

So as a first step we can crop away the parts of the tissue that we don’t need. Make sure to recalculate the total mRNA as otherwise the previously calculated counts will be shown.

embryo.counts.crop((5_000, 33_000), (9_000, None))

embryo.calculate_total_mRNA()

_ = embryo.plot_genecount(im_kwargs={"vmax": 5})

_images/0c507095da0a089ff8faadf4c0f92e9bad5aaf60641ff241f08fee7c9a2659f9.png

If we want to go even further we can “mask” the region of interest. For example, we can export the count array and use it to manually draw a selection mask.

from PIL import Image

# export total mRNA as png
u8 = np.uint8
Image.fromarray(
    np.clip(embryo.total_mRNA, np.iinfo(u8).min, np.iinfo(u8).max).astype(u8)
).save("embryo_counts.png")

This is what are region of interest (ROI) looks like after manually drawing the outline …

import matplotlib.pyplot as plt

from sainsc.datasets import fetch_embryo_mask

# avoid the DecompressionBombError
Image.MAX_IMAGE_PIXELS = None

mask_file = fetch_embryo_mask()
embryo_mask = np.array(Image.open(mask_file), dtype=bool)

_ = plt.imshow(embryo_mask.T, origin="lower")

_images/a97baeb9124edf44fcb235f80c723b25ce42af1f115d52728fa2d9e72ffd2139.png

Using it to filter the counts is easy, just make sure that the shapes of the mask and the sample are the same.

embryo.counts.filter_mask(embryo_mask)

We can filter the genes that have no counts left after cropping and masking.

embryo.counts.filter_genes_by_count(1)

Let’s check the result.

embryo.calculate_total_mRNA()

_ = embryo.plot_genecount(im_kwargs={"vmax": 5})

_images/7e012967ca156cc646b643ca729ce54ee10e97ab2bc673a3fd0ae4b284c54f38.png

We can visualize the number of counts per gene as a histogram to get a quick overview of the distribution.

_ = embryo.plot_genecount_histogram()

_images/16f1bf30ca596a03f484b048ffd114efa7e35aae2522abe072e8254ce22aa694.png

Additional functionality like filtering the genes by counts, or selecting genes by names are available for sainsc.GridCounts.

KDE

To estimate the kernel density we first need to define a kernel. This can easily be done using the sainsc.LazyKDE.gaussian_kernel method, allowing us to either specify the bandwidth in pixels or um. Generally, we found ~ 4 um to work for most samples.

Except for the total mRNA, the KDE is not stored but rather calculated whenever it is needed (e.g. for plotting or assigning cell types).

embryo.gaussian_kernel(4, unit="um")

embryo.calculate_total_mRNA_KDE()

_ = embryo.plot_KDE()

_images/9fdeec7b18615deb7b4807571630b97b3b6c249e263f52509dd0cd746509583d.png

Filtering background

The histogram of the total mRNA KDE can be useful to define which areas are likely to be background (i.e. empty, noisy areas).

_ = embryo.plot_KDE_histogram(bins=100)

_images/ec09f7d623284b6d47b937fd0abd767f713c2af6fb7a384b40b6fff6cd5d68ad.png

Here we will use a rough estimation of 0.9, but will later make use of the fact that we can also filter the background with a threshold per cell type.

Background will not be removed so data will not be overwritten to avoid recomputation. Rather corresponding values are set to 0 when plotting or similar.

embryo.filter_background(0.9)

_ = embryo.plot_KDE(remove_background=True)

_images/ca2e6a6efa924eecd12a384b9556362ffde99c1832e083541b921f95a9d3bacc.png

Cell-type assignment

All we need to assign a cell type to every pixel of our sample is a kernel, which we previously already defined, and a set of gene expression signatures.

The gene expression signatures can be based on prior knowledge (e.g. scRNAseq experiments) or can be computed de novo which we will demonstrate later.

Known signatures

Here, our signatures are pre-computed already and based on the segmented data published in the original Stereo-seq publication by Chen et al 2022 using the sainsc.utils.celltype_signatures function.

from sainsc.datasets import fetch_embryo_signatures

signatures = fetch_embryo_signatures()
signatures.iloc[:, :4].head()

	Cardiomyocyte	Chondrocyte	Choroid plexus	Dorsal midbrain neuron
gene
1110017D15Rik	0.000022	0.000039	0.001147	0.000077
1110035H17Rik	0.000003	0.000002	0.000000	0.000010
1190001M18Rik	0.000002	0.000003	0.000000	0.000005
1300014J16Rik	0.000003	0.000005	0.000000	0.000004
1500015O10Rik	0.000069	0.000884	0.003061	0.000075

Now we can assign the cell type. This is usually the most computationally expensive step.

embryo.assign_celltype(signatures)

print(embryo)

LazyKDE (32 threads)
    genes: 28558
    shape: (28000, 17460)
    resolution: 500.0 nm / px
    kernel: (33, 33)
    background: set
    celltypes: 25

Visualizing the cell-type map is straightforward. You can use a custom color-mapping for the cell types as well.

import colorcet as cc
import seaborn as sns

cmap = dict(
    zip(embryo.celltypes, sns.color_palette(cc.glasbey, n_colors=len(embryo.celltypes)))
)

_ = embryo.plot_celltype_map(cmap=cmap)

_images/e03846a2ddce5a6014b095e1e459a2aec59c4b5b78018e5bb28e09835c1a93f8.png

As most plotting functions, it also allows “zooming” into regions more closely.

_ = embryo.plot_celltype_map(
    crop=((2_000, 6_000), (10_500, 15_000)),
    cmap=cmap,
    scalebar_kwargs={"location": "upper left"},
)

_images/0b8207a7be0b8bca52ddc14b60a1a578e4dc0aba04a9617829d4b25b88e32b66.png

The assignment score is helpful to identify regions that have been assigned with high confidence, vs regions with low confidence. Please refer to the sainsc publication for details.

_ = embryo.plot_assignment_score()

_images/8f40cd27e178eb54473dc65f4b92c19196efde1cfd87a231f03788f1ad1d6433.png

De novo cell-type signatures

To generate a cell-type map without prior knowledge of the cell types or their expression profiles, we can use expression signatures that we generate from the data directly.

Briefly, we identify local maxima of the gene expression as proxies for cells. We load the KDE of these cell proxies and use it to cluster the data following standard single-cell RNA sequencing workflows. Using the expression of the extracted clusters, we can generate the cell-type map.

First off, we need to identify the local maxima. We only need to specify the minimum distance between two maxima, but it can be helpful to filter the background and only identify maxima in connected regions of a minimum size.

embryo.find_local_maxima(4, min_area=60)

print(embryo)

LazyKDE (32 threads)
    genes: 28558
    shape: (28000, 17460)
    resolution: 500.0 nm / px
    kernel: (33, 33)
    background: set
    local maxima: 171503
    celltypes: 25

Relevant genes can be found either;

through prior knowledge from e.g. literature
by loading the data in bins using sainsc.io.read_StereoSeq_bins and then identifying highly-variable (HVG) or spatially-variable genes (SVG)
loading all genes for the local maxima and detect HVGs/SVGs based on their expression

Here, we will follow the last approach. We load the local maxima as AnnData …

cellproxy_adata = embryo.load_local_maxima(embryo.genes)

print(cellproxy_adata)

AnnData object with n_obs × n_vars = 171503 × 28558
    obs: 'total_mRNA_KDE', 'cosine_similarity', 'assignment_score', 'celltype'
    obsm: 'spatial'

… and process it using a standard scanpy/squidpy workflow.

import scanpy as sc

# remove potential genes w/o counts
cellproxy_adata = cellproxy_adata[:, cellproxy_adata.X.sum(axis=0).A1 > 0].copy()

# "normalize" data
sc.pp.normalize_total(cellproxy_adata)
sc.pp.log1p(cellproxy_adata)

import squidpy as sq

# random seed for reproducibility
seed = 42

# find SVGs
sq.gr.spatial_neighbors(cellproxy_adata, coord_type="generic")
sq.gr.spatial_autocorr(cellproxy_adata, mode="moran", seed=seed)

svg = cellproxy_adata.uns["moranI"]["I"].dropna().nlargest(2_000).index

# find cell-type clusters
sc.pp.pca(
    cellproxy_adata, mask_var=cellproxy_adata.var_names.isin(svg), random_state=seed
)
sc.pp.neighbors(cellproxy_adata, random_state=seed)
sc.tl.umap(cellproxy_adata, min_dist=0.1, random_state=seed)
sc.tl.leiden(cellproxy_adata, resolution=1.5, random_state=seed)
sc.pl.umap(cellproxy_adata, color="leiden")

/tmp/ipykernel_2270617/2345846166.py:7: FutureWarning: In the future, the default backend for leiden will be igraph instead of leidenalg.

 To achieve the future defaults please pass: flavor="igraph" and n_iterations=2.  directed must also be False to work with igraph's implementation.
  sc.tl.leiden(cellproxy_adata, resolution=1.5, random_state=seed)

_images/35d487e62a30b6672c33166a0929c44aaf6394804771a73a4f81f90fc014d09e.png

Once we have identified our cell-type clusters we can extract the cell type signatures and proceed as previously.

from sainsc.utils import celltype_signatures

signatures_denovo = celltype_signatures(cellproxy_adata[:, svg], celltype_col="leiden")

embryo.assign_celltype(signatures_denovo, log=True)

cmap_denovo = dict(
    zip(
        cellproxy_adata.obs["leiden"].cat.categories,
        cellproxy_adata.uns["leiden_colors"],
    )
)

_ = embryo.plot_celltype_map(cmap=cmap_denovo)

_images/310cbb7a7c480b3022428ef27b0677fefd745b6344fb2801552011fc428fee8c.png

Miscellaneous

Analysing other spatial transcriptomics technologies

To analyse other technologies beside Stereo-seq such as imaging-based (e.g. Xenium) we only need to generate a sainsc.GridCounts or sainsc.LazyKDE instance from a dataframe and then proceed as previously described. If the data is stored in GEM file format you can use the sainsc.io.read_gem_file function to conveniently read the file and ensure the correct format. Otherwise we can manually prepare the dataframe as shown below for the case of Xenium data (obtained from 10x website).

The dataframe needs to contain a ‘gene’, ‘x’, and ‘y’ column. If a ‘count’ column is present it will be used otherwise a count of 1 (single transcript) will be assumed for each row. If the ‘x’ and ‘y’ columns are integers they will be assumed as indices of a square grid. Otherwise we can additionally define a bin size to use when instantiating the objects.

import polars as pl

xenium_file = Path("path/to/xenium/data") / "transcripts.csv.gz"
n_threads = 16

# Read xenium file, rename columns and filter blanks/controls
transcripts = (
    pl.read_csv(
        xenium_file,
        columns=["feature_name", "x_location", "y_location"],
        schema_overrides={"feature_name": pl.Categorical},
        n_threads=n_threads,
    )
    .rename({"feature_name": "gene", "x_location": "x", "y_location": "y"})
    .filter(~pl.col("gene").cast(pl.Utf8).str.contains("(BLANK|NegControl)"))
)

We can now generate a sainsc.GridCounts object from this dataframe. We will generate bins with a size of 500 nm.

from sainsc import GridCounts

xenium_counts = GridCounts.from_dataframe(
    transcripts, resolution=1000, binsize=0.5, n_threads=n_threads
)

print(xenium_counts)

GridCounts (16 threads)
    genes: 248
    shape: (20545, 14065)

Alternatively we can directly generate a sainsc.LazyKDE object. This has the additional benefit, that we can supply either a polars.DataFrame or a pandas.DataFrame. sainsc.GridCounts can only be generated from a polars.DataFrame.

from sainsc import LazyKDE

xenium = LazyKDE.from_dataframe(
    transcripts, resolution=1000, binsize=0.5, n_threads=n_threads
)

print(xenium)

LazyKDE (16 threads)
    genes: 248
    shape: (20545, 14065)
    resolution: 500.0 nm / px

Visualizing genes

Some plotting functions allow for gene-specific plots instead of using the total mRNA.

_ = embryo.plot_KDE(gene="Ttr", crop=((5_000, 6_500), (4_500, 6_000)))

_images/ee4d54185b68f7a6f8367629640a4215fe2a4ffdfe837fa06ec51724e5e00333.png

Cell type-specific thresholds

Background filtering support cell type-specific thresholding, this can be useful as the total mRNA in some cell types might be higher (especially when working with smaller gene panels such as from Xenium data).

kde_per_celltype = pd.DataFrame(
    {
        "kde": embryo.total_mRNA_KDE.flatten(),
        "celltype": pd.Categorical.from_codes(
            embryo.celltype_map.ravel(), categories=embryo.celltypes
        ),
    }
).dropna()

celltype_threshold = (
    kde_per_celltype.groupby("celltype", observed=True).median()["kde"] / 2
).to_dict()

min_t = 0.9

celltype_threshold = {
    ct: (t if t > min_t else min_t) for ct, t in celltype_threshold.items()
}

embryo.filter_background(celltype_threshold)

Saving the analysis

Both, the sainsc.LazyKDE and sainsc.GridCounts classes, can be serialized using the python pickle module. This way you can always come back to where you left off.

import pickle

with open("my_analysis.pickle", "wb") as file:
    pickle.dump(embryo, file)