Loading data
In this tutorial, we will cover how to get your data loaded and ready to be analyzed by sainsc.
Some of the common technologies are supported by default.
The typical output of the functions to load your data is a sainsc.LazyKDE object, which you can use to perform the analysis.
When printed, it will show us some basic information, including the resolution per pixel, the number of genes, and the size of the dataset.
Its ‘counts’ attribute contains a sainsc.GridCounts object, which actually stores the transcript information.
Though there is rarely a need to directly interact with it.
Data from imaging-based transcriptomics will be binned to a size of 0.5 µm per pixel by default, so this can be customized when loading the data.
When loading the dataset, you can already specify how many threads you want to use to perform the analysis later on. Generally, we recommend not using more threads than you have CPUs available.
from pathlib import Path
10x Genomics Xenium
Xenium data can be loaded using sainsc directly. The coordinate information is stored in µm, and by default, the binning is performed to 500 nm per pixel. Control probes will be removed automatically.
Both the .csv.gz and .parquet files are supported, though the latter should be preferred due to better performance.
To demonstrate, we will use the Xenium Prime Mouse Pup dataset (available from the 10x Genomics website).
from sainsc.io import read_Xenium
# .csv.gz works too, but .parquet is preferred
transcripts_file = Path("path/to/xenium_data") / "transcripts.parquet"
xenium = read_Xenium(transcripts_file)
print(xenium)
LazyKDE (16 threads)
genes: 5010
shape: (22901, 47623)
resolution: 500.0 nm / px
10x Genomics Atera
The Atera output format is basically identical to the new versions of Xenium. Atera data will also be binned to 500 nm by default, and all control probes will be removed.
To demonstrate, we will use the Atera Human Cervical Cancer dataset (available from the 10x Genomics website).
from sainsc.io import read_Atera
transcripts_file = Path("path/to/atera_data") / "transcripts.parquet"
atera = read_Atera(transcripts_file)
print(atera)
LazyKDE (16 threads)
genes: 18028
shape: (17858, 17799)
resolution: 500.0 nm / px
Vizgen MERSCOPE
The Vizgen MERSCOPE output is similar to Xenium in the sense that the coordinates are also stored in µm. The data will be binned to a 500 nm pixel size by default. All control probes will be removed.
Both the .csv and .parquet files (only available for newer datasets) are supported. However, the .parquet file should be preferred if available.
The Vizgen MERSCOPE Liver showcase dataset is used for demonstration (available here).
from sainsc.io import read_MERSCOPE
transcripts_file = Path("path/to/merscope_data") / "detected_transcripts.csv"
merscope = read_MERSCOPE(transcripts_file)
print(merscope)
LazyKDE (16 threads)
genes: 347
shape: (20654, 19420)
resolution: 500.0 nm / px
Stereo-seq
Stereo-seq data can be directly loaded from the GEM file that stores the unbinned transcript measurements, i.e., the counts per DNA-nanoball. The coordinates should be in integer coordinates, and the resolution depends on the chip but is typically 500 nm. The chip and its resolution can usually be automatically detected from the metadata. But it never hurts to double-check that you got the right information.
Here, for the demonstration, we will use the Stereo-seq Mouse Embryo E16.5 (E1S3) from the original Stereo-seq publication (available at https://db.cngb.org/stomics/mosta).
from sainsc.io import read_StereoSeq
# location of the Stereo-seq GEM file
gem_file = Path("path/to/stereoseq_data") / "E16.5_E1S3_GEM_bin1.tsv.gz"
stereo_seq = read_StereoSeq(gem_file)
print(stereo_seq)
LazyKDE (16 threads)
genes: 28633
shape: (44100, 26460)
resolution: 500.0 nm / px
10x Genomics VisiumHD
We will look at a colorectal cancer (CRC) sample profiled using VisiumHD. The data is available from the 10x Genomics website (see here). For VisiumHD, we need to provide the path to the unbinned 2 µm data of the sample.
The reduced resolution makes VisiumHD less suitable for unsupervised cell-type assignment from local maxima. However, with suitable reference data or using the larger bin sizes, the annotation via label transfer can improve the resolution of the assigned cell types (see the dedicated VisiumHD tutorial).
from sainsc.io import read_VisiumHD
# location of the unbinned, 2-µm resolution data
unbinned_path = Path("path/to/visiumhd_data") / "binned_outputs" / "square_002um"
visium_hd = read_VisiumHD(unbinned_path)
print(visium_hd)
LazyKDE (16 threads)
genes: 18072
shape: (3350, 3350)
resolution: 2000.0 nm / px
Other technologies
To analyse other technologies besides the ones listed above (also have a look at the sainsc.io module), or if you need to pre-filter your dataset, we only need to generate a sainsc.GridCounts or sainsc.LazyKDE instance from a DataFrame.
If the data is stored in GEM file format, you can use the sainsc.io.read_gem_file function to conveniently read the file and ensure the correct format.
In case the technology/file format is not supported, we can manually prepare the data as shown below for the case of an Xenium dataset. The dataframe needs to contain a ‘gene’, ‘x’, and ‘y’ column. If a ‘count’ column is present, it will be used; otherwise, a count of 1 (single transcript) will be assumed for each row. If the ‘x’ and ‘y’ columns are integers, they will be assumed as indices of a square grid. Otherwise, we can additionally define a bin size to use when instantiating the objects.
import polars as pl
xenium_file = Path("path/to/xenium_data") / "transcripts.parquet"
# Read Xenium file, rename columns, and filter blanks/controls
transcripts = (
pl.scan_parquet(xenium_file)
.filter(pl.col("is_gene"))
.rename({"feature_name": "gene", "x_location": "x", "y_location": "y"})
.select(["gene", "x", "y"])
.with_columns(pl.col("gene").cast(pl.Categorical))
.collect()
)
transcripts.head()
| gene | x | y |
|---|---|---|
| cat | f32 | f32 |
| "Adam10" | 196.375 | 4348.546875 |
| "Adamts3" | 120.0625 | 4474.3125 |
| "Aldh1a3" | 218.828125 | 4420.296875 |
| "Batf3" | 245.625 | 4487.671875 |
| "Casp3" | 208.421875 | 4469.578125 |
We can now generate a sainsc.GridCounts object from this polars.DataFrame.
The resolution defines how large one unit of the coordinates is in nm.
Here, our coordinates are measured in µm, and therefore we set the resolution to 1,000.
To generate bins with a size of 500 nm, we set the binsize to 0.5.
from sainsc import GridCounts
n_threads = 8
xenium_counts = GridCounts.from_dataframe(
transcripts, resolution=1_000, binsize=0.5, n_threads=n_threads
)
print(xenium_counts)
GridCounts (8 threads)
genes: 5010
shape: (22901, 47623)
resolution: 500.0 nm / px
Alternatively, we can directly generate a sainsc.LazyKDE object.
This has the additional benefit that we can supply either a polars.DataFrame or a pandas.DataFrame.
sainsc.GridCounts can only be generated from a polars.DataFrame.
from sainsc import LazyKDE
xenium = LazyKDE.from_dataframe(
transcripts, resolution=1_000, binsize=0.5, n_threads=n_threads
)
print(xenium)
LazyKDE (8 threads)
genes: 5010
shape: (22901, 47623)
resolution: 500.0 nm / px