Quick Start (Multi-sample)#

Multi samples#

Multi-sample data set consists of continuous or time-series samples. This quick start would help you learn about handling them rapidly.

MSData construction#

Please download the example data first. Here we use drosophila data for demo. Main input file formats are GEM/GEF (from Stereo-seq), H5ad (from Scanpy). Add your dataset as below:

[2]:
import sys
import os
from natsort import natsorted
import stereo as st
from stereo.core.ms_data import MSData
from stereo.core.ms_pipeline import slice_generator

import warnings
warnings.filterwarnings('ignore')


# prepara for input directory
data_dir = './Demo_3D/3D_AnnData_0.8.0'

data_list=[]
for fn in os.listdir(data_dir):
    data_list.append(os.path.join(data_dir, fn))

# ensure data order by naming them regularly
data_list = natsorted(data_list)

# construct MSData object
ms_data = MSData(_relationship='other', _var_type='intersect')

# when come to loaded data object
# ms_data = MSData(_data_list=[data1, data2], _names=['s1', 's2'], _relationship='other', _var_type='intersect')

# add all samples into MSData
for sample in data_list:
    ms_data += st.io.read_h5ad(file_path=sample, bin_type='bins', bin_size=1)

After loading sorted data into MSData object, just type it to obtain basic information.

[3]:
ms_data
[3]:
ms_data: {'0': (482, 13668), '1': (549, 13668), '2': (598, 13668), '3': (713, 13668), '4': (744, 13668), '5': (815, 13668), '6': (925, 13668), '7': (1272, 13668), '8': (1263, 13668), '9': (1248, 13668), '10': (1039, 13668), '11': (1260, 13668), '12': (959, 13668), '13': (1078, 13668), '14': (1240, 13668), '15': (1110, 13668)}
num_slice: 16
names: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15']
obs: []
var: []
relationship: other
var_type: intersect to 0
mss: []

Get one of samples like when you work with Python list.

[4]:
ms_data[0]
[4]:
AnnData object with n_obs × n_vars = 482 × 13668
    obs: 'slice_ID', 'raw_x', 'raw_y', 'new_x', 'new_y', 'new_z', 'annotation'
    uns: 'bin_type', 'bin_size', 'sn'
    obsm: 'X_umap', 'spatial', 'spatial_elas', 'spatial_rigid'
    layers: 'raw_counts'

Note

The analysis results, like quality control metrics, UMAP and … , are just stayed in each sample data object, which are not associated with multiple samples. When it comes to the usage of each sample annotation result, a simple function, ms_data.to_integrate(), will be introduced later to meet your analysis needs.

The index of each sample is related to the order they were added, and names can be changes as below:

[5]:
name_list=[]
for i in range(16):
    name_list.append('S'+ str(i))

# pass name list to MSData
ms_data.names=name_list
ms_data
[5]:
ms_data: {'S0': (482, 13668), 'S1': (549, 13668), 'S2': (598, 13668), 'S3': (713, 13668), 'S4': (744, 13668), 'S5': (815, 13668), 'S6': (925, 13668), 'S7': (1272, 13668), 'S8': (1263, 13668), 'S9': (1248, 13668), 'S10': (1039, 13668), 'S11': (1260, 13668), 'S12': (959, 13668), 'S13': (1078, 13668), 'S14': (1240, 13668), 'S15': (1110, 13668)}
num_slice: 16
names: ['S0', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9', 'S10', 'S11', 'S12', 'S13', 'S14', 'S15']
obs: []
var: []
relationship: other
var_type: intersect to 0
mss: []

And sample names can be reset using ms_data.reset_name().

[6]:
ms_data.reset_name()
ms_data
[6]:
ms_data: {'0': (482, 13668), '1': (549, 13668), '2': (598, 13668), '3': (713, 13668), '4': (744, 13668), '5': (815, 13668), '6': (925, 13668), '7': (1272, 13668), '8': (1263, 13668), '9': (1248, 13668), '10': (1039, 13668), '11': (1260, 13668), '12': (959, 13668), '13': (1078, 13668), '14': (1240, 13668), '15': (1110, 13668)}
num_slice: 16
names: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15']
obs: []
var: []
relationship: other
var_type: intersect to 0
mss: []

The relationship means literally the correlation between/among multi samples, which is default to other and could be changed simply.

[7]:
ms_data.relationship='continuous'

Note

ms_data.integrate() is necessarily to be performed after data loading. Default method is intersect, which means to take the intersection of genes (var) for subsequent multi-sample analysis. After integration, _var_type shows the intersect gene number from 0 to 13668. Otherwise here also provide union method.

[8]:
ms_data.integrate()
ms_data
[8]:
ms_data: {'0': (482, 13668), '1': (549, 13668), '2': (598, 13668), '3': (713, 13668), '4': (744, 13668), '5': (815, 13668), '6': (925, 13668), '7': (1272, 13668), '8': (1263, 13668), '9': (1248, 13668), '10': (1039, 13668), '11': (1260, 13668), '12': (959, 13668), '13': (1078, 13668), '14': (1240, 13668), '15': (1110, 13668)}
num_slice: 16
names: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15']
obs: ['batch']
var: []
relationship: continuous
var_type: intersect to 13668
mss: []

Preprocessing#

Quality control#

scope and mode are crucial parameters in basic multi-sample analysis and correlated funcitons.

  • scope , similar to list, means which samples used for analysis.

  • mode, like a switch, shows that analysis is performed on single sample or multi samples, integrate and isolated. It is easy to distinguish processing modes.

[9]:
ms_data.tl.cal_qc(scope=slice_generator[:],mode='integrate')
ms_data
[2023-07-10 22:31:34][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run cal_qc
[2023-07-10 22:31:34][Stereo][18436][MainThread][8820][st_pipeline][37][INFO]: start to run cal_qc...
[2023-07-10 22:31:35][Stereo][18436][MainThread][8820][st_pipeline][40][INFO]: cal_qc end, consume time 0.0970s.
[9]:
ms_data: {'0': (482, 13668), '1': (549, 13668), '2': (598, 13668), '3': (713, 13668), '4': (744, 13668), '5': (815, 13668), '6': (925, 13668), '7': (1272, 13668), '8': (1263, 13668), '9': (1248, 13668), '10': (1039, 13668), '11': (1260, 13668), '12': (959, 13668), '13': (1078, 13668), '14': (1240, 13668), '15': (1110, 13668)}
num_slice: 16
names: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15']
obs: ['batch', 'total_counts', 'n_genes_by_counts', 'pct_counts_mt']
var: ['n_cells', 'n_counts', 'mean_umi']
relationship: continuous
var_type: intersect to 13668
mss: []
[10]:
ms_data.plt.violin()
[2023-07-10 22:31:35][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run violin
[10]:
../_images/Tutorials%28Multi-sample%29_MSData_Clustering_23_3.png

QC results of previous step are saved into MSData. When you want to calculate them on several certain samples, work as below:

[11]:
ms_data.tl.cal_qc(scope=slice_generator[0:2],mode='isolated')
[2023-07-10 22:31:35][Stereo][18436][Thread-20][2712][ms_pipeline][138][INFO]: data_obj(idx=0) in ms_data start to run cal_qc
[2023-07-10 22:31:35][Stereo][18436][Thread-20][2712][st_pipeline][37][INFO]: start to run cal_qc...
[2023-07-10 22:31:35][Stereo][18436][Thread-21][21272][ms_pipeline][138][INFO]: data_obj(idx=1) in ms_data start to run cal_qc
[2023-07-10 22:31:35][Stereo][18436][Thread-21][21272][st_pipeline][37][INFO]: start to run cal_qc...
[2023-07-10 22:31:35][Stereo][18436][Thread-20][2712][st_pipeline][40][INFO]: cal_qc end, consume time 0.0750s.
[2023-07-10 22:31:35][Stereo][18436][Thread-21][21272][st_pipeline][40][INFO]: cal_qc end, consume time 0.0410s.
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    0.0s finished

Works run in parallel on each selected data object and the corresponding result is also updated into itself.

[12]:
ms_data[0]
[12]:
AnnData object with n_obs × n_vars = 482 × 13668
    obs: 'slice_ID', 'raw_x', 'raw_y', 'new_x', 'new_y', 'new_z', 'annotation', 'batch', 'total_counts', 'n_genes_by_counts', 'pct_counts_mt'
    var: 'n_cells', 'n_counts', 'mean_umi'
    uns: 'bin_type', 'bin_size', 'sn'
    obsm: 'X_umap', 'spatial', 'spatial_elas', 'spatial_rigid'
    layers: 'raw_counts'
[13]:
ms_data[0].plt.violin()
[13]:
../_images/Tutorials%28Multi-sample%29_MSData_Clustering_28_2.png

Show the spatial scatter figure of QC distribution. In multi-sample analysis, serveral parameters are added here for better visual presentation.

[14]:
ms_data.plt.spatial_scatter(
            scope=slice_generator[:],
            mode='integrate',
            plotting_scale_width=10,          # the width of scale
            reorganize_coordinate=4,          # the number of plots in each row
            horizontal_offset_additional=20,  # adjustment for horizontal distance
            vertical_offset_additional=20     # adjustment for vertical distance
            )
[2023-07-10 22:31:36][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run spatial_scatter
[14]:
../_images/Tutorials%28Multi-sample%29_MSData_Clustering_30_3.png

Filtering#

Three basic methods are provided to filter data maxtrix:

  • data.tl.filter_cells,

  • data.tl.filter_genes,

  • data.tl.filter_coordinates.

Literally, you could filter data on three optional levels: cell, gene and coordinate. Filter data based on quality control indicators which have been calculated in QC part.

Note

Demo datas used here are all elaborately processed beforehand, so that filtering and normalization will not performed in this tutorial.

[15]:
# ms_data.tl.filter_cells(min_gene=20, min_n_genes_by_counts=3, pct_counts_mt=5, scope=slice_generator[:], mode='integrate', inplace=True)

We strongly suggest to use self.raw to record the raw gene expression matrix which has been gone through basic processing, as an essential data set for subsequent differential testing and multiple analysis. When you want to get raw data, just run ms_data.tl.reset_raw_data().

[16]:
ms_data.tl.raw_checkpoint()
ms_data.tl.raw
[2023-07-10 22:31:37][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run raw_checkpoint
[16]:
<function stereo.core.ms_pipeline.MSDataPipeLine.__getattr__.<locals>.temp(*args, **kwargs)>

Normalization#

In this module, you can choose from following common methods of standardization:

Note

If the parameter inplace is set to True by default, expression matrix data will be replaced by the corresponding result (here replaced by the normalized result), otherwise unchanged.

Run a combination method of normalize_total and log1p to normalize gene expression matrix as below:

[17]:
# ms_data.tl.normalize_total(target_sum=10000)
# ms_data.tl.log1p()

If you use ms_data.tl.sctransform which includes the function of finding highly variable genes, you do not need to run ms_data.tl.highly_variable_genes. In the subsequent ms_data.tl.pca method, the parameter use_highly_genes has to be set as False. In brief, whether to use highly variable genes to run PCA depends on filter_hvgs in the normalization of scTransform. Learn more about scTransform.

Highly variable genes#

Identify highly variable genes in cells.

[18]:
ms_data.tl.highly_variable_genes(
        min_mean=0.0125,
        max_mean=3,
        min_disp=0.5,
        n_top_genes=2000,
        res_key='highly_variable_genes',
        scope=slice_generator[:],
        mode='integrate'
        )
[2023-07-10 22:31:37][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run highly_variable_genes
[2023-07-10 22:31:37][Stereo][18436][MainThread][8820][st_pipeline][37][INFO]: start to run highly_variable_genes...
[2023-07-10 22:31:38][Stereo][18436][MainThread][8820][st_pipeline][40][INFO]: highly_variable_genes end, consume time 0.4050s.
[19]:
# remember to choose a res_key when plot
ms_data.plt.highly_variable_genes(res_key='highly_variable_genes')
[2023-07-10 22:31:38][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run highly_variable_genes
[19]:
../_images/Tutorials%28Multi-sample%29_MSData_Clustering_46_3.png

Scale each gene to unit variance. Clip values exceeding standard deviation 10. If data.tl.scale(zero_center=False) is used, sparse matrix will be used for calculation, which can greatly reduce the memory required for running.

[20]:
# ms_data.tl.scale(max_value=10, zero_center=True)

Embedding#

PCA (Principal component analysis)#

As a statistical technique for reducing dimensionality of a data set, PCA finds the max axes of greatest variation, which preserve as much information as possible. Notice that if set parameter use_highly_genes to True, only highly variable genes are used to run.

[21]:
ms_data.tl.pca(
        use_highly_genes=False,
        n_pcs=30,
        res_key='pca',
        scope=slice_generator[:],
        mode='integrate'
        )
[2023-07-10 22:31:39][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run pca
[2023-07-10 22:31:39][Stereo][18436][MainThread][8820][st_pipeline][37][INFO]: start to run pca...
[2023-07-10 22:31:39][Stereo][18436][MainThread][8820][dim_reduce][77][WARNING]: svd_solver: auto can not be used with sparse input.
Use "arpack" (the default) instead.
[2023-07-10 22:31:44][Stereo][18436][MainThread][8820][st_pipeline][40][INFO]: pca end, consume time 5.0780s.
[22]:
ms_data
[22]:
ms_data: {'0': (482, 13668), '1': (549, 13668), '2': (598, 13668), '3': (713, 13668), '4': (744, 13668), '5': (815, 13668), '6': (925, 13668), '7': (1272, 13668), '8': (1263, 13668), '9': (1248, 13668), '10': (1039, 13668), '11': (1260, 13668), '12': (959, 13668), '13': (1078, 13668), '14': (1240, 13668), '15': (1110, 13668)}
num_slice: 16
names: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15']
obs: ['batch', 'total_counts', 'n_genes_by_counts', 'pct_counts_mt']
var: ['n_cells', 'n_counts', 'mean_umi', 'hvgs']
relationship: continuous
var_type: intersect to 13668
mss: ["scope_[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]:['highly_variable_genes', 'pca']"]

Neighborhood graph#

After PCA, we compute the neighborhood graph of cells using the PCA representation of the expression matrix.

[23]:
ms_data.tl.neighbors(
        pca_res_key='pca',
        n_pcs=30,
        res_key='neighbors',
        scope=slice_generator[:],
        mode='integrate'
        )

# compute spatial neighbors
# ms_data.tl.spatial_neighbors(
#        neighbors_res_key='neighbors',
#        res_key='spatial_neighbors',
#        scope=slice_generator[:],
#        mode='integrate'
#        )
[2023-07-10 22:31:44][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run neighbors
[2023-07-10 22:31:44][Stereo][18436][MainThread][8820][st_pipeline][37][INFO]: start to run neighbors...
[2023-07-10 22:32:14][Stereo][18436][MainThread][8820][st_pipeline][40][INFO]: neighbors end, consume time 30.4820s.

In addition, we also provide ms_data.tl.spatial_neighbors to compute a spatial neighbors graph.

UMAP#

It’s strongly to suggest embedding the graph in two dimensions using UMAP.

[24]:
ms_data.tl.umap(
            pca_res_key='pca',
            neighbors_res_key='neighbors',
            res_key='umap',
            scope=slice_generator[:],
            mode='integrate'
            )
[2023-07-10 22:32:14][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run umap
[2023-07-10 22:32:14][Stereo][18436][MainThread][8820][st_pipeline][37][INFO]: start to run umap...
        completed  0  /  200 epochs
        completed  20  /  200 epochs
        completed  40  /  200 epochs
        completed  60  /  200 epochs
        completed  80  /  200 epochs
        completed  100  /  200 epochs
        completed  120  /  200 epochs
        completed  140  /  200 epochs
        completed  160  /  200 epochs
        completed  180  /  200 epochs
[2023-07-10 22:32:26][Stereo][18436][MainThread][8820][st_pipeline][40][INFO]: umap end, consume time 11.4150s.

Clustering#

Currently we provide three common clustering methods, including Leiden, Louvain and Phenograph.

In this tool, you can re-run the normalization method before clustering if the parameter normalize_method is not None. Then by default, we perform PCA to reduce the dimensionalites of the new normalization result, and use top 30 pcs to run clustering.

Leiden#

Simply run:

[25]:
ms_data.tl.leiden(
            neighbors_res_key='neighbors',
            res_key='leiden',
            scope=slice_generator[:],
            mode='integrate'
            )
[2023-07-10 22:32:26][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run leiden
[2023-07-10 22:32:26][Stereo][18436][MainThread][8820][st_pipeline][37][INFO]: start to run leiden...
[2023-07-10 22:32:29][Stereo][18436][MainThread][8820][st_pipeline][40][INFO]: leiden end, consume time 2.8971s.

Show the spatial distribution of UMAP and Leiden clustering.

[26]:
ms_data.plt.umap(
            cluster_key='leiden',
            res_key='umap',
            scope=slice_generator[:],
            mode='integrate'
            )
[2023-07-10 22:32:29][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run umap
[26]:
../_images/Tutorials%28Multi-sample%29_MSData_Clustering_68_3.png
[27]:
ms_data.plt.cluster_scatter(
            res_key='leiden',
            scope=slice_generator[:],
            mode='integrate',
            plotting_scale_width=10,          # the width of scale
            reorganize_coordinate=4,          # the number of plots in each row
            horizontal_offset_additional=20,  # adjustment for horizontal distance
            vertical_offset_additional=20     # adjustment for vertical distance
            )
[2023-07-10 22:32:30][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run cluster_scatter
[27]:
../_images/Tutorials%28Multi-sample%29_MSData_Clustering_69_3.png

Subset of samples#

When it comes to analysis the subset of MSData, a simple case is shown below, using only the first four pieces ([0:4]) of data.

[28]:
# hvg
ms_data.tl.highly_variable_genes(
        min_mean=0.0125,
        max_mean=3,
        min_disp=0.5,
        n_top_genes=2000,
        res_key='highly_variable_genes_0123',
        scope=slice_generator[:4],
        mode='integrate'
        )
# pca
ms_data.tl.pca(
        use_highly_genes=False,
        n_pcs=30,
        res_key='pca_0123',
        scope=slice_generator[:4],
        mode='integrate'
        )

# neighbors
ms_data.tl.neighbors(
        pca_res_key='pca_0123',
        n_pcs=30,
        res_key='neighbors_0123',
        scope=slice_generator[:4],
        mode='integrate'
        )

# umap
ms_data.tl.umap(
            pca_res_key='pca_0123',
            neighbors_res_key='neighbors_0123',
            res_key='umap_0123',
            scope=slice_generator[:4],
            mode='integrate'
            )

# leiden
ms_data.tl.leiden(
            neighbors_res_key='neighbors_0123',
            res_key='leiden_0123',
            scope=slice_generator[:4],
            mode='integrate'
            )

# clustering plot
ms_data.plt.cluster_scatter(
            res_key='leiden_0123',
            scope=slice_generator[:4],
            mode='integrate',
            plotting_scale_width=10,          # the width of scale
            reorganize_coordinate=2,          # the number of plots in each row
            horizontal_offset_additional=20,  # adjustment for horizontal distance
            vertical_offset_additional=20     # adjustment for vertical distance
            )
[2023-07-10 22:32:30][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run highly_variable_genes
[2023-07-10 22:32:30][Stereo][18436][MainThread][8820][st_pipeline][37][INFO]: start to run highly_variable_genes...
[2023-07-10 22:32:31][Stereo][18436][MainThread][8820][st_pipeline][40][INFO]: highly_variable_genes end, consume time 0.2230s.
[2023-07-10 22:32:31][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run pca
[2023-07-10 22:32:31][Stereo][18436][MainThread][8820][st_pipeline][37][INFO]: start to run pca...
[2023-07-10 22:32:31][Stereo][18436][MainThread][8820][dim_reduce][77][WARNING]: svd_solver: auto can not be used with sparse input.
Use "arpack" (the default) instead.
[2023-07-10 22:32:33][Stereo][18436][MainThread][8820][st_pipeline][40][INFO]: pca end, consume time 2.3720s.
[2023-07-10 22:32:33][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run neighbors
[2023-07-10 22:32:33][Stereo][18436][MainThread][8820][st_pipeline][37][INFO]: start to run neighbors...
[2023-07-10 22:32:35][Stereo][18436][MainThread][8820][st_pipeline][40][INFO]: neighbors end, consume time 2.0860s.
[2023-07-10 22:32:35][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run umap
[2023-07-10 22:32:35][Stereo][18436][MainThread][8820][st_pipeline][37][INFO]: start to run umap...
        completed  0  /  500 epochs
        completed  50  /  500 epochs
        completed  100  /  500 epochs
        completed  150  /  500 epochs
        completed  200  /  500 epochs
        completed  250  /  500 epochs
        completed  300  /  500 epochs
        completed  350  /  500 epochs
        completed  400  /  500 epochs
        completed  450  /  500 epochs
[2023-07-10 22:32:42][Stereo][18436][MainThread][8820][st_pipeline][40][INFO]: umap end, consume time 6.5130s.
[2023-07-10 22:32:42][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run leiden
[2023-07-10 22:32:42][Stereo][18436][MainThread][8820][st_pipeline][37][INFO]: start to run leiden...
[2023-07-10 22:32:42][Stereo][18436][MainThread][8820][pipeline_utils][18][WARNING]:
            The function cell_cluster_to_gene_exp_cluster must be based on raw data.
            Please run data.tl.raw_checkpoint() before Normalization.

[2023-07-10 22:32:42][Stereo][18436][MainThread][8820][st_pipeline][40][INFO]: leiden end, consume time 0.1910s.
[2023-07-10 22:32:43][Stereo][18436][MainThread][8820][ms_pipeline][106][INFO]: data_obj(idx=0) in ms_data start to run cluster_scatter
[28]:
../_images/Tutorials%28Multi-sample%29_MSData_Clustering_72_5.png

Output file#

Save your data into a HDF5 file, including metadata of each sample object and MSData. You could get it back using st.io.read_h5ms().

[29]:
st.io.write_h5ms(ms_data,output='./test.h5ms')

To integrate#

When dataset has been annotated indiviually in isolated mode, you could pass cell type prediction from single sample to multi-sample for subsequenty analysis.

[30]:
ms_data[0]
[30]:
AnnData object with n_obs × n_vars = 482 × 13668
    obs: 'slice_ID', 'raw_x', 'raw_y', 'new_x', 'new_y', 'new_z', 'annotation', 'batch', 'total_counts', 'n_genes_by_counts', 'pct_counts_mt'
    var: 'n_cells', 'n_counts', 'mean_umi'
    uns: 'bin_type', 'bin_size', 'sn'
    obsm: 'X_umap', 'spatial', 'spatial_elas', 'spatial_rigid'
    layers: 'raw_counts'
[31]:
ms_data[1]
[31]:
AnnData object with n_obs × n_vars = 549 × 13668
    obs: 'slice_ID', 'raw_x', 'raw_y', 'new_x', 'new_y', 'new_z', 'annotation', 'batch', 'total_counts', 'n_genes_by_counts', 'pct_counts_mt'
    var: 'n_cells', 'n_counts', 'mean_umi'
    uns: 'bin_type', 'bin_size', 'sn'
    obsm: 'X_umap', 'spatial', 'spatial_elas', 'spatial_rigid'
    layers: 'raw_counts'

Obstain annotation information in obs from each slice, and pass them to MSData with on scope[0,1] with new names id. Note that NaN will be used to fill the rest of id column by default.

[32]:
ms_data.to_integrate(scope=slice_generator[0:2],res_key='id',_from=slice_generator[0:2],type='obs',item=['annotation','annotation'])

Check the annotation passed to MSData.

[34]:
ms_data.obs
[34]:
id batch total_counts n_genes_by_counts pct_counts_mt
E14-16h_a_S01_20500x62780-0-0 CNS 0 723.957825 550 0.0
E14-16h_a_S01_20500x62800-0-0 CNS 0 722.087219 573 0.0
E14-16h_a_S01_20500x62820-0-0 epidermis 0 768.790100 666 0.0
E14-16h_a_S01_20500x62840-0-0 CNS 0 806.373718 794 0.0
E14-16h_a_S01_20500x62860-0-0 CNS 0 818.378723 815 0.0
... ... ... ... ... ...
E14-16h_a_S16_61760x79320-15-15 NaN 15 658.930420 423 0.0
E14-16h_a_S16_61760x79340-15-15 NaN 15 643.020813 410 0.0
E14-16h_a_S16_61760x79360-15-15 NaN 15 643.981323 411 0.0
E14-16h_a_S16_61760x79380-15-15 NaN 15 613.082153 379 0.0
E14-16h_a_S16_61760x79400-15-15 NaN 15 626.735046 391 0.0

15295 rows × 5 columns