Batch Effect#
This part shows how to integrate data from different batches. Because integration works by adjusting the principal components, it should be performed after PCA.
Integration#
Reading data#
Two batches of experimental data stored in GEM files are used here; they are lasso results extracted from their original GEF files.
Please download our example data.
[2]:
import stereo as st
import warnings
warnings.filterwarnings('ignore')
input_file_1 = './SS200000132BR_A1.bin1.Lasso.gem.gz'
input_file_2 = './SS200000132BR_A2.bin1.Lasso.gem.gz'
data1 = st.io.read_gem(input_file_1)
data2 = st.io.read_gem(input_file_2)
Preprocessing#
Before integrating batches, we need to do basic preprocessing on each batch, including quality control and filtering.
[3]:
data1.tl.cal_qc()
data2.tl.cal_qc()
Draw scatter plots to observe the distribution and set appropriate filtering parameters. Filter cells or genes with data.tl.filter_cells or data.tl.filter_genes.
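The QC metrics behind these thresholds are simple per-cell statistics. As an illustration (a toy numpy sketch with made-up counts and thresholds, not Stereopy's implementation), filtering by genes-per-cell and mitochondrial percentage amounts to building a boolean mask over cells:

```python
import numpy as np

# Toy expression matrix: 5 cells x 4 genes (raw counts).
counts = np.array([
    [10,  0,  3,  2],
    [50, 20, 30, 10],
    [ 1,  0,  0,  0],
    [ 5,  5,  5,  5],
    [100, 80, 60, 40],
])
mt_counts = np.array([1, 2, 0, 3, 50])  # counts from mitochondrial genes per cell

n_genes_by_counts = (counts > 0).sum(axis=1)      # genes detected per cell
total_counts = counts.sum(axis=1)
pct_counts_mt = 100.0 * mt_counts / total_counts  # mitochondrial percentage

# Keep cells under both thresholds (analogous to max_n_genes_by_counts / pct_counts_mt).
keep = (n_genes_by_counts <= 4) & (pct_counts_mt <= 15.0)
filtered = counts[keep]  # the last cell is dropped for high mitochondrial content
```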
[4]:
# plot before filtering
data1.plt.genes_count()
[5]:
# plot before filtering
data2.plt.genes_count()
[6]:
# filter cells
data1.tl.filter_cells(max_n_genes_by_counts=4000, pct_counts_mt=5, inplace=True)
data2.tl.filter_cells(max_n_genes_by_counts=4500, pct_counts_mt=7, inplace=True)
[6]:
StereoExpData object with n_cells X n_genes = 10516 X 24966
bin_type: bins
bin_size: 100
offset_x = 3975
offset_y = 8625
cells: ['cell_name', 'total_counts', 'n_genes_by_counts', 'pct_counts_mt']
genes: ['gene_name']
Merging#
Merge all batches of data. The method supports any number of datasets.
[7]:
data = st.utils.data_helper.merge(data1, data2)
# check the shape of merged data
data.shape
[7]:
(21375, 23750)
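Conceptually, merging stacks the cells of all batches over a shared gene axis. In this toy pandas sketch, cells are concatenated and the gene set is taken as the union, with genes missing from a batch filled as zero; the actual merge implementation may handle genes differently:

```python
import pandas as pd

# Two toy batches with partially overlapping gene sets (cells x genes).
batch1 = pd.DataFrame({'gA': [1, 2], 'gB': [3, 4]}, index=['b1_c1', 'b1_c2'])
batch2 = pd.DataFrame({'gB': [5], 'gC': [6]}, index=['b2_c1'])

# Concatenate cells; genes absent from a batch become 0.
merged = pd.concat([batch1, batch2]).fillna(0)
merged.shape  # 3 cells x 3 genes (union of gA, gB, gC)
```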
Normalization#
Normalization should be finished before PCA. We usually run a combination of normalize_total and log1p.
[8]:
# Since normalization will change the expression matrix, save the raw data beforehand if needed.
# data.tl.raw_checkpoint()
data.tl.normalize_total()
data.tl.log1p()
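Both steps are simple arithmetic on the count matrix: normalize_total rescales each cell to a common total, and log1p applies log(1 + x) to stabilize the variance of large counts. A numpy sketch of that arithmetic (the target sum of 10 is an arbitrary illustration value):

```python
import numpy as np

counts = np.array([[2., 0., 8.],
                   [1., 1., 2.]])  # 2 cells x 3 genes

# normalize_total: scale each cell so its counts sum to a common target.
target_sum = 10.0
normalized = counts / counts.sum(axis=1, keepdims=True) * target_sum

# log1p: natural log of (1 + x), applied element-wise.
logged = np.log1p(normalized)
```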
PCA#
[9]:
data.tl.pca(use_highly_genes=False, n_pcs=50, res_key='pca')
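PCA on the normalized matrix can be sketched with a plain SVD: center each gene, decompose, and keep the leading components as the per-cell embedding. A minimal numpy version on random data (illustrative only, not Stereopy's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))   # 100 cells x 6 genes (already normalized)

# Center each gene, then take the top principal components via SVD.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_pcs = 3
pcs = U[:, :n_pcs] * S[:n_pcs]  # cell embeddings in PC space
```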
Integrating#
Integrate the previously merged batches by adjusting the PCA result specified by pca_res_key, and remember that the corrected PCA result is stored under res_key.
[10]:
data.tl.batches_integrate(pca_res_key='pca', res_key='pca_integrated')
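The idea of integrating in PC space can be illustrated with a deliberately naive correction: remove each batch's mean offset from its embeddings. The actual method behind batches_integrate is more sophisticated than this sketch, but it shows the kind of adjustment being made to the PCA result:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two batches of PC embeddings with an artificial batch shift.
pcs_b1 = rng.normal(size=(50, 3))
pcs_b2 = rng.normal(size=(40, 3)) + 2.0   # systematically offset batch
pcs = np.vstack([pcs_b1, pcs_b2])
batch = np.array([0] * 50 + [1] * 40)

# Naive correction: subtract each batch's mean in PC space.
corrected = pcs.copy()
for b in np.unique(batch):
    mask = batch == b
    corrected[mask] -= corrected[mask].mean(axis=0)
```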
UMAP#
After integrating, we can perform UMAP based on the corrected PCA result.
[11]:
data.tl.neighbors(pca_res_key='pca_integrated', n_pcs=50, res_key='neighbors_integrated')
data.tl.umap(pca_res_key='pca_integrated', neighbors_res_key='neighbors_integrated', res_key='umap_integrated')
data.plt.batches_umap(res_key='umap_integrated')
completed 0 / 200 epochs
completed 20 / 200 epochs
completed 40 / 200 epochs
completed 60 / 200 epochs
completed 80 / 200 epochs
completed 100 / 200 epochs
completed 120 / 200 epochs
completed 140 / 200 epochs
completed 160 / 200 epochs
completed 180 / 200 epochs
[11]:
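The neighbors step computes, for each cell, its nearest neighbors in the (integrated) PC space; both UMAP and the Leiden clustering below build on that graph. A minimal brute-force numpy sketch of a k-nearest-neighbor graph (toy data, not the library's optimized implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
pcs = rng.normal(size=(20, 5))  # cell embeddings from (integrated) PCA

# For each cell, find the indices of its k closest cells by Euclidean distance.
k = 4
d = np.linalg.norm(pcs[:, None] - pcs[None, :], axis=-1)
np.fill_diagonal(d, np.inf)     # a cell is not its own neighbor
knn = np.argsort(d, axis=1)[:, :k]
```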
Clustering#
[12]:
data.tl.leiden(neighbors_res_key='neighbors_integrated', res_key='leiden')
data.plt.cluster_scatter(res_key='leiden')
[12]:
<AxesSubplot:>
Note
The following part intentionally shows the significant difference between the results with and without integration.
Without integration#
Compare this UMAP result with the previous one to inspect the effect of integration.
Embedding#
[13]:
data.tl.neighbors(pca_res_key='pca', n_pcs=50, res_key='neighbors', n_jobs=-1)
data.tl.umap(pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap')
data.plt.batches_umap(res_key='umap')
completed 0 / 200 epochs
completed 20 / 200 epochs
completed 40 / 200 epochs
completed 60 / 200 epochs
completed 80 / 200 epochs
completed 100 / 200 epochs
completed 120 / 200 epochs
completed 140 / 200 epochs
completed 160 / 200 epochs
completed 180 / 200 epochs
[13]:
Without integration, the two batches occupy clearly separated regions of the UMAP space instead of mixing.
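One way to make "the batches separate" quantitative is to check, for each cell, what fraction of its nearest neighbors come from the same batch: a score near 0.5 means the two batches are well mixed, while a score near 1.0 means they separate. A toy numpy sketch of such a score (an illustrative metric, not part of Stereopy):

```python
import numpy as np

rng = np.random.default_rng(2)
# Well-mixed vs. separated 2-D embeddings for two batches of 30 cells each.
mixed = rng.normal(size=(60, 2))
separated = np.vstack([rng.normal(size=(30, 2)),
                       rng.normal(size=(30, 2)) + 10.0])
batch = np.array([0] * 30 + [1] * 30)

def same_batch_fraction(emb, batch, k=10):
    # For each point, the fraction of its k nearest neighbors from the same batch.
    d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    return (batch[nn] == batch[:, None]).mean()
```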
Clustering#
We can also perform clustering for a further comparison.
[14]:
data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
[15]:
data.plt.cluster_scatter(res_key='leiden')
[15]:
<AxesSubplot:>