Batch Effect#
This part shows how to integrate data from different batches. Because integration works by adjusting the principal components, it should be performed after PCA.
Integration#
Reading data#
Two batches of experimental data stored in GEM files are used here; they are lasso results extracted from their original GEF files.
Please download our example data.
[2]:
import stereo as st
import warnings
warnings.filterwarnings('ignore')
input_file_1 = './SS200000132BR_A1.bin1.Lasso.gem.gz'
input_file_2 = './SS200000132BR_A2.bin1.Lasso.gem.gz'
data1 = st.io.read_gem(input_file_1)
data2 = st.io.read_gem(input_file_2)
Preprocessing#
Before integrating batches, we need to do basic preprocessing on each batch, including quality control and filtering.
[3]:
data1.tl.cal_qc()
data2.tl.cal_qc()
Draw scatter plots to observe the distribution and set appropriate filtering parameters. Filter cells or genes with data.tl.filter_cells or data.tl.filter_genes.
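The QC metrics behind these thresholds are simple per-cell statistics. As an illustration (a toy numpy sketch with made-up counts and thresholds, not Stereopy's implementation), filtering by genes-per-cell and mitochondrial percentage amounts to building a boolean mask over cells:

```python
import numpy as np

# Toy expression matrix: 5 cells x 4 genes (raw counts).
counts = np.array([
    [10,  0,  3,  2],
    [50, 20, 30, 10],
    [ 1,  0,  0,  0],
    [ 5,  5,  5,  5],
    [100, 80, 60, 40],
])
mt_counts = np.array([1, 2, 0, 3, 50])  # counts from mitochondrial genes per cell

n_genes_by_counts = (counts > 0).sum(axis=1)      # genes detected per cell
total_counts = counts.sum(axis=1)
pct_counts_mt = 100.0 * mt_counts / total_counts  # mitochondrial percentage

# Keep cells under both thresholds (analogous to max_n_genes_by_counts / pct_counts_mt).
keep = (n_genes_by_counts <= 4) & (pct_counts_mt <= 15.0)
filtered = counts[keep]  # the last cell is dropped for high mitochondrial content
```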
[4]:
# plot before filtering
data1.plt.genes_count()
[5]:
# plot before filtering
data2.plt.genes_count()
[6]:
# filter cells
data1.tl.filter_cells(max_n_genes_by_counts=4000, pct_counts_mt=5, inplace=True)
data2.tl.filter_cells(max_n_genes_by_counts=4500, pct_counts_mt=7, inplace=True)
[6]:
StereoExpData object with n_cells X n_genes = 10516 X 24966
bin_type: bins
bin_size: 100
offset_x = 3975
offset_y = 8625
cells: ['cell_name', 'total_counts', 'n_genes_by_counts', 'pct_counts_mt']
genes: ['gene_name']
Merging#
Merge all batches of data. The method supports any number of datasets.
[7]:
data = st.utils.data_helper.merge(data1, data2)
# check the shape of merged data
data.shape
[7]:
(21375, 23750)
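Conceptually, merging stacks the cells of all batches over a shared gene axis. In this toy pandas sketch, cells are concatenated and the gene set is taken as the union, with genes missing from a batch filled as zero; the actual merge implementation may handle genes differently:

```python
import pandas as pd

# Two toy batches with partially overlapping gene sets (cells x genes).
batch1 = pd.DataFrame({'gA': [1, 2], 'gB': [3, 4]}, index=['b1_c1', 'b1_c2'])
batch2 = pd.DataFrame({'gB': [5], 'gC': [6]}, index=['b2_c1'])

# Concatenate cells; genes absent from a batch become 0.
merged = pd.concat([batch1, batch2]).fillna(0)
merged.shape  # 3 cells x 3 genes (union of gA, gB, gC)
```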
Normalization#
Normalization should be finished before PCA. We usually run a combination of normalize_total and log1p.
[8]:
# Since normalization will change the expression matrix, save the raw data beforehand if needed.
# data.tl.raw_checkpoint()
data.tl.normalize_total()
data.tl.log1p()
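Both steps are simple arithmetic on the count matrix: normalize_total rescales each cell to a common total, and log1p applies log(1 + x) to stabilize the variance of large counts. A numpy sketch of that arithmetic (the target sum of 10 is an arbitrary illustration value):

```python
import numpy as np

counts = np.array([[2., 0., 8.],
                   [1., 1., 2.]])  # 2 cells x 3 genes

# normalize_total: scale each cell so its counts sum to a common target.
target_sum = 10.0
normalized = counts / counts.sum(axis=1, keepdims=True) * target_sum

# log1p: natural log of (1 + x), applied element-wise.
logged = np.log1p(normalized)
```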
PCA#
[9]:
data.tl.pca(use_highly_genes=False, n_pcs=50, res_key='pca')
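PCA on the normalized matrix can be sketched with a plain SVD: center each gene, decompose, and keep the leading components as the per-cell embedding. A minimal numpy version on random data (illustrative only, not Stereopy's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))   # 100 cells x 6 genes (already normalized)

# Center each gene, then take the top principal components via SVD.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_pcs = 3
pcs = U[:, :n_pcs] * S[:n_pcs]  # cell embeddings in PC space
```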
Integrating#
Integrate the previously merged batches by adjusting the PCA result specified by pca_res_key, and remember that the corrected PCA result is stored under res_key.
[10]:
data.tl.batches_integrate(pca_res_key='pca', res_key='pca_integrated')
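The idea of integrating in PC space can be illustrated with a deliberately naive correction: remove each batch's mean offset from its embeddings. The actual method behind batches_integrate is more sophisticated than this sketch, but it shows the kind of adjustment being made to the PCA result:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two batches of PC embeddings with an artificial batch shift.
pcs_b1 = rng.normal(size=(50, 3))
pcs_b2 = rng.normal(size=(40, 3)) + 2.0   # systematically offset batch
pcs = np.vstack([pcs_b1, pcs_b2])
batch = np.array([0] * 50 + [1] * 40)

# Naive correction: subtract each batch's mean in PC space.
corrected = pcs.copy()
for b in np.unique(batch):
    mask = batch == b
    corrected[mask] -= corrected[mask].mean(axis=0)
```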
UMAP#
After integrating, we can perform UMAP based on the corrected PCA result.
[11]:
data.tl.neighbors(pca_res_key='pca_integrated', n_pcs=50, res_key='neighbors_integrated')
data.tl.umap(pca_res_key='pca_integrated', neighbors_res_key='neighbors_integrated', res_key='umap_integrated')
data.plt.batches_umap(res_key='umap_integrated')
completed 0 / 200 epochs
completed 20 / 200 epochs
completed 40 / 200 epochs
completed 60 / 200 epochs
completed 80 / 200 epochs
completed 100 / 200 epochs
completed 120 / 200 epochs
completed 140 / 200 epochs
completed 160 / 200 epochs
completed 180 / 200 epochs
[11]:
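The neighbors step computes, for each cell, its nearest neighbors in the (integrated) PC space; both UMAP and the Leiden clustering below build on that graph. A minimal brute-force numpy sketch of a k-nearest-neighbor graph (toy data, not the library's optimized implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
pcs = rng.normal(size=(20, 5))  # cell embeddings from (integrated) PCA

# For each cell, find the indices of its k closest cells by Euclidean distance.
k = 4
d = np.linalg.norm(pcs[:, None] - pcs[None, :], axis=-1)
np.fill_diagonal(d, np.inf)     # a cell is not its own neighbor
knn = np.argsort(d, axis=1)[:, :k]
```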
Clustering#
[12]:
data.tl.leiden(neighbors_res_key='neighbors_integrated', res_key='leiden')
data.plt.cluster_scatter(res_key='leiden')
[12]:
<AxesSubplot:>
Note
The following part intentionally shows the significant difference between the results with and without integration.
Without integration#
Compare this UMAP result with the previous one to inspect the effect of integration.
Embedding#
[13]:
data.tl.neighbors(pca_res_key='pca', n_pcs=50, res_key='neighbors', n_jobs=-1)
data.tl.umap(pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap')
data.plt.batches_umap(res_key='umap')
completed 0 / 200 epochs
completed 20 / 200 epochs
completed 40 / 200 epochs
completed 60 / 200 epochs
completed 80 / 200 epochs
completed 100 / 200 epochs
completed 120 / 200 epochs
completed 140 / 200 epochs
completed 160 / 200 epochs
completed 180 / 200 epochs
[13]:
Without integration, the two batches occupy clearly separated regions of the UMAP space instead of mixing.
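One way to make "the batches separate" quantitative is to check, for each cell, what fraction of its nearest neighbors come from the same batch: a score near 0.5 means the two batches are well mixed, while a score near 1.0 means they separate. A toy numpy sketch of such a score (an illustrative metric, not part of Stereopy):

```python
import numpy as np

rng = np.random.default_rng(2)
# Well-mixed vs. separated 2-D embeddings for two batches of 30 cells each.
mixed = rng.normal(size=(60, 2))
separated = np.vstack([rng.normal(size=(30, 2)),
                       rng.normal(size=(30, 2)) + 10.0])
batch = np.array([0] * 30 + [1] * 30)

def same_batch_fraction(emb, batch, k=10):
    # For each point, the fraction of its k nearest neighbors from the same batch.
    d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    return (batch[nn] == batch[:, None]).mean()
```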
Clustering#
We can also perform clustering for a further comparison.
[14]:
data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
[15]:
data.plt.cluster_scatter(res_key='leiden')
[15]:
<AxesSubplot:>