Batch QC¶

This part shows how to perform quality control on batch effection.

As genomic sequencing technology develops, multi-batch joint analysis of gene expression data can maximize the scientific value in the data set, supporting researchers in discovering more significant biological topics. However, joint analysis usually misses batch effects caused by abiotic deviations such as external noise in integrated data. When the batch effect makes an impact on the results of an experiment, the downstream analytical conclusion, if not the wrong experimental result, could have been incorrect.

As a result, prior to data integration and processing, the batch effects in the data should be thoroughly investigated. We developed the BatchEval Pipeline, which is suitable for massive data integration, for evaluating batch effects in data and provide examination conclusions. The BatchEval Pipeline analyses multiple batches of data from multiple perspectives to detect how it affects of batch effects on integrated data. BatchEval Pipeline generates an HTML webpage report output for the assessment findings, which includes a basic explanation of the data, statistical analysis, a batch effect metric score, and visualization. BatchEval Pipeline analyses integrated data from multiple perspectives to determine whether it has to be corrected in an additional step employing batch effect removal approaches and how to do so, which is essential for downstream analysis [Zhang23].

This function can be run on GPU, if you want to use GPU, you need to create an environment according to the guide Clustering_by_GPU.

Note

Torch is the necessary dependency, we need to install it beforehand:

CPU: pip install torch==2.4.1+cpu --extra-index-url https://download.pytorch.org/whl

GPU(CUDA11): pip install torch==2.4.1+cu118 --extra-index-url https://download.pytorch.org/whl/

GPU(CUDA12): pip install torch==2.4.1+cu124 --extra-index-url https://download.pytorch.org/whl/

MSData construction¶

Here we use demo data of olfactory bulb. Download the example data first.

[1]:

import stereo as st
from stereo.core.ms_data import MSData
from stereo.core.ms_pipeline import slice_generator

import warnings
warnings.filterwarnings('ignore')

# prepara for input directory
data1 = st.io.read_h5ad('./Demo_BatchQC/stereo_olfactory_bulb_ann.h5ad',bin_type='bins', bin_size=1)
data2 = st.io.read_h5ad('./Demo_BatchQC/visium_olfactory_bulb_ann.h5ad',bin_type='bins', bin_size=1)

# construct MSData object
ms_data = MSData(_relationship='continuous', _var_type='intersect')

ms_data += data1
ms_data += data2

After loading data, you need to perform integration first, the parameter space_between represents the distance between each adjacent pair, which will be used for z-coordinate of each sample, default to 0.

[4]:

ms_data.integrate()
ms_data

[4]:

ms_data: {'0': (1123, 27825), '1': (1184, 32285)}
num_slice: 2
names: ['0', '1']
obs: ['batch']
var: []
relationship: continuous
var_type: intersect to 26760
mss: []

Clustering¶

[6]:

# preprocessing
ms_data.tl.cal_qc(scope=slice_generator[:],mode='integrate')
ms_data.tl.raw_checkpoint()

ms_data.tl.normalize_total(scope=slice_generator[:],mode='integrate')
ms_data.tl.log1p(scope=slice_generator[:],mode='integrate')
ms_data.tl.scale(scope=slice_generator[:],mode='integrate', zero_center=False, max_value=10)

# embedding
ms_data.tl.pca(scope=slice_generator[:],mode='integrate', use_highly_genes=False, n_pcs=50, res_key='pca')
ms_data.tl.neighbors(scope=slice_generator[:],mode='integrate', pca_res_key='pca', res_key='neighbors')
ms_data.tl.umap(scope=slice_generator[:],mode='integrate', pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap')

# clustering
ms_data.tl.leiden(scope=slice_generator[:],mode='integrate', neighbors_res_key='neighbors', res_key='leiden')

[2023-06-29 18:13:52][Stereo][31507][MainThread][140518837950272][ms_pipeline][105][INFO]: data_obj(idx=0) in ms_data start to run leiden
[2023-06-29 18:13:52][Stereo][31507][MainThread][140518837950272][st_pipeline][37][INFO]: start to run leiden...
[2023-06-29 18:13:52][Stereo][31507][MainThread][140518837950272][st_pipeline][40][INFO]: leiden end, consume time 0.4925s.

Batch evaluation¶

Evaluate batch effection, and save the output report.

[7]:

ms_data.tl.batch_qc(scope=slice_generator[:],mode='integrate', cluster_res_key='leiden', report_path='./batch_qc', res_key='batch_qc')

[2023-06-29 18:13:54][Stereo][31507][MainThread][140518837950272][ms_pipeline][95][INFO]: register algorithm batch_qc to <class 'stereo.core.stereo_exp_data.StereoExpData'>-140516686761456
[2023-06-29 18:14:11][Stereo][31507][MainThread][140518837950272][classifier][136][INFO]: Model Training Finished!
[2023-06-29 18:14:11][Stereo][31507][MainThread][140518837950272][classifier][137][INFO]: Trained checkpoint file has been saved to ./batch_qc
[2023-06-29 18:15:46][Stereo][31507][MainThread][140518837950272][batchqc_raw][249][INFO]: 2023-06-29 18:15:46 The 'BatchQC_Report.html' has been saved to ./batch_qc

Display report¶

[8]:

ms_data.plt.show_batch_qc_report(scope=slice_generator[:], mode='integrate', res_key='batch_qc')

[2023-06-29 18:15:46][Stereo][31507][MainThread][140518837950272][ms_pipeline][102][INFO]: register plot_func show_batch_qc_report to <class 'stereo.core.stereo_exp_data.StereoExpData'>-140516686761456

BatchQC-Raw

Batch Effect QC Report

Report By: root

Report Time: 2023-06-29 18:15:46

Data Information

Basic Describe

· Number of samples in each Batch and Condition

	Batch0	Batch1
Condition1	1123	0
Condition2	0	1184
Total	1123	1184
Describe Note	'condition' is the different control variables designed in the experiment. By default, a batch of data studies the same control variable.

· Variation Analysis (F Test) of Each Data

	N_Batch	N_Sample	F	P Value	F Ref (1, 2305)
Score	2.0000	2307.0000	273.6777	0.0000	3.8455

· K-S Test of Each Pair Batch

	N-Samples	K-S Stat	P-Value
Batch0-1	2307.0000	0.1571	0.0000
Describe Note	Kolmogorov-smirnov test (K-S Test) checks whether two data distributions are consistent.

· Domain Variance Score

	N_Batch	N_Sample	Train Size	Accept Rate
Domain Variance	2.0000	2307.0000	1614.0000	0.0000

· Measures of confounding between Batch and Condition

	Pearson Correlation Coefficient	Cramer'S V Coefficient
Confounding Coefficients	0.8167	0.7074
Describe Note	0=No Confounding, 1=Complete Confounding. The larger the value, the higher the correlation between data batch and experimental conditions.

· Distribution Visualization

Kernel Distribution Curve Of Umicount Total	Cdf Curve Of Umicount Total

Box-Plot Of Umicount Total	Umicount Scatter Plot For Each Spot

Metric Score

Raw

· K Nearest Neighbor Batch Effects Test

	Chi Mean	95% P Value	Accept Rate	Reject Rate
Score	100.0000	0.0000	0.0000	1.0000
Describe Note	Local area Chi2 test. Local sample richness test. By default, UMAP coordinates are used to calculate region neighbors. default neighbors: 100

· Local Inverse Simpson's Index

	Lisi Mean	95%Ci Lower	95%Ci Upper
Batch	1.0000	1.0000	1.0000
Leiden	1.4006	1.3827	1.4185
Describe Note	By default, UMAP coordinates are used to calculate region neighbors, default n_neighbor: 100. For batch, a larger value indicates that data of different batches in a local area is mixed more evenly.

· K Nearest Neighbor Similarity

	Ksim Mean	Accept Rate
Leiden	0.0100	0.0000
Describe Note	The kSIM acceptance rate requires ground truth cell type information and measures whether the neighbors of a cell have the same cell type as it does. If a method overcorrects the batch effects, it will have a low kSIM acceptance rate.

Visualization

· Sample

Heatmap	Joint	Umap-Batch

Umap-Type

· Q-Q Cross Fit

Batch0-1

· Q-Q Fit Norm & Histogram

Batch0	Batch1

Batch QC¶

MSData construction¶

Clustering¶

Batch evaluation¶

Display report¶

Batch Effect QC Report

Report By: root

Report Time: 2023-06-29 18:15:46

Data Information

Basic Describe

· Number of samples in each Batch and Condition

· Variation Analysis (F Test) of Each Data

· K-S Test of Each Pair Batch

· Domain Variance Score

· Measures of confounding between Batch and Condition

· Distribution Visualization

Metric Score

Raw

· K Nearest Neighbor Batch Effects Test

· Local Inverse Simpson's Index

· K Nearest Neighbor Similarity

Visualization

· Sample

· Q-Q Cross Fit

· Q-Q Fit Norm & Histogram

Summary

Score Adaptive Threshold Conclusion
Result 0.0000 0.0500 Need to do batch effect removal.

Batch QC¶

MSData construction¶

Clustering¶

Batch evaluation¶

Display report¶

Batch Effect QC Report

Report By: root

Report Time: 2023-06-29 18:15:46

Data Information

Basic Describe

· Number of samples in each Batch and Condition

· Variation Analysis (F Test) of Each Data

· K-S Test of Each Pair Batch

· Domain Variance Score

· Measures of confounding between Batch and Condition

· Distribution Visualization

Metric Score

Raw

· K Nearest Neighbor Batch Effects Test

· Local Inverse Simpson's Index

· K Nearest Neighbor Similarity

Visualization

· Sample

· Q-Q Cross Fit

· Q-Q Fit Norm & Histogram

Summary

ScoreAdaptive ThresholdConclusionResult0.00000.0500Need to do batch effect removal.

Score Adaptive Threshold Conclusion
Result 0.0000 0.0500 Need to do batch effect removal.