Batch QC#
This part shows how to perform quality control on batch effection.
As genomic sequencing technology develops, multi-batch joint analysis of gene expression data can maximize the scientific value in the data set, supporting researchers in discovering more significant biological topics. However, joint analysis usually misses batch effects caused by abiotic deviations such as external noise in integrated data. When the batch effect makes an impact on the results of an experiment, the downstream analytical conclusion, if not the wrong experimental result, could have been incorrect.
As a result, prior to data integration and processing, the batch effects in the data should be thoroughly investigated. We developed the BatchEval Pipeline, which is suitable for massive data integration, for evaluating batch effects in data and provide examination conclusions. The BatchEval Pipeline analyses multiple batches of data from multiple perspectives to detect how it affects of batch effects on integrated data. BatchEval Pipeline generates an HTML webpage report output for the assessment findings, which includes a basic explanation of the data, statistical analysis, a batch effect metric score, and visualization. BatchEval Pipeline analyses integrated data from multiple perspectives to determine whether it has to be corrected in an additional step employing batch effect removal approaches and how to do so, which is essential for downstream analysis [Zhang23].
MSData construction#
Here we use demo data of olfactory bulb. Download the example data first.
[1]:
import stereo as st
from stereo.core.ms_data import MSData
from stereo.core.ms_pipeline import slice_generator
import warnings
warnings.filterwarnings('ignore')
# prepara for input directory
data1 = st.io.read_h5ad('./Demo_BatchQC/stereo_olfactory_bulb_ann.h5ad',bin_type='bins', bin_size=1)
data2 = st.io.read_h5ad('./Demo_BatchQC/visium_olfactory_bulb_ann.h5ad',bin_type='bins', bin_size=1)
# construct MSData object
ms_data = MSData(_relationship='continuous', _var_type='intersect')
ms_data += data1
ms_data += data2
After loading data, you need to perform integration first, the parameter space_between
represents the distance between each adjacent pair, which will be used for z-coordinate of each sample, default to 0
.
[4]:
ms_data.integrate()
ms_data
[4]:
ms_data: {'0': (1123, 27825), '1': (1184, 32285)}
num_slice: 2
names: ['0', '1']
obs: ['batch']
var: []
relationship: continuous
var_type: intersect to 26760
mss: []
Clustering#
[6]:
# preprocessing
ms_data.tl.cal_qc(scope=slice_generator[:],mode='integrate')
ms_data.tl.raw_checkpoint()
ms_data.tl.normalize_total(scope=slice_generator[:],mode='integrate')
ms_data.tl.log1p(scope=slice_generator[:],mode='integrate')
ms_data.tl.scale(scope=slice_generator[:],mode='integrate', zero_center=False, max_value=10)
# embedding
ms_data.tl.pca(scope=slice_generator[:],mode='integrate', use_highly_genes=False, n_pcs=50, res_key='pca')
ms_data.tl.neighbors(scope=slice_generator[:],mode='integrate', pca_res_key='pca', res_key='neighbors')
ms_data.tl.umap(scope=slice_generator[:],mode='integrate', pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap')
# clustering
ms_data.tl.leiden(scope=slice_generator[:],mode='integrate', neighbors_res_key='neighbors', res_key='leiden')
[2023-06-29 18:13:52][Stereo][31507][MainThread][140518837950272][ms_pipeline][105][INFO]: data_obj(idx=0) in ms_data start to run leiden
[2023-06-29 18:13:52][Stereo][31507][MainThread][140518837950272][st_pipeline][37][INFO]: start to run leiden...
[2023-06-29 18:13:52][Stereo][31507][MainThread][140518837950272][st_pipeline][40][INFO]: leiden end, consume time 0.4925s.
Batch evaluation#
Evaluate batch effection, and save the output report.
[7]:
ms_data.tl.batch_qc(scope=slice_generator[:],mode='integrate', cluster_res_key='leiden', report_path='./batch_qc', res_key='batch_qc')
[2023-06-29 18:13:54][Stereo][31507][MainThread][140518837950272][ms_pipeline][95][INFO]: register algorithm batch_qc to <class 'stereo.core.stereo_exp_data.StereoExpData'>-140516686761456
[2023-06-29 18:14:11][Stereo][31507][MainThread][140518837950272][classifier][136][INFO]: Model Training Finished!
[2023-06-29 18:14:11][Stereo][31507][MainThread][140518837950272][classifier][137][INFO]: Trained checkpoint file has been saved to ./batch_qc
[2023-06-29 18:15:46][Stereo][31507][MainThread][140518837950272][batchqc_raw][249][INFO]: 2023-06-29 18:15:46 The 'BatchQC_Report.html' has been saved to ./batch_qc
Display report#
[8]:
ms_data.plt.show_batch_qc_report(scope=slice_generator[:], mode='integrate', res_key='batch_qc')
[2023-06-29 18:15:46][Stereo][31507][MainThread][140518837950272][ms_pipeline][102][INFO]: register plot_func show_batch_qc_report to <class 'stereo.core.stereo_exp_data.StereoExpData'>-140516686761456
Batch Effect QC Report
Report By: root
Report Time: 2023-06-29 18:15:46
Data Information
Basic Describe
· Number of samples in each Batch and Condition
Batch0 | Batch1 | |
---|---|---|
Condition1 | 1123 | 0 |
Condition2 | 0 | 1184 |
Total | 1123 | 1184 |
Describe Note | 'condition' is the different control variables designed in the experiment. By default, a batch of data studies the same control variable. |
· Variation Analysis (F Test) of Each Data
N_Batch | N_Sample | F | P Value | F Ref (1, 2305) | |
---|---|---|---|---|---|
Score | 2.0000 | 2307.0000 | 273.6777 | 0.0000 | 3.8455 |
· K-S Test of Each Pair Batch
N-Samples | K-S Stat | P-Value | |
---|---|---|---|
Batch0-1 | 2307.0000 | 0.1571 | 0.0000 |
Describe Note | Kolmogorov-smirnov test (K-S Test) checks whether two data distributions are consistent. |
· Domain Variance Score
N_Batch | N_Sample | Train Size | Accept Rate | |
---|---|---|---|---|
Domain Variance | 2.0000 | 2307.0000 | 1614.0000 | 0.0000 |
· Measures of confounding between Batch and Condition
Pearson Correlation Coefficient | Cramer'S V Coefficient | |
---|---|---|
Confounding Coefficients | 0.8167 | 0.7074 |
Describe Note | 0=No Confounding, 1=Complete Confounding. The larger the value, the higher the correlation between data batch and experimental conditions. |
· Distribution Visualization
Kernel Distribution Curve Of Umicount Total | Cdf Curve Of Umicount Total |
Box-Plot Of Umicount Total | Umicount Scatter Plot For Each Spot |
Metric Score
Raw
· K Nearest Neighbor Batch Effects Test
Chi Mean | 95% P Value | Accept Rate | Reject Rate | |
---|---|---|---|---|
Score | 100.0000 | 0.0000 | 0.0000 | 1.0000 |
Describe Note | Local area Chi2 test. Local sample richness test. By default, UMAP coordinates are used to calculate region neighbors. default neighbors: 100 |
· Local Inverse Simpson's Index
Lisi Mean | 95%Ci Lower | 95%Ci Upper | |
---|---|---|---|
Batch | 1.0000 | 1.0000 | 1.0000 |
Leiden | 1.4006 | 1.3827 | 1.4185 |
Describe Note | By default, UMAP coordinates are used to calculate region neighbors, default n_neighbor: 100. For batch, a larger value indicates that data of different batches in a local area is mixed more evenly. |
· K Nearest Neighbor Similarity
Ksim Mean | Accept Rate | |
---|---|---|
Leiden | 0.0100 | 0.0000 |
Describe Note | The kSIM acceptance rate requires ground truth cell type information and measures whether the neighbors of a cell have the same cell type as it does. If a method overcorrects the batch effects, it will have a low kSIM acceptance rate. |
Visualization
· Sample
Heatmap | Joint | Umap-Batch |
Umap-Type | ||
· Q-Q Cross Fit
Batch0-1 |
· Q-Q Fit Norm & Histogram
Batch0 | Batch1 |
Summary
Score | Adaptive Threshold | Conclusion | |
---|---|---|---|
Result | 0.0000 | 0.0500 | Need to do batch effect removal. |