Batch QC

This part shows how to perform quality control on batch effection.

As genomic sequencing technology develops, multi-batch joint analysis of gene expression data can maximize the scientific value in the data set, supporting researchers in discovering more significant biological topics. However, joint analysis usually misses batch effects caused by abiotic deviations such as external noise in integrated data. When the batch effect makes an impact on the results of an experiment, the downstream analytical conclusion, if not the wrong experimental result, could have been incorrect.

As a result, prior to data integration and processing, the batch effects in the data should be thoroughly investigated. We developed the BatchEval Pipeline, which is suitable for massive data integration, for evaluating batch effects in data and provide examination conclusions. The BatchEval Pipeline analyses multiple batches of data from multiple perspectives to detect how it affects of batch effects on integrated data. BatchEval Pipeline generates an HTML webpage report output for the assessment findings, which includes a basic explanation of the data, statistical analysis, a batch effect metric score, and visualization. BatchEval Pipeline analyses integrated data from multiple perspectives to determine whether it has to be corrected in an additional step employing batch effect removal approaches and how to do so, which is essential for downstream analysis [Zhang23].

This function can be run on GPU, if you want to use GPU, you need to create an environment according to the guide Clustering_by_GPU.

Note

Torch is the necessary dependency, we need to install it beforehand:

CPU: pip install torch==2.4.1+cpu --extra-index-url https://download.pytorch.org/whl

GPU(CUDA11): pip install torch==2.4.1+cu118 --extra-index-url https://download.pytorch.org/whl/

GPU(CUDA12): pip install torch==2.4.1+cu124 --extra-index-url https://download.pytorch.org/whl/

MSData construction

Here we use demo data of olfactory bulb. Download the example data first.

[1]:
import stereo as st
from stereo.core.ms_data import MSData
from stereo.core.ms_pipeline import slice_generator

import warnings
warnings.filterwarnings('ignore')

# prepara for input directory
data1 = st.io.read_h5ad('./Demo_BatchQC/stereo_olfactory_bulb_ann.h5ad',bin_type='bins', bin_size=1)
data2 = st.io.read_h5ad('./Demo_BatchQC/visium_olfactory_bulb_ann.h5ad',bin_type='bins', bin_size=1)

# construct MSData object
ms_data = MSData(_relationship='continuous', _var_type='intersect')

ms_data += data1
ms_data += data2

After loading data, you need to perform integration first, the parameter space_between represents the distance between each adjacent pair, which will be used for z-coordinate of each sample, default to 0.

[4]:
ms_data.integrate()
ms_data
[4]:
ms_data: {'0': (1123, 27825), '1': (1184, 32285)}
num_slice: 2
names: ['0', '1']
obs: ['batch']
var: []
relationship: continuous
var_type: intersect to 26760
mss: []

Clustering

[6]:
# preprocessing
ms_data.tl.cal_qc(scope=slice_generator[:],mode='integrate')
ms_data.tl.raw_checkpoint()

ms_data.tl.normalize_total(scope=slice_generator[:],mode='integrate')
ms_data.tl.log1p(scope=slice_generator[:],mode='integrate')
ms_data.tl.scale(scope=slice_generator[:],mode='integrate', zero_center=False, max_value=10)

# embedding
ms_data.tl.pca(scope=slice_generator[:],mode='integrate', use_highly_genes=False, n_pcs=50, res_key='pca')
ms_data.tl.neighbors(scope=slice_generator[:],mode='integrate', pca_res_key='pca', res_key='neighbors')
ms_data.tl.umap(scope=slice_generator[:],mode='integrate', pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap')

# clustering
ms_data.tl.leiden(scope=slice_generator[:],mode='integrate', neighbors_res_key='neighbors', res_key='leiden')
[2023-06-29 18:13:52][Stereo][31507][MainThread][140518837950272][ms_pipeline][105][INFO]: data_obj(idx=0) in ms_data start to run leiden
[2023-06-29 18:13:52][Stereo][31507][MainThread][140518837950272][st_pipeline][37][INFO]: start to run leiden...
[2023-06-29 18:13:52][Stereo][31507][MainThread][140518837950272][st_pipeline][40][INFO]: leiden end, consume time 0.4925s.

Batch evaluation

Evaluate batch effection, and save the output report.

[7]:
ms_data.tl.batch_qc(scope=slice_generator[:],mode='integrate', cluster_res_key='leiden', report_path='./batch_qc', res_key='batch_qc')
[2023-06-29 18:13:54][Stereo][31507][MainThread][140518837950272][ms_pipeline][95][INFO]: register algorithm batch_qc to <class 'stereo.core.stereo_exp_data.StereoExpData'>-140516686761456
[2023-06-29 18:14:11][Stereo][31507][MainThread][140518837950272][classifier][136][INFO]: Model Training Finished!
[2023-06-29 18:14:11][Stereo][31507][MainThread][140518837950272][classifier][137][INFO]: Trained checkpoint file has been saved to ./batch_qc
[2023-06-29 18:15:46][Stereo][31507][MainThread][140518837950272][batchqc_raw][249][INFO]: 2023-06-29 18:15:46 The 'BatchQC_Report.html' has been saved to ./batch_qc

Display report

[8]:
ms_data.plt.show_batch_qc_report(scope=slice_generator[:], mode='integrate', res_key='batch_qc')
[2023-06-29 18:15:46][Stereo][31507][MainThread][140518837950272][ms_pipeline][102][INFO]: register plot_func show_batch_qc_report to <class 'stereo.core.stereo_exp_data.StereoExpData'>-140516686761456
BatchQC-Raw

Batch Effect QC Report

Report By: root

Report Time: 2023-06-29 18:15:46

Data Information

Basic Describe

· Number of samples in each Batch and Condition

Batch0Batch1
Condition111230
Condition201184
Total11231184
Describe Note'condition' is the different control variables designed in the experiment. By default, a batch of data studies the same control variable.

· Variation Analysis (F Test) of Each Data

N_BatchN_SampleFP ValueF Ref (1, 2305)
Score2.00002307.0000273.67770.00003.8455

· K-S Test of Each Pair Batch

N-SamplesK-S StatP-Value
Batch0-12307.00000.15710.0000
Describe NoteKolmogorov-smirnov test (K-S Test) checks whether two data distributions are consistent.

· Domain Variance Score

N_BatchN_SampleTrain SizeAccept Rate
Domain Variance2.00002307.00001614.00000.0000

· Measures of confounding between Batch and Condition

Pearson Correlation CoefficientCramer'S V Coefficient
Confounding Coefficients0.81670.7074
Describe Note0=No Confounding, 1=Complete Confounding. The larger the value, the higher the correlation between data batch and experimental conditions.

· Distribution Visualization

Kernel Distribution Curve Of Umicount TotalCdf Curve Of Umicount Total
Box-Plot Of Umicount TotalUmicount Scatter Plot For Each Spot


Metric Score

Raw

· K Nearest Neighbor Batch Effects Test

Chi Mean95% P ValueAccept RateReject Rate
Score100.00000.00000.00001.0000
Describe NoteLocal area Chi2 test. Local sample richness test. By default, UMAP coordinates are used to calculate region neighbors. default neighbors: 100

· Local Inverse Simpson's Index

Lisi Mean95%Ci Lower95%Ci Upper
Batch1.00001.00001.0000
Leiden1.40061.38271.4185
Describe NoteBy default, UMAP coordinates are used to calculate region neighbors, default n_neighbor: 100. For batch, a larger value indicates that data of different batches in a local area is mixed more evenly.

· K Nearest Neighbor Similarity

Ksim MeanAccept Rate
Leiden0.01000.0000
Describe NoteThe kSIM acceptance rate requires ground truth cell type information and measures whether the neighbors of a cell have the same cell type as it does. If a method overcorrects the batch effects, it will have a low kSIM acceptance rate.


Visualization

· Sample

HeatmapJointUmap-Batch
Umap-Type

· Q-Q Cross Fit

Batch0-1

· Q-Q Fit Norm & Histogram

Batch0Batch1


Summary

ScoreAdaptive ThresholdConclusion
Result0.00000.0500Need to do batch effect removal.