scTransform
scTransform is a normalization method for single-cell RNA sequencing data. It uses regularized negative binomial regression to normalize the UMI expression matrix [Hafemeister19].
Differences between methods
Before exploring scTransform, let’s review what classic normalization does.
[1]:
import numpy as np
import pandas as pd
import stereo as st
from stereo.algorithm.sctransform.plotting import plot_log_normalize_var

# read data
data1 = st.io.read_gef('./SS200000135TL_D1.tissue.gef')
data1.sparse2array()

# per-gene geometric mean of the raw counts (pseudocount of 1), taken before normalization
gmean = np.exp(np.log(data1.exp_matrix.T + 1).mean(1)) - 1

# preprocessing: save the raw data, normalize total counts per cell, then apply log1p
data1.tl.raw_checkpoint()
data1.tl.normalize_total(target_sum=1e4)
data1.tl.log1p()

# per-gene variance of the log-normalized expression, paired with the raw geometric mean
log_normalize_result = pd.DataFrame(
    [gmean, data1.exp_matrix.T.var(1)],
    index=['gmean', 'log_normalize_variance'],
    columns=data1.gene_names,
).T
fig1 = plot_log_normalize_var(log_normalize_result)
After log1p normalization, lowly expressed genes clearly contribute very little variance in this sample.
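The quantities plotted above can be reproduced on a small array without loading a GEF file. The following is a minimal NumPy sketch, not Stereopy's implementation: the helper names `geometric_mean` and `log_normalize` and the toy matrix are hypothetical, and `log_normalize` only assumes that `normalize_total` followed by `log1p` scales each cell to `target_sum` total counts before taking log(1 + x).

```python
import numpy as np

def geometric_mean(counts):
    """Per-gene geometric mean with a pseudocount of 1 (rows are cells, columns are genes)."""
    counts = np.asarray(counts, dtype=float)
    return np.exp(np.log(counts + 1).mean(axis=0)) - 1

def log_normalize(counts, target_sum=1e4):
    """Scale every cell to `target_sum` total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)  # assumes no cell has zero total counts
    return np.log1p(counts / totals * target_sum)

# Hypothetical toy matrix: 4 cells x 3 genes (low, medium, and high expression).
toy_counts = np.array([
    [0, 10, 90],
    [1, 20, 79],
    [0, 30, 70],
    [0, 40, 60],
])

gmean = geometric_mean(toy_counts)                # computed on raw counts
log_var = log_normalize(toy_counts).var(axis=0)   # per-gene variance after log-normalization

for name, g, v in zip(["low", "medium", "high"], gmean, log_var):
    print(f"{name:>6}: gmean={g:.3f}, log-normalized variance={v:.3f}")
```

The geometric mean is taken on the raw counts and the variance on the log-normalized values, which mirrors the order of operations in the cell above.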
[2]:
from stereo.algorithm.sctransform.plotting import plot_residual_var

# read the same data again so that scTransform starts from raw counts
data2 = st.io.read_gef('./SS200000135TL_D1.tissue.gef')

# run scTransform; the result is stored under res_key
data2.tl.sctransform(res_key='sctransform', inplace=True, filter_hvgs=True)
fig2 = plot_residual_var(data2.tl.result['sctransform'])
In contrast, scTransform converts the gene expression matrix from raw counts to Pearson residuals. Unlike log1p normalization, scTransform balances the variance distribution across all genes, so lowly expressed genes carry signal as well as highly expressed ones.
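To make the idea of a Pearson residual concrete, here is a simplified sketch. It is not the regularized per-gene regression that scTransform performs: the expected count `mu` comes from a closed-form estimate (cell total times gene total divided by the grand total), and a single fixed overdispersion `theta` is shared by all genes. The function name `pearson_residuals`, the value `theta=100.0`, the clipping at the square root of the number of cells, and the synthetic Poisson counts are all assumptions chosen for illustration.

```python
import numpy as np

def pearson_residuals(counts, theta=100.0):
    """Negative binomial Pearson residuals with a fixed, shared theta (simplified).

    Rows are cells, columns are genes. Assumes every cell and every gene has
    at least one count, so that the expected counts are strictly positive.
    """
    counts = np.asarray(counts, dtype=float)
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    # Expected count if a gene's share of every cell's total were constant.
    mu = cell_totals * gene_totals / counts.sum()
    # Negative binomial variance: mu + mu^2 / theta.
    residuals = (counts - mu) / np.sqrt(mu + mu**2 / theta)
    # Clip extreme residuals to +/- sqrt(number of cells).
    clip = np.sqrt(counts.shape[0])
    return np.clip(residuals, -clip, clip)

rng = np.random.default_rng(0)
# Hypothetical synthetic counts: 2000 cells x 3 genes with low, medium, and high means.
toy_counts = rng.poisson(lam=[0.05, 1.0, 50.0], size=(2000, 3))

residuals = pearson_residuals(toy_counts)
residual_var = residuals.var(axis=0)  # per-gene residual variance
print(residual_var)
```

Each residual is the observed count minus the expected count, divided by the standard deviation the model predicts for that expected count. This division puts genes with very different expression levels on a comparable scale.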
Next, take a few genes from a real dataset and compare their variance after scTransform normalization with their variance after log1p normalization.
[3]:
from stereo.algorithm.sctransform.plotting import plot_genes_var_contribution

data3 = st.io.read_gef('./SS200000135TL_D1.tissue.gef')

# compute QC metrics, then show the spatial expression of one example gene
data3.tl.cal_qc()
data3.plt.spatial_scatter_by_gene(gene_name='Th')

# plot the variance contribution of the selected genes
fig3 = plot_genes_var_contribution(data3, gene_names=['Ptgds', 'Hbb-bs', 'Kcnip4', 'Gm28928', 'Trpm3', 'Th'])