SingleR#

Single-cell recognition (SingleR) is a tool to automatically annotate a test sample by a reference sample. The core step of SingleR is the calcualation of local spearman coefficients between query sample and the reference (transcriptomic datasets of pure cell types), in order to find the most similar label through continuous fine-tuning [Aran19]. So that it could infer the origin of each cell independently.

Note

When perform SingleR for automatic annotation, the time cost of analysis is much dicated by reference volume. Generally speaking, it takes a relatively long computational time to fulfill the analysis.

Currently, the function supports two methods: cpu and gpu, default to cpu. If clustering is done or there is a cluster_res_key as input, cpu could fullfill the calculating fast. With clustering information, gpu is strongly recommended for remarkable reduction of time consumption.

In addition, you can set fine_tune_times and n_jobs to get acceleration. fine_tune_times is default to 0, which means that calculating will fine-tune until the results decreasing to only 1. So that you could reduce the number of calculation cycles to speed up the calculating. Data may be trotted out and the results will be basically the same. If the data volume is large, the difference between different cycle numbers may be larger.

Important

If want to run cusingler, NVIDIA GPU and cuda-toolkit 11.4 are both necessary.

Install necessary packages by running commands as below:

pip install cusingler

Annotation with cell labels#

We have uploaded three references, BlueprintEncodeData.Rdata, MouseRNAseqData.Rdata and Mouse_brain_ref.anndata075.h5ad, click our data for more details.

The first two are human and mouse references from R package release, apparently by the file suffix. Click complete information, to get all SingleR references in R package.

The third is a downsampling result of raw big data with the volume of about 19GB.

Transform .rda to .h5ad by rda2h5ad.R :

Rscript rda2h5ad.R --infile MouseRNAseqData.Rdata --outfile MouseRNAseqData.h5ad

[1]:
import stereo as st
import warnings
warnings.filterwarnings('ignore')

test_file = '../data/SS200000135TL_D1.cellbin.gef'
ref_file = '../data/Mouse_brain_ref.anndata075.h5ad'

data = st.io.read_gef(test_file, bin_type='cell_bins')
ref = st.io.read_h5ad(ref_file)
[2024-03-28 13:53:37][Stereo][5186][MainThread][140604473845568][reader][1090][INFO]: read_gef begin ...
[2024-03-28 13:53:37][Stereo][5186][MainThread][140604473845568][reader][1334][INFO]: This is GEF file which contains cell bin infomation.
[2024-03-28 13:53:37][Stereo][5186][MainThread][140604473845568][reader][1335][INFO]: bin_type: cell_bins
[2024-03-28 13:53:37][Stereo][5186][MainThread][140604473845568][reader][1341][INFO]: Number of cells: 57133
[2024-03-28 13:53:37][Stereo][5186][MainThread][140604473845568][reader][1344][INFO]: Number of gene: 24670
[2024-03-28 13:53:37][Stereo][5186][MainThread][140604473845568][reader][1347][INFO]: Resolution: 500
[2024-03-28 13:53:37][Stereo][5186][MainThread][140604473845568][reader][1350][INFO]: offsetX: 0
[2024-03-28 13:53:37][Stereo][5186][MainThread][140604473845568][reader][1353][INFO]: offsetY: 0
[2024-03-28 13:53:37][Stereo][5186][MainThread][140604473845568][reader][1356][INFO]: Average number of genes: 223.460693359375
[2024-03-28 13:53:37][Stereo][5186][MainThread][140604473845568][reader][1359][INFO]: Maximum number of genes: 1046
[2024-03-28 13:53:37][Stereo][5186][MainThread][140604473845568][reader][1362][INFO]: Average expression: 399.17034912109375
[2024-03-28 13:53:37][Stereo][5186][MainThread][140604473845568][reader][1365][INFO]: Maximum expression: 2337
[2024-03-28 13:53:37][Stereo][5186][MainThread][140604473845568][reader][1181][INFO]: the matrix has 57133 cells, and 24670 genes.
version is 0.7.14
version is 0.7.14

Get the expression matrix of reference data:

[2]:
ref
[2]:
AnnData object with n_obs × n_vars = 67419 × 20170
    obs: 'ClusterName'
    var: 'Accession'
    uns: 'sn'

It is obvious that both reference and test data are raw counts. In order to fulfill a better SingleR analysis, we need to perform normalization of log1p and normalize_total beforehand.

[3]:
# preprocessing
ref.tl.log1p()
ref.tl.normalize_total()

data.tl.cal_qc()
data.tl.log1p()
data.tl.normalize_total()
[2024-03-28 13:53:38][Stereo][5186][MainThread][140604473845568][st_pipeline][41][INFO]: start to run log1p...
[2024-03-28 13:53:38][Stereo][5186][MainThread][140604473845568][st_pipeline][44][INFO]: log1p end, consume time 0.3899s.
[2024-03-28 13:53:38][Stereo][5186][MainThread][140604473845568][st_pipeline][41][INFO]: start to run normalize_total...
[2024-03-28 13:53:39][Stereo][5186][MainThread][140604473845568][st_pipeline][44][INFO]: normalize_total end, consume time 0.6591s.
[2024-03-28 13:53:39][Stereo][5186][MainThread][140604473845568][st_pipeline][41][INFO]: start to run cal_qc...
[2024-03-28 13:53:39][Stereo][5186][MainThread][140604473845568][st_pipeline][44][INFO]: cal_qc end, consume time 0.1989s.
[2024-03-28 13:53:39][Stereo][5186][MainThread][140604473845568][st_pipeline][41][INFO]: start to run log1p...
[2024-03-28 13:53:39][Stereo][5186][MainThread][140604473845568][st_pipeline][44][INFO]: log1p end, consume time 0.0686s.
[2024-03-28 13:53:39][Stereo][5186][MainThread][140604473845568][st_pipeline][41][INFO]: start to run normalize_total...
[2024-03-28 13:53:39][Stereo][5186][MainThread][140604473845568][st_pipeline][44][INFO]: normalize_total end, consume time 0.0962s.

Important

If you want to use GPU acceleration, the pandas version must be 1.5.3, otherwise an error will be reported during operation, so please do not modify the pandas version at will.

ref_use_col='ClusterName' is just an example, the actual value is determined based on the reference you use.

[4]:
# gpu
data.tl.single_r(
    ref_exp_data=ref,
    ref_use_col='ClusterName',
    res_key='annotation',
    method='rapids'  #  Specifying the method as rapids means using gpu
)
[2024-03-28 13:53:39][Stereo][5186][MainThread][140604473845568][st_pipeline][77][INFO]: register algorithm single_r to <stereo.core.st_pipeline.StPipeline object at 0x7fe0f47d3940>
INFO:numba.core.transforms:finding looplift candidates
INFO:numba.cuda.cudadrv.driver:init
Same genes in same order, genes size: 19973
train data cost time(s): 5
ref data shape: 67419 x 2213 non-zero number: 101137290
qry data shape: 57133 x 2213 non-zero number: 12482958
free GPU device mem(MB): 23190 total need mem(MB): 1577
Load total ref and qry data into GPU device memory
score() used gpu mem(MB): 2484
score data cost time(s): 258
ref data shape: 67419 x 3335 non-zero number: 101137290
qry data shape: 57133 x 3335 non-zero number: 12482958
free GPU device mem(MB): 23354 total need mem(MB): 1715
Load total ref and qry data into GPU device memory
finetune() used gpu mem(MB): 2458
finetune cost time(s): 6467
[2024-03-28 15:46:21][Stereo][5186][MainThread][140604473845568][single_r][200][INFO]: single-r finished, cost 6761.317797660828 seconds
[4]:
bins group first_labels
31679678781665 31679678781665 TEGLU3 TEGLU10
31696858650867 31696858650867 VLMC2 VLMC2
31546534795506 31546534795506 ACTE1 ACTE1
31804232833272 31804232833272 VLMC2 VLMC2
32779190409382 32779190409382 ACTE1 OPC
... ... ... ...
33105607937026 33105607937026 VLMC2 VLMC2
33208687152235 33208687152235 ACNT1 ACNT1
32993938787532 32993938787532 ACTE1 ABC
32534377286979 32534377286979 VLMC2 ABC
32074815786262 32074815786262 VLMC2 VLMC2

57133 rows × 3 columns

Display the annotating result bases on just cells.

[5]:
data.plt.cluster_scatter(res_key='annotation')
[5]:
../_images/Tutorials_SingleR_17_2.png

Annotation after clutering#

We provide the parameter cluster_res_key that SingleR will run in cluster mode. For example below, it will tag leiden with ClusterName.

[6]:
# embedding and clustering
data.tl.pca(use_highly_genes=False, hvg_res_key='highly_variable_genes', n_pcs=20, res_key='pca', svd_solver='arpack')
data.tl.neighbors(pca_res_key='pca', n_pcs=30, res_key='neighbors', n_jobs=8)

data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
[2024-03-28 15:46:25][Stereo][5186][MainThread][140604473845568][st_pipeline][41][INFO]: start to run pca...
[2024-03-28 15:46:32][Stereo][5186][MainThread][140604473845568][st_pipeline][44][INFO]: pca end, consume time 7.2157s.
[2024-03-28 15:46:32][Stereo][5186][MainThread][140604473845568][st_pipeline][41][INFO]: start to run neighbors...
[2024-03-28 15:46:49][Stereo][5186][MainThread][140604473845568][st_pipeline][44][INFO]: neighbors end, consume time 16.5563s.
[2024-03-28 15:46:49][Stereo][5186][MainThread][140604473845568][st_pipeline][41][INFO]: start to run leiden...
[2024-03-28 15:47:52][Stereo][5186][MainThread][140604473845568][pipeline_utils][27][INFO]: Can not find raw data, the data which may have been normalized will be used.
[2024-03-28 15:47:53][Stereo][5186][MainThread][140604473845568][st_pipeline][44][INFO]: leiden end, consume time 63.7051s.

Then run with ref_use_col whose value is ClusterName from reference, and leiden result as cluster_res_key.

ref_use_col='ClusterName' is just an example, the actual value is determined based on the reference you use.

[7]:
data.tl.single_r(
    ref_exp_data=ref,
    ref_use_col='ClusterName',
    cluster_res_key='leiden',
    res_key='annotation'
)
[2024-03-28 15:47:59][Stereo][5186][MainThread][140604473845568][single_r][153][INFO]: start single-r with n_jobs=40 fine_tune_times=0
[2024-03-28 15:47:59][Stereo][5186][MainThread][140604473845568][single_r][155][INFO]: start training ref...
[2024-03-28 15:48:25][Stereo][5186][MainThread][140604473845568][single_r][158][INFO]: training ref finished, cost 25.801949739456177 seconds
[2024-03-28 15:48:29][Stereo][5186][MainThread][140604473845568][single_r][184][INFO]: start scoring test_data...
100%|██████████| 34/34 [00:00<00:00, 19917.09it/s]
INFO:numba.core.transforms:finding looplift candidates
[2024-03-28 15:48:32][Stereo][5186][MainThread][140604473845568][single_r][187][INFO]: scoring test_data finished, cost 3.1273744106292725 seconds
[2024-03-28 15:48:32][Stereo][5186][MainThread][140604473845568][single_r][189][INFO]: start fine-tuning...
[2024-03-28 15:48:34][Stereo][5186][MainThread][140604473845568][single_r][295][INFO]: fine-tuning with test_data(shape=(14, 19973))
14it [00:00, 3610.67it/s]
INFO:numba.core.transforms:finding looplift candidates
INFO:numba.core.transforms:finding looplift candidates
[2024-03-28 15:49:20][Stereo][5186][MainThread][140604473845568][single_r][192][INFO]: fine-tuning finished, cost 48.00927138328552 seconds
[2024-03-28 15:49:20][Stereo][5186][MainThread][140604473845568][single_r][200][INFO]: single-r finished, cost 87.7246572971344 seconds
[7]:
bins group first_labels
bins
31679678781665 31679678781665 TEGLU4 TEGLU10
31696858650867 31696858650867 VLMC2 ABC
31546534795506 31546534795506 VLMC2 ABC
31804232833272 31804232833272 TEGLU7 TEGLU10
32779190409382 32779190409382 VLMC2 ABC
... ... ... ...
33105607937026 33105607937026 VLMC2 ABC
33208687152235 33208687152235 TEGLU23 TEGLU10
32993938787532 32993938787532 VLMC2 ABC
32534377286979 32534377286979 VLMC2 ABC
32074815786262 32074815786262 VLMC2 ABC

57133 rows × 3 columns

Display the annotating result bases on clustering.

[8]:
data.plt.cluster_scatter(res_key='annotation')
[8]:
../_images/Tutorials_SingleR_24_2.png

Performance#

Set fine_tune_times=1, which means the result has been fine-tuned only once. This will make the process faster but less accurate.

[9]:
# read data
test_file = '../data/SS200000135TL_D1.tissue.gef'
ref_file = '../data/MouseRNAseqData.h5ad'

test = st.io.read_gef(test_file, bin_size=100)
ref = st.io.read_h5ad(ref_file)

# preprocessing
test.tl.cal_qc()
test.tl.normalize_total(target_sum=1e4)
test.tl.log1p()
[2024-03-28 15:49:21][Stereo][5186][MainThread][140604473845568][reader][1090][INFO]: read_gef begin ...
path:../data/SS200000135TL_D1.tissue.gef bin:100
               generateBinInfo - 19.800000 cpu sec
Warning:your source data file only has gene column,but you specify return gene_names and gene ids which is invalid!so we will return an empty array which has same size with gene_names for the gene_ids
[2024-03-28 15:49:34][Stereo][5186][MainThread][140604473845568][reader][1268][INFO]: the matrix has 9124 cells, and 24302 genes.
[2024-03-28 15:49:34][Stereo][5186][MainThread][140604473845568][reader][1269][INFO]: read_gef end.
[2024-03-28 15:49:34][Stereo][5186][MainThread][140604473845568][st_pipeline][41][INFO]: start to run cal_qc...
[2024-03-28 15:49:34][Stereo][5186][MainThread][140604473845568][st_pipeline][44][INFO]: cal_qc end, consume time 0.3196s.
[2024-03-28 15:49:34][Stereo][5186][MainThread][140604473845568][st_pipeline][41][INFO]: start to run normalize_total...
[2024-03-28 15:49:34][Stereo][5186][MainThread][140604473845568][st_pipeline][44][INFO]: normalize_total end, consume time 0.1880s.
[2024-03-28 15:49:34][Stereo][5186][MainThread][140604473845568][st_pipeline][41][INFO]: start to run log1p...
[2024-03-28 15:49:34][Stereo][5186][MainThread][140604473845568][st_pipeline][44][INFO]: log1p end, consume time 0.0667s.
[10]:
test.tl.single_r(
    ref_exp_data=ref,
    ref_use_col='label.main',
    res_key='annotation',
    fine_tune_times=1,
    n_jobs=40
)
[2024-03-28 15:49:34][Stereo][5186][MainThread][140604473845568][st_pipeline][77][INFO]: register algorithm single_r to <stereo.core.st_pipeline.StPipeline object at 0x7fdfe9714490>
[2024-03-28 15:49:37][Stereo][5186][MainThread][140604473845568][single_r][153][INFO]: start single-r with n_jobs=40 fine_tune_times=1
[2024-03-28 15:49:37][Stereo][5186][MainThread][140604473845568][single_r][155][INFO]: start training ref...
[2024-03-28 15:49:41][Stereo][5186][MainThread][140604473845568][single_r][158][INFO]: training ref finished, cost 3.4769856929779053 seconds
[2024-03-28 15:49:41][Stereo][5186][MainThread][140604473845568][single_r][184][INFO]: start scoring test_data...
100%|██████████| 18/18 [00:00<00:00, 33855.37it/s]
[2024-03-28 15:49:44][Stereo][5186][MainThread][140604473845568][single_r][187][INFO]: scoring test_data finished, cost 2.730195999145508 seconds
[2024-03-28 15:49:44][Stereo][5186][MainThread][140604473845568][single_r][189][INFO]: start fine-tuning...
[2024-03-28 15:49:44][Stereo][5186][MainThread][140604473845568][single_r][295][INFO]: fine-tuning with test_data(shape=(9124, 16433))
0it [00:00, ?it/s]INFO:numba.core.transforms:finding looplift candidates
9124it [06:16, 24.21it/s]
[2024-03-28 15:56:08][Stereo][5186][MainThread][140604473845568][single_r][192][INFO]: fine-tuning finished, cost 384.396365404129 seconds
[2024-03-28 15:56:08][Stereo][5186][MainThread][140604473845568][single_r][200][INFO]: single-r finished, cost 393.7839856147766 seconds
[10]:
bins group first_labels
13743895356600 13743895356600 Astrocytes Epithelial cells
13743895356700 13743895356700 Neurons Oligodendrocytes
13743895356800 13743895356800 Neurons Neurons
13743895356900 13743895356900 Neurons Neurons
13743895357000 13743895357000 Neurons Neurons
... ... ... ...
57552561777200 57552561777200 Astrocytes Astrocytes
57552561777300 57552561777300 Neurons Neurons
57552561777400 57552561777400 Astrocytes Neurons
57552561777700 57552561777700 Astrocytes Neurons
57552561777800 57552561777800 Neurons Neurons

9124 rows × 3 columns

[11]:
data.plt.cluster_scatter(res_key='annotation')
[11]:
../_images/Tutorials_SingleR_29_2.png
[12]:
test.tl.single_r(
    ref_exp_data=ref,
    ref_use_col='label.main',
    res_key='annotation',
    fine_tune_times=10,
    n_jobs=40
)
[2024-03-28 15:56:14][Stereo][5186][MainThread][140604473845568][single_r][153][INFO]: start single-r with n_jobs=40 fine_tune_times=10
[2024-03-28 15:56:14][Stereo][5186][MainThread][140604473845568][single_r][155][INFO]: start training ref...
[2024-03-28 15:56:20][Stereo][5186][MainThread][140604473845568][single_r][158][INFO]: training ref finished, cost 5.6476099491119385 seconds
[2024-03-28 15:56:20][Stereo][5186][MainThread][140604473845568][single_r][184][INFO]: start scoring test_data...
100%|██████████| 18/18 [00:00<00:00, 19463.13it/s]
[2024-03-28 15:56:23][Stereo][5186][MainThread][140604473845568][single_r][187][INFO]: scoring test_data finished, cost 3.302593946456909 seconds
[2024-03-28 15:56:23][Stereo][5186][MainThread][140604473845568][single_r][189][INFO]: start fine-tuning...
[2024-03-28 15:56:23][Stereo][5186][MainThread][140604473845568][single_r][295][INFO]: fine-tuning with test_data(shape=(9124, 16433))
80it [00:00, 579.34it/s]INFO:numba.core.transforms:finding looplift candidates
9124it [08:06, 18.76it/s]
[2024-03-28 16:04:50][Stereo][5186][MainThread][140604473845568][single_r][192][INFO]: fine-tuning finished, cost 506.8143231868744 seconds
[2024-03-28 16:04:50][Stereo][5186][MainThread][140604473845568][single_r][200][INFO]: single-r finished, cost 520.5640707015991 seconds
[12]:
bins group first_labels
13743895356600 13743895356600 Epithelial cells Epithelial cells
13743895356700 13743895356700 Oligodendrocytes Oligodendrocytes
13743895356800 13743895356800 Neurons Neurons
13743895356900 13743895356900 Neurons Neurons
13743895357000 13743895357000 Neurons Neurons
... ... ... ...
57552561777200 57552561777200 Neurons Astrocytes
57552561777300 57552561777300 Neurons Neurons
57552561777400 57552561777400 Neurons Neurons
57552561777700 57552561777700 Neurons Neurons
57552561777800 57552561777800 Neurons Neurons

9124 rows × 3 columns

[13]:
data.plt.cluster_scatter(res_key='annotation')
[13]:
../_images/Tutorials_SingleR_31_2.png
[14]:
test.tl.single_r(
    ref_exp_data=ref,
    ref_use_col='label.main',
    res_key='annotation',
    method='rapids'
)
[2024-03-28 16:05:46][Stereo][5186][MainThread][140604473845568][single_r][200][INFO]: single-r finished, cost 54.125330209732056 seconds
Same genes in same order, genes size: 16433
train data cost time(s): 1
ref data shape: 358 x 3226 non-zero number: 5883014
qry data shape: 9124 x 3226 non-zero number: 25284828
free GPU device mem(MB): 23701 total need mem(MB): 175
Load total ref and qry data into GPU device memory
score() used gpu mem(MB): 571
score data cost time(s): 10
ref data shape: 358 x 7457 non-zero number: 5883014
qry data shape: 9124 x 7457 non-zero number: 25284828
free GPU device mem(MB): 23647 total need mem(MB): 20
Load total ref and qry data into GPU device memory
finetune() used gpu mem(MB): 470
finetune cost time(s): 21
[14]:
bins group first_labels
13743895356600 13743895356600 Epithelial cells Epithelial cells
13743895356700 13743895356700 Oligodendrocytes Oligodendrocytes
13743895356800 13743895356800 Neurons Neurons
13743895356900 13743895356900 Neurons Neurons
13743895357000 13743895357000 Neurons Neurons
... ... ... ...
57552561777200 57552561777200 Neurons Astrocytes
57552561777300 57552561777300 Neurons Neurons
57552561777400 57552561777400 Neurons Neurons
57552561777700 57552561777700 Neurons Neurons
57552561777800 57552561777800 Neurons Neurons

9124 rows × 3 columns

[15]:
data.plt.cluster_scatter(res_key='annotation')
[15]:
../_images/Tutorials_SingleR_33_2.png