Performance#

This section reports clustering performance measured at several bin sizes.

System requirements#

Hardware

Intel Core i5-1135G7 with 32 GB of memory.

Software

OS: WSL (Linux version 4.4.0-19041-Microsoft)

Python: Python 3.8.13

Stereopy: Stereopy 0.6.0 from conda-forge

Test process#

Download the mouse brain example data, SS200000135TL_D1.tissue.gef.

import stereo as st
import warnings
warnings.filterwarnings('ignore')

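def test_clustering_performance(gef_file, bin_size):
    """Run the full clustering pipeline on a GEF file at the given bin size."""
    # Load the data, compute QC metrics, and keep a checkpoint of the raw counts.
    data = st.io.read_gef(gef_file, bin_size=bin_size)
    data.tl.cal_qc()
    data.tl.raw_checkpoint()
    # Normalize, log-transform, and select highly variable genes.
    data.tl.normalize_total(target_sum=1e4)
    data.tl.log1p()
    data.tl.highly_variable_genes(min_mean=0.0125, max_mean=3, min_disp=0.5, res_key='highly_variable_genes', n_top_genes=None)
    data.tl.scale(zero_center=False)
    # Reduce dimensionality, build the neighbor graph, embed, and cluster.
    # Note: neighbors requests n_pcs=30 while pca computes 20 components;
    # the values are left as used in the benchmark.
    data.tl.pca(use_highly_genes=True, hvg_res_key='highly_variable_genes', n_pcs=20, res_key='pca', svd_solver='arpack')
    data.tl.neighbors(pca_res_key='pca', n_pcs=30, res_key='neighbors', n_jobs=8)
    data.tl.umap(pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap', init_pos='spectral')
    data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
    # Find marker genes for each Leiden cluster, using the raw counts.
    data.tl.find_marker_genes(cluster_res_key='leiden', method='t_test', use_highly_genes=False, use_raw=True)
    return data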

if __name__ == '__main__':
    gef_file_ = './SS200000135TL_D1.tissue.gef'
    bin_size_ = 50 # or 100 or 200
    print(f'work with path: `{gef_file_}`, bin: {bin_size_}')
    _ = test_clustering_performance(gef_file_, bin_size_)

Clustering performance#

Clustering performance on the GEF file at bin50, bin100, and bin200:

Bin size	Number of cells	Number of genes	CPU usage	Max RSS	Elapsed time (m:ss)
50	35890	20816	124%	10.32 GB	3:01.20
100	9111	20816	160%	3.60 GB	0:51.45
200	2342	20816	148%	1.85 GB	0:22.56
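
The source does not state how the CPU percentage, maximum RSS, and elapsed time were collected; the column names resemble what GNU `time -v` reports. As a rough sketch, assuming a Linux-like system where `ru_maxrss` is given in kilobytes, similar figures can be gathered from within Python using the standard `resource` and `time` modules. The helper name `measure` is hypothetical and not part of Stereopy.

```python
import resource
import time


def measure(func, *args, **kwargs):
    """Run func and return (result, stats).

    Hypothetical helper, not part of Stereopy. Only the current process is
    measured; work done in child processes is not counted.
    """
    usage_before = resource.getrusage(resource.RUSAGE_SELF)
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    usage_after = resource.getrusage(resource.RUSAGE_SELF)

    # CPU time is user + system time consumed while func was running.
    cpu_seconds = (
        (usage_after.ru_utime - usage_before.ru_utime)
        + (usage_after.ru_stime - usage_before.ru_stime)
    )
    minutes, seconds = divmod(elapsed, 60)
    stats = {
        "elapsed": f"{int(minutes)}:{seconds:05.2f}",  # m:ss.ss
        "cpu_percent": 100.0 * cpu_seconds / elapsed if elapsed > 0 else 0.0,
        # Assumption: ru_maxrss is in kilobytes (Linux); macOS reports bytes.
        "max_rss_gib": usage_after.ru_maxrss / 1024 ** 2,
    }
    return result, stats


if __name__ == "__main__":
    # Stand-in workload; replace with test_clustering_performance(gef_file, bin_size).
    _, stats = measure(sum, range(5_000_000))
    print(stats)
```

CPU usage is computed as CPU time divided by wall-clock time, so values above 100%, as in the table, indicate that more than one core was busy on average.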

find_marker_genes is usually the most time-consuming step of the whole pipeline.
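
To check which step dominates on a particular dataset, each call can be timed individually. The following is a minimal sketch using only the standard library; `StepTimer` is a hypothetical helper, not a Stereopy API, and the sleeps in the usage section are stand-ins for the real `data.tl.*` calls.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Collect wall-clock durations of named steps (hypothetical helper)."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the step raises.
            self.durations[name] = time.perf_counter() - start

    def slowest(self):
        """Return the (name, seconds) pair with the longest duration."""
        return max(self.durations.items(), key=lambda item: item[1])


if __name__ == "__main__":
    timer = StepTimer()
    # Stand-ins for real calls such as:
    #     with timer.step("leiden"):
    #         data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
    with timer.step("fast_step"):
        time.sleep(0.01)
    with timer.step("slow_step"):
        time.sleep(0.1)
    name, seconds = timer.slowest()
    print(f"slowest step: {name} ({seconds:.2f} s)")
```

Wrapping each pipeline call in its own `with timer.step(...)` block leaves the pipeline code unchanged apart from indentation.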

Memory use#

The following shows the memory usage of the clustering process at bin size 50.

Note

The find_marker_genes step is skipped in this profile.

The script is test_clustering.py, profiled with the Python module memory_profiler.

  Mem usage    Increment Occurrences   Line Contents
  592.4 MiB    592.4 MiB           1   @memory_profiler.profile(stream=open("/mnt/d/projects/stereopy_dev/demo_data/SS200000135TL_D1/test_stereopy_mem.log", "w+"))
                                       def test_clustering_performance(gef_file, bin_size):
 1162.1 MiB    569.7 MiB           1       data = st.io.read_gef(gef_file, bin_size=bin_size)
 1162.6 MiB      0.5 MiB           1       data.tl.cal_qc()
 1216.4 MiB     53.8 MiB           1       data.tl.raw_checkpoint()
 1243.3 MiB     26.9 MiB           1       data.tl.normalize_total(target_sum=1e4)
 1270.2 MiB     26.9 MiB           1       data.tl.log1p()
 1274.0 MiB      3.9 MiB           1       data.tl.highly_variable_genes(min_mean=0.0125, max_mean=3, min_disp=0.5, res_key='highly_variable_genes', n_top_genes=None)
 1274.1 MiB      0.1 MiB           1       data.tl.scale(zero_center=False)
 1339.7 MiB     65.6 MiB           1       data.tl.pca(use_highly_genes=True, hvg_res_key='highly_variable_genes', n_pcs=20, res_key='pca', svd_solver='arpack')
 1487.9 MiB    148.2 MiB           1       data.tl.neighbors(pca_res_key='pca', n_pcs=30, res_key='neighbors', n_jobs=8)
 1492.4 MiB      4.5 MiB           1       data.tl.umap(pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap', init_pos='spectral')
 1518.5 MiB     26.0 MiB           1       data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
                                           #data.tl.find_marker_genes(cluster_res_key='leiden', method='t_test', use_highly_genes=False, use_raw=True)
 1518.5 MiB      0.0 MiB           1       return data
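
To see at a glance which steps account for most of the memory growth, the increments in the profile above can be ranked. The snippet below is an illustrative sketch: it copies the per-step increments from the profile by hand rather than parsing the log file, and `rank_increments` is a hypothetical helper name.

```python
# Increment column (MiB) copied from the memory_profiler output above (bin size 50).
INCREMENTS_MIB = {
    "read_gef": 569.7,
    "cal_qc": 0.5,
    "raw_checkpoint": 53.8,
    "normalize_total": 26.9,
    "log1p": 26.9,
    "highly_variable_genes": 3.9,
    "scale": 0.1,
    "pca": 65.6,
    "neighbors": 148.2,
    "umap": 4.5,
    "leiden": 26.0,
}


def rank_increments(increments, top=3):
    """Return the `top` steps with the largest memory increment, largest first."""
    ranked = sorted(increments.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


if __name__ == "__main__":
    total = sum(INCREMENTS_MIB.values())
    for step, mib in rank_increments(INCREMENTS_MIB):
        print(f"{step:<12} {mib:7.1f} MiB  ({100 * mib / total:.0f}% of growth)")
```

The increments sum to 926.1 MiB, which matches the difference between the final usage (1518.5 MiB) and the baseline (592.4 MiB). Reading the GEF file accounts for about 62% of that growth, followed by neighbors and pca.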