Performance

This section benchmarks the clustering workflow at several bin sizes.

System requirements

Hardware

Intel Core i5-1135G7 with 32 GB of memory.

Software

OS: WSL (Linux version 4.4.0-19041-Microsoft)

Python: Python 3.8.13

Stereopy: Stereopy 0.6.0 from conda-forge

Test process

Download the mouse brain example data, SS200000135TL_D1.tissue.gef, then run the following script.

import stereo as st
import warnings
warnings.filterwarnings('ignore')

def test_clustering_performance(gef_file, bin_size):
    """Run the clustering workflow on a GEF file at the given bin size."""
    # Load the data and compute QC metrics.
    data = st.io.read_gef(gef_file, bin_size=bin_size)
    data.tl.cal_qc()
    # Keep the raw counts; find_marker_genes reads them via use_raw=True.
    data.tl.raw_checkpoint()
    # Normalize, log-transform, and select highly variable genes.
    data.tl.normalize_total(target_sum=1e4)
    data.tl.log1p()
    data.tl.highly_variable_genes(min_mean=0.0125, max_mean=3, min_disp=0.5, res_key='highly_variable_genes', n_top_genes=None)
    data.tl.scale(zero_center=False)
    # Reduce dimensionality, build the neighbor graph, and compute the UMAP embedding.
    data.tl.pca(use_highly_genes=True, hvg_res_key='highly_variable_genes', n_pcs=20, res_key='pca', svd_solver='arpack')
    data.tl.neighbors(pca_res_key='pca', n_pcs=30, res_key='neighbors', n_jobs=8)
    data.tl.umap(pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap', init_pos='spectral')
    # Cluster with Leiden, then find marker genes for each cluster.
    data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
    data.tl.find_marker_genes(cluster_res_key='leiden', method='t_test', use_highly_genes=False, use_raw=True)
    return data

if __name__ == '__main__':
    gef_file_ = './SS200000135TL_D1.tissue.gef'
    bin_size_ = 50 # or 100 or 200
    print(f'work with path: `{gef_file_}`, bin: {bin_size_}')
    _ = test_clustering_performance(gef_file_, bin_size_)

Clustering performance

Test Clustering Performance with bin50, bin100, bin200 GEF

Bin Size   Number of Cells   Number of Genes   CPU Usage   Max RSS     Elapsed Time (m:ss)
--------   ---------------   ---------------   ---------   ---------   -------------------
50         35890             20816             124%        10.32 GB    3:01.20
100        9111              20816             160%        3.60 GB     0:51.45
200        2342              20816             148%        1.85 GB     0:22.56
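The source does not say how CPU percentage, max RSS, and elapsed time were collected. As one possible approach, the sketch below uses Python's standard resource and time modules to gather comparable figures for a single function call. The helper name measure and the keys of the returned dictionary are hypothetical, and the resource module is only available on Unix-like systems such as WSL.

```python
import resource
import sys
import time


def measure(func, *args, **kwargs):
    """Run func once; return its result plus wall time, CPU usage, and peak RSS.

    Hypothetical helper for illustration, not part of Stereopy.
    """
    before = resource.getrusage(resource.RUSAGE_SELF)
    wall_start = time.perf_counter()
    result = func(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    after = resource.getrusage(resource.RUSAGE_SELF)

    # CPU seconds (user + system) spent by this process during the call.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    rss_bytes = after.ru_maxrss * (1 if sys.platform == 'darwin' else 1024)
    minutes, seconds = divmod(wall, 60)
    stats = {
        'cpu_percent': 100.0 * cpu / wall if wall > 0 else 0.0,
        'max_rss_gb': rss_bytes / 1024 ** 3,
        'elapsed': f'{int(minutes)}:{seconds:05.2f}',
    }
    return result, stats


if __name__ == '__main__':
    # Stand-in workload; in the benchmark script the call would be
    # measure(test_clustering_performance, gef_file_, bin_size_).
    _, stats = measure(sum, range(1_000_000))
    print(stats)
```

Peak RSS read this way covers the whole life of the process up to that point, and RUSAGE_SELF does not include child processes, so the figures may differ from those reported by an external tool.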

find_marker_genes is usually the most time-consuming step of the whole workflow.
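To check which step dominates on a given dataset, each call can be wrapped in a small timer. The following is a minimal sketch using only the standard library; the StepTimer class and its method names are hypothetical, and time.sleep stands in for the real Stereopy calls.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Accumulate wall-clock seconds per named step (hypothetical helper)."""

    def __init__(self):
        self.seconds = {}

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add to any earlier measurement recorded under the same name.
            elapsed = time.perf_counter() - start
            self.seconds[name] = self.seconds.get(name, 0.0) + elapsed

    def slowest(self, top=3):
        """Return up to `top` (name, seconds) pairs, slowest first."""
        ranked = sorted(self.seconds.items(), key=lambda item: item[1], reverse=True)
        return ranked[:top]


if __name__ == '__main__':
    timer = StepTimer()
    # In the benchmark, the body of each block would be a call such as
    # data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden').
    with timer.step('leiden'):
        time.sleep(0.01)
    with timer.step('find_marker_genes'):
        time.sleep(0.05)
    for name, seconds in timer.slowest():
        print(f'{name}: {seconds:.3f} s')
```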

Memory use

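The profile below shows line-by-line memory usage of the clustering process at bin size 50. Each row lists the script line number, the total memory usage after that line ran, the increment over the previous line, the number of times the line executed, and the line itself.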

Note

The find_marker_genes step is excluded from this profile.

The script is test_clustering.py, profiled with the Python module memory_profiler.

 8    592.4 MiB    592.4 MiB           1   @memory_profiler.profile(stream=open("/mnt/d/projects/stereopy_dev/demo_data/SS200000135TL_D1/test_stereopy_mem.log", "w+"))
 9                                         def test_clustering_performance(gef_file, bin_size):
10   1162.1 MiB    569.7 MiB           1       data = st.io.read_gef(gef_file, bin_size=bin_size)
11   1162.6 MiB      0.5 MiB           1       data.tl.cal_qc()
12   1216.4 MiB     53.8 MiB           1       data.tl.raw_checkpoint()
13   1243.3 MiB     26.9 MiB           1       data.tl.normalize_total(target_sum=1e4)
14   1270.2 MiB     26.9 MiB           1       data.tl.log1p()
15   1274.0 MiB      3.9 MiB           1       data.tl.highly_variable_genes(min_mean=0.0125, max_mean=3, min_disp=0.5, res_key='highly_variable_genes', n_top_genes=None)
16   1274.1 MiB      0.1 MiB           1       data.tl.scale(zero_center=False)
17   1339.7 MiB     65.6 MiB           1       data.tl.pca(use_highly_genes=True, hvg_res_key='highly_variable_genes', n_pcs=20, res_key='pca', svd_solver='arpack')
18   1487.9 MiB    148.2 MiB           1       data.tl.neighbors(pca_res_key='pca', n_pcs=30, res_key='neighbors', n_jobs=8)
19   1492.4 MiB      4.5 MiB           1       data.tl.umap(pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap', init_pos='spectral')
20   1518.5 MiB     26.0 MiB           1       data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
21                                             #data.tl.find_marker_genes(cluster_res_key='leiden', method='t_test', use_highly_genes=False, use_raw=True)
22   1518.5 MiB      0.0 MiB           1       return data
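A profile like the one above can be ranked by increment to see which steps add the most memory. The sketch below parses rows in this layout with a regular expression; the function names and dictionary keys are hypothetical, and the sample text is copied from a few rows of the profile above. Rows without memory figures, such as the def line, are skipped.

```python
import re

# One profiled row: line number, usage, increment, occurrences, source line.
ROW = re.compile(
    r'^\s*(?P<line>\d+)\s+(?P<usage>-?[\d.]+) MiB\s+'
    r'(?P<increment>-?[\d.]+) MiB\s+(?P<count>\d+)\s+(?P<code>.*)$'
)


def parse_profile(text):
    """Return one dict per row of the profile that carries memory figures."""
    rows = []
    for raw in text.splitlines():
        match = ROW.match(raw)
        if match is None:
            continue  # header, blank line, or a line that was never executed
        rows.append({
            'line': int(match['line']),
            'usage_mib': float(match['usage']),
            'increment_mib': float(match['increment']),
            'code': match['code'].strip(),
        })
    return rows


def largest_increments(rows, top=3):
    """Return the `top` rows with the largest memory increment."""
    return sorted(rows, key=lambda row: row['increment_mib'], reverse=True)[:top]


SAMPLE = """\
10   1162.1 MiB    569.7 MiB           1       data = st.io.read_gef(gef_file, bin_size=bin_size)
11   1162.6 MiB      0.5 MiB           1       data.tl.cal_qc()
17   1339.7 MiB     65.6 MiB           1       data.tl.pca(use_highly_genes=True, hvg_res_key='highly_variable_genes', n_pcs=20, res_key='pca', svd_solver='arpack')
18   1487.9 MiB    148.2 MiB           1       data.tl.neighbors(pca_res_key='pca', n_pcs=30, res_key='neighbors', n_jobs=8)
22   1518.5 MiB      0.0 MiB           1       return data
"""

if __name__ == '__main__':
    for row in largest_increments(parse_profile(SAMPLE)):
        print(f"line {row['line']}: +{row['increment_mib']} MiB  {row['code'][:40]}")
```

In the full profile, the first row (the decorator line) reports the baseline memory of the process before the function body runs, so it is best left out of such a ranking.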