Performance#
This benchmark runs the clustering workflow at several bin sizes to measure its performance.
System requirements#
Hardware
Intel Core i5-1135G7 with 32 GB of memory.
Software
OS: WSL (Linux version 4.4.0-19041-Microsoft)
Python: 3.8.13
Stereopy: 0.6.0 from conda-forge
Test process#
Download the mouse brain example data, SS200000135TL_D1.tissue.gef.
import warnings

import stereo as st

warnings.filterwarnings('ignore')


def test_clustering_performance(gef_file, bin_size):
    # Load the GEF file at the requested bin size.
    data = st.io.read_gef(gef_file, bin_size=bin_size)
    # Quality control, then keep a copy of the raw data before normalization.
    data.tl.cal_qc()
    data.tl.raw_checkpoint()
    data.tl.normalize_total(target_sum=1e4)
    data.tl.log1p()
    data.tl.highly_variable_genes(min_mean=0.0125, max_mean=3, min_disp=0.5, res_key='highly_variable_genes', n_top_genes=None)
    data.tl.scale(zero_center=False)
    # Dimensionality reduction, neighbor graph, embedding and clustering.
    data.tl.pca(use_highly_genes=True, hvg_res_key='highly_variable_genes', n_pcs=20, res_key='pca', svd_solver='arpack')
    data.tl.neighbors(pca_res_key='pca', n_pcs=30, res_key='neighbors', n_jobs=8)
    data.tl.umap(pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap', init_pos='spectral')
    data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
    # Marker genes are computed on the raw data saved by raw_checkpoint().
    data.tl.find_marker_genes(cluster_res_key='leiden', method='t_test', use_highly_genes=False, use_raw=True)
    return data


if __name__ == '__main__':
    gef_file_ = './SS200000135TL_D1.tissue.gef'
    bin_size_ = 50  # or 100 or 200
    print(f'work with path: `{gef_file_}`, bin: {bin_size_}')
    _ = test_clustering_performance(gef_file_, bin_size_)
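To collect wall-clock time, CPU utilisation and peak memory for a run like the one above, one option is Python's standard `time` and `resource` modules. The following is a minimal sketch under stated assumptions, not necessarily how the figures on this page were obtained: `measure` is a hypothetical helper (not part of Stereopy), and it assumes Linux, where `ru_maxrss` is reported in KiB.

```python
import resource
import time


def measure(func, *args, **kwargs):
    """Run func once; return its result plus wall time, CPU % and peak RSS.

    Hypothetical helper for illustration. Assumes Linux, where
    resource.getrusage() reports ru_maxrss in KiB.
    """
    before = resource.getrusage(resource.RUSAGE_SELF)
    wall_start = time.perf_counter()
    result = func(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    after = resource.getrusage(resource.RUSAGE_SELF)

    # CPU time (user + system) spent by this process during the call.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    stats = {
        "wall_seconds": wall,
        # Above 100% means more than one core was busy on average.
        "cpu_percent": 100.0 * cpu / wall if wall > 0 else 0.0,
        # Peak RSS of the whole process so far, converted from KiB to GiB.
        "max_rss_gib": after.ru_maxrss / 1024 ** 2,
    }
    return result, stats


if __name__ == "__main__":
    # Stand-in workload; replace with
    # measure(test_clustering_performance, gef_file_, bin_size_).
    _, stats = measure(sum, range(1_000_000))
    print(stats)
```

Note that `RUSAGE_SELF` covers the threads of the current process only, so work done in separate worker processes would not be counted, and the peak RSS is the high-water mark of the whole process rather than of the measured call alone.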
Clustering performance#
Clustering performance at bin sizes 50, 100 and 200
GEF Bin Size | Number of Cells | Number of Genes | CPU Usage | Max RSS | Elapsed Time (m:ss) |
---|---|---|---|---|---|
50 | 35890 | 20816 | 124% | 10.32 GB | 3:01.20 |
100 | 9111 | 20816 | 160% | 3.60 GB | 0:51.45 |
200 | 2342 | 20816 | 148% | 1.85 GB | 0:22.56 |
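The figures in the table can be compared directly to see how cost scales with bin size. The sketch below only does arithmetic on the numbers reported above; the dictionary layout and the `scaling` helper are hypothetical names introduced for illustration.

```python
# Values copied from the performance table (elapsed time converted to seconds).
runs = {
    50: {"cells": 35890, "max_rss_gb": 10.32, "seconds": 3 * 60 + 1.20},
    100: {"cells": 9111, "max_rss_gb": 3.60, "seconds": 51.45},
    200: {"cells": 2342, "max_rss_gb": 1.85, "seconds": 22.56},
}


def scaling(small_bin, large_bin):
    """Return how many times larger each metric is at the smaller bin size."""
    small, large = runs[small_bin], runs[large_bin]
    return {key: small[key] / large[key] for key in small}


for small_bin, large_bin in [(50, 100), (100, 200)]:
    factors = scaling(small_bin, large_bin)
    summary = ", ".join(f"{key} x{value:.2f}" for key, value in factors.items())
    print(f"bin{small_bin} vs bin{large_bin}: {summary}")
```

On these numbers, halving the bin size multiplies the number of cells by roughly 3.9, close to the factor of 4 expected from quartering the area of each bin. Elapsed time grows by about 3.5 and 2.3, and peak memory by about 2.9 and 1.9, so neither scales strictly in proportion to the cell count in this test.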
find_marker_genes is usually the most time-consuming step of the whole workflow.
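One way to check which step dominates on your own data is to time each call separately. The following is a minimal sketch using only the standard library; `timed` and `slowest_step` are hypothetical helpers, not part of Stereopy.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(step_name, timings):
    """Record the wall-clock duration of the enclosed block in timings."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step_name] = time.perf_counter() - start


def slowest_step(timings):
    """Return the name of the step with the longest recorded duration."""
    return max(timings, key=timings.get)


# Usage inside test_clustering_performance (calls taken from the script above):
#
#     timings = {}
#     with timed('leiden', timings):
#         data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
#     with timed('find_marker_genes', timings):
#         data.tl.find_marker_genes(cluster_res_key='leiden', method='t_test',
#                                   use_highly_genes=False, use_raw=True)
#     print(slowest_step(timings), timings)
```

Wrapping every `data.tl.*` call this way gives a per-step breakdown without changing the workflow itself.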
Memory use#
The following profile shows the memory usage of the clustering workflow at bin size 50.
Note
The find_marker_genes step is skipped in this profile.
The script is test_clustering.py, profiled with the Python module memory_profiler.
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     8    592.4 MiB    592.4 MiB            1   @memory_profiler.profile(stream=open("/mnt/d/projects/stereopy_dev/demo_data/SS200000135TL_D1/test_stereopy_mem.log", "w+"))
     9                                          def test_clustering_performance(gef_file, bin_size):
    10   1162.1 MiB    569.7 MiB            1       data = st.io.read_gef(gef_file, bin_size=bin_size)
    11   1162.6 MiB      0.5 MiB            1       data.tl.cal_qc()
    12   1216.4 MiB     53.8 MiB            1       data.tl.raw_checkpoint()
    13   1243.3 MiB     26.9 MiB            1       data.tl.normalize_total(target_sum=1e4)
    14   1270.2 MiB     26.9 MiB            1       data.tl.log1p()
    15   1274.0 MiB      3.9 MiB            1       data.tl.highly_variable_genes(min_mean=0.0125, max_mean=3, min_disp=0.5, res_key='highly_variable_genes', n_top_genes=None)
    16   1274.1 MiB      0.1 MiB            1       data.tl.scale(zero_center=False)
    17   1339.7 MiB     65.6 MiB            1       data.tl.pca(use_highly_genes=True, hvg_res_key='highly_variable_genes', n_pcs=20, res_key='pca', svd_solver='arpack')
    18   1487.9 MiB    148.2 MiB            1       data.tl.neighbors(pca_res_key='pca', n_pcs=30, res_key='neighbors', n_jobs=8)
    19   1492.4 MiB      4.5 MiB            1       data.tl.umap(pca_res_key='pca', neighbors_res_key='neighbors', res_key='umap', init_pos='spectral')
    20   1518.5 MiB     26.0 MiB            1       data.tl.leiden(neighbors_res_key='neighbors', res_key='leiden')
    21                                              #data.tl.find_marker_genes(cluster_res_key='leiden', method='t_test', use_highly_genes=False, use_raw=True)
    22   1518.5 MiB      0.0 MiB            1       return data
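The profile above can be summarised by ranking the steps by their memory increment. The sketch below only reuses the numbers reported in the profile; the variable names and the `top_steps` helper are hypothetical and introduced for illustration.

```python
# Per-step increments in MiB, copied from the profile above.
increments_mib = {
    "read_gef": 569.7,
    "cal_qc": 0.5,
    "raw_checkpoint": 53.8,
    "normalize_total": 26.9,
    "log1p": 26.9,
    "highly_variable_genes": 3.9,
    "scale": 0.1,
    "pca": 65.6,
    "neighbors": 148.2,
    "umap": 4.5,
    "leiden": 26.0,
}
baseline_mib = 592.4  # memory in use when the function is entered
final_mib = 1518.5    # memory in use when the function returns

total_growth = final_mib - baseline_mib


def top_steps(increments, n=3):
    """Return the n steps with the largest memory increment, largest first."""
    return sorted(increments.items(), key=lambda item: item[1], reverse=True)[:n]


print(f"total growth: {total_growth:.1f} MiB")
for step, mib in top_steps(increments_mib):
    print(f"{step}: {mib:.1f} MiB ({100 * mib / total_growth:.0f}% of growth)")
```

In this run, loading the data with `read_gef` accounts for most of the growth, followed by `neighbors` and `pca`; the remaining steps each add less than about 54 MiB.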