Pre-processing CellRanger outputs (multiple samples, Human PBMCs)#

This notebook demonstrates the pipeline for processing scRNA-seq data, covering metric computation, normalization, clustering, cell type prediction, and the identification/removal of doublets and low-quality clusters. The final output is pre-processed, annotated scRNA-seq data.

[2]:

import piaso
import cosg
import numpy as np
import pandas as pd
import scanpy as sc
import logging
from sklearn.preprocessing import StandardScaler
import warnings
import os
import anndata as ad

from matplotlib import rcParams
sc.settings.set_figure_params(scanpy=True, dpi=80, dpi_save=300, frameon=True, vector_friendly=False, fontsize=14)
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42

/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

[3]:

data_dir = "/n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023"
save_dir = "/n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2"
prefix = "HumanPBMCs_Multiome_RNA"

Download the data#

Human PBMC snMultiome data (SAN1, SAN2) from: De Rop, F.V., Hulselmans, G., Flerin, C. et al. Systematic benchmarking of single-cell ATAC-sequencing protocols. Nat Biotechnol 42, 916–926 (2024). https://doi.org/10.1038/s41587-023-01881-x

Also available via https://drive.google.com/file/d/1g1V-VTy2qHK7PJIbW-t9hRFzHCqFCe4T/view?usp=drive_link

[4]:

# !/home/vas744/Software/gdrive files download --destination /n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023 1g1V-VTy2qHK7PJIbW-t9hRFzHCqFCe4T

[10]:

!/home/vas744/Software/gdrive files download --destination /n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023/SAN2 1q5mXAE6YWVlzTJLSG0qUgnATitml0gzX

Downloading filtered_feature_bc_matrix.h5
Successfully downloaded filtered_feature_bc_matrix.h5

[9]:

!/home/vas744/Software/gdrive files download --destination /n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023/SAN1 1Bjr81oJWto-06ak14IXB804xcLypPbRj

Downloading filtered_feature_bc_matrix.h5
Successfully downloaded filtered_feature_bc_matrix.h5

Load the data#

[27]:

data_path='/n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023'
samples = [entry.name for entry in os.scandir(data_path) if entry.is_dir()]
print(samples)

['SAN1', 'SAN2']

[28]:

def load_multiple_anndata(
    data_path,
    samples,
    key_added):
    anndata_list = []
    for sample in samples:
        try:
            # Load the data
            file_name = os.listdir(data_path+'/'+sample+'/RNA')[0]
            print(file_name)
            h5_path=data_path+'/'+sample+'/RNA/'+file_name
            if os.path.exists(h5_path):
                adata = sc.read_10x_h5(h5_path)
                print('Loading the h5 file')

            else:
                adata = sc.read_10x_mtx(data_path+'/'+sample, cache=True)
            adata.var_names_make_unique()
            adata.obs_names_make_unique()
            adata.obs[key_added]=sample
            anndata_list.append(adata)
            print('Sample loaded: ', sample)
        except Exception as e:
            print(f"Error loading {data_path}: {e}")

    return anndata_list

[29]:

adata_list=load_multiple_anndata(data_path=data_path, samples=samples, key_added='Sample')

filtered_feature_bc_matrix.h5

/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")

Loading the h5 file
Sample loaded:  SAN1
filtered_feature_bc_matrix.h5
Loading the h5 file
Sample loaded:  SAN2

/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")

[30]:

adata_list

[30]:

[AnnData object with n_obs × n_vars = 3545 × 36601
     obs: 'Sample'
     var: 'gene_ids', 'feature_types', 'genome', 'interval',
 AnnData object with n_obs × n_vars = 4360 × 36601
     obs: 'Sample'
     var: 'gene_ids', 'feature_types', 'genome', 'interval']

[33]:

adata=ad.concat(adata_list, join='outer',index_unique="-")
adata

[33]:

AnnData object with n_obs × n_vars = 7905 × 36601
    obs: 'Sample'

[34]:

adata.X.data

[34]:

array([ 1.,  1.,  1., ..., 43., 18., 56.],
      shape=(12509350,), dtype=float32)

Create the ‘raw’ counts layer based on the counts data from the adata object.

[35]:

adata.layers['raw'] = adata.X.copy()
adata

[35]:

AnnData object with n_obs × n_vars = 7905 × 36601
    obs: 'Sample'
    layers: 'raw'

[36]:

adata.var

[36]:


MIR1302-2HG
FAM138A
OR4F5
AL627309.1
AL627309.3
...
AC141272.1
AC023491.2
AC007325.1
AC007325.4
AC007325.2

36601 rows × 0 columns

Quality Control#

Filter cells based on gene counts#

Filter out cells with fewer than 200 detected genes.

[37]:

sc.pp.filter_cells(adata, min_genes=200)

Calculate cell quality metrics#

We identify mitochondrial and ribosomal protein genes and compute their proportion in each cell’s total read count. A high proportion of these reads often indicates low-quality cells. Note: In human data, the gene names have uppercase letters (MT-, RPS-, RPL-). For mouse data, these gene names will start with Mt-, Rps-, Rpl-

[38]:

adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

[39]:

adata.var['ribo'] = adata.var_names.str.startswith('RPS','RPL')
sc.pp.calculate_qc_metrics(adata, qc_vars=['ribo'], percent_top=None, log1p=False, inplace=True)

/tmp/ipykernel_2699886/3834852883.py:1: FutureWarning: Allowing a non-bool 'na' in obj.str.startswith is deprecated and will raise in a future version.
  adata.var['ribo'] = adata.var_names.str.startswith('RPS','RPL')

[40]:

piaso.pl.plot_features_violin(adata,
                              ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo'],
                              width_single=2.0)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_24_0.png

Doublet prediction#

Next, we compute the scrublet score to identify and predict potential doublets.

[41]:

experiments=np.unique(adata.obs['Sample'])
adata.obs['scrublet_score']=np.repeat(0,adata.n_obs)
adata.obs['predicted_doublets']=np.repeat(False,adata.n_obs)

[42]:

import scrublet as scr

for experiment in experiments:
    print(experiment)
    adatai=adata[adata.obs['Sample']==experiment]

    scrub = scr.Scrublet(adatai.X.todense(),random_state=10)
    doublet_scores, predicted_doublets = scrub.scrub_doublets()

    adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets

    adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores

SAN1
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.15
Detected doublet rate = 15.9%
Estimated detectable doublet fraction = 77.1%
Overall doublet rate:
        Expected   = 10.0%
        Estimated  = 20.6%
Elapsed time: 3.1 seconds
SAN2

/tmp/ipykernel_2699886/3464195482.py:10: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets
/tmp/ipykernel_2699886/3464195482.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets
/tmp/ipykernel_2699886/3464195482.py:12: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
/tmp/ipykernel_2699886/3464195482.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
/tmp/ipykernel_2699886/3464195482.py:12: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[0.14470842 0.05152225 0.06938776 ... 0.09004739 0.03024575 0.04338395]' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
  adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores

Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.14
Detected doublet rate = 17.1%
Estimated detectable doublet fraction = 79.9%
Overall doublet rate:
        Expected   = 10.0%
        Estimated  = 21.5%
Elapsed time: 5.5 seconds

/tmp/ipykernel_2699886/3464195482.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets

[43]:

piaso.pl.plot_features_violin(adata,
                              ['scrublet_score'],
                              groupby='Sample',
                              width_single=2)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_28_0.png

log1p Normalization#

[44]:

%%time
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

adata.layers['log1p']=adata.X.copy()

CPU times: user 198 ms, sys: 26.5 ms, total: 225 ms
Wall time: 124 ms

INFOG Normalization#

We use PIASO’s INFOG to normalize the data and identify a highly variable set of genes.

[45]:

%%time
piaso.tl.infog(adata,
               copy=False,
               inplace=False,
               n_top_genes=3000,
               key_added='infog',
               key_added_highly_variable_gene='highly_variable',
               layer='raw')

The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 590 ms, sys: 199 ms, total: 789 ms
Wall time: 792 ms

Dimensionality reduction#

[46]:

%%time
piaso.tl.runSVD(adata,
                use_highly_variable=True,
                n_components=30,
                random_state=10,
                key_added='X_svd',
                layer='infog')

CPU times: user 4.07 s, sys: 6.2 ms, total: 4.07 s
Wall time: 493 ms

[47]:

%%time
sc.pp.neighbors(adata,
                use_rep='X_svd',
                n_neighbors=15,
                random_state=10,
                knn=True,
                method="umap")

sc.tl.umap(adata)

CPU times: user 27.5 s, sys: 92.7 ms, total: 27.6 s
Wall time: 24.2 s

[48]:

sc.pl.umap(adata,
           color=['n_genes_by_counts', 'total_counts','pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
           cmap='Spectral_r',
           palette=piaso.pl.color.d_color1,
           ncols=2,
           size=10,
           frameon=False)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_36_0.png

Save the data#

[13]:

save_dir+'/'+prefix+'_raw_QC.h5ad'

[13]:

'/n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_raw_QC.h5ad'

[50]:

adata.write(save_dir+'/'+prefix+'_raw_QC.h5ad')

[14]:

# adata = sc.read(save_dir+'/'+prefix+'_raw_QC.h5ad')

Filtering cells#

Outlier cells of these thresholds are likely low quality cells, so we filter them out before further downstream analysis.

[15]:

adata=adata[adata.obs['n_genes_by_counts']>500].copy()

[16]:

adata=adata[adata.obs['n_genes_by_counts']<10000].copy()

Remove doublets based on a scrublet score threshold.

[17]:

adata = adata[adata.obs['scrublet_score']<0.3].copy()
adata

[17]:

AnnData object with n_obs × n_vars = 7218 × 36601
    obs: 'Sample', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'scrublet_score', 'predicted_doublets'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'ribo', 'infog_var', 'highly_variable'
    uns: 'Sample_colors', 'log1p', 'neighbors', 'umap'
    obsm: 'X_svd', 'X_umap'
    layers: 'infog', 'log1p', 'raw'
    obsp: 'connectivities', 'distances'

[18]:

sc.pl.umap(adata,
           color=['n_genes_by_counts', 'total_counts','pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
           cmap='Spectral_r',
           palette=piaso.pl.color.d_color1,
           ncols=2,
           size=10,
           frameon=False)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_46_0.png

[19]:

piaso.pl.plot_embeddings_split(adata,
                               color='n_genes_by_counts',
                               cmap=piaso.pl.color.c_color1,
                               palette=piaso.pl.color.d_color1,
                               layer=None,
                               splitby='Sample',
                               size=25,
                               frameon=False,)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_47_0.png

Annotation#

We used the precomputed Human PBMC marker gene sets from the AllenHumanImmuneHealthAtlas_L2 study in PIASOmarkerDB. The original dataset comes from https://www.nature.com/articles/s41586-025-09686-5.

[20]:

marker_gene_df = pd.read_csv("/n/scratch/users/v/vas744/Data/Public/10x_Human_PBMC/PIASOmarkerDB_AllenHumanImmuneHealthAtlas_L2_251219.csv")
marker_gene_df

[20]:

	Cell_Type	Gene	Specificity_Score	Study_Publication	Species	Tissue	Condition
0	Platelet	GP9	16.455748	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
1	Erythrocyte	ALAS2	16.422475	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
2	Platelet	CMTM5	16.400534	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
3	Platelet	TMEM40	16.373334	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
4	Erythrocyte	AHSP	16.338337	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
...	...	...	...	...	...	...	...
1445	Memory CD4 T cell	FXYD7	1.783520	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
1446	Memory CD4 T cell	CRIP1	1.779265	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
1447	Memory CD4 T cell	CCDC167	1.775660	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
1448	Memory CD4 T cell	CD6	1.775110	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal
1449	Memory CD4 T cell	TMEM238	1.769377	AllenHumanImmuneHealthAtlas_L2	Human	blood	normal

1450 rows × 7 columns

Preprocessing cell type-specific marker gene sets#

PIASO’s predictCellTypeByMarker expects the marker_gene_list in DataFrame or dict format. So here, we will create a dictionary with each cell type and their respective marker gene lists.

[21]:

marker_gene_dict = {}
cell_types = np.unique(marker_gene_df['Cell_Type'])
for cell_type in cell_types:
    marker_gene_dict[cell_type] = list(marker_gene_df[marker_gene_df['Cell_Type']==cell_type]['Gene'])

Reordering cell type categories

1. INFOG Normalization#

[22]:

%%time
piaso.tl.infog(adata,
               copy=False,
               inplace=False,
               n_top_genes=3000,
               key_added='infog',
               key_added_highly_variable_gene='highly_variable',
               layer='raw')

The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 747 ms, sys: 28.8 s, total: 29.5 s
Wall time: 29.6 s

Dimensionality reduction#

[23]:

%%time
piaso.tl.runSVD(adata,
                use_highly_variable=True,
                n_components=30,
                random_state=10,
                key_added='X_svd',
                layer='infog')

CPU times: user 5.36 s, sys: 134 ms, total: 5.5 s
Wall time: 894 ms

[24]:

%%time
sc.pp.neighbors(adata,
                use_rep='X_svd',
                n_neighbors=15,
                random_state=10,
                knn=True,
                method="umap")

sc.tl.umap(adata)

CPU times: user 39.3 s, sys: 308 ms, total: 39.6 s
Wall time: 36.3 s

Predict cell types based on marker gene sets#

Here, we use the use_score = False to make the predictions based on p-values, but you can also use the gene set scores with use_score = True.

[25]:

%%time
piaso.tl.predictCellTypeByMarker(adata,
                                 marker_gene_set=marker_gene_dict,
                                 score_method='piaso',
                                 use_rep='X_svd',
                                 use_score=False,
                                 smooth_prediction=True,
                                 inplace=True)

Calculating gene set scores using piaso method...

Scoring gene sets: 100%|██████████| 29/29 [00:16<00:00,  1.75set/s]

Predicting cell types based on marker gene p-values...
Smoothing cell type predictions...
Smoothing cell type predictions from 'CellTypes_predicted_raw' using 7-nearest neighbors
Smoothed predictions stored in adata.obs['CellTypes_predicted_smoothed']
Confidence scores stored in adata.obs['CellTypes_predicted_smoothed_confidence']
Modified 1632 cell labels (22.61% of total)
Cell type prediction completed. Results saved to:
  - adata.obs['CellTypes_predicted']: predicted cell types
  - adata.obsm['CellTypes_predicted_score']: full score matrix
  - adata.obsm['CellTypes_predicted_nlog10pvals']: full -log10(p-value) matrix
  - adata.obs['CellTypes_predicted_nlog10pvals']: maximum -log10(p-values)
  - adata.obs['CellTypes_predicted_raw']: original unsmoothed predictions
  - adata.obs['CellTypes_predicted_confidence_smoothed']: smoothing confidence scores
CPU times: user 6.85 s, sys: 146 ms, total: 6.99 s
Wall time: 24.6 s

[26]:

category_dict = {"Human_PBMC": ['ASDC', 'CD14 monocyte', 'CD16 monocyte', 'CD56bright NK cell',
                                'CD56dim NK cell', 'CD8aa', 'DN T cell', 'Effector B cell',
                                'Erythrocyte', 'ILC', 'Intermediate monocyte', 'MAIT', 'Memory B cell',
                                'Memory CD4 T cell', 'Memory CD8 T cell', 'Naive B cell',
                                'Naive CD4 T cell', 'Naive CD8 T cell', 'Plasma cell', 'Platelet',
                                'Progenitor cell','Proliferating NK cell', 'Proliferating T cell',
                                'Transitional B cell', 'Treg', 'cDC1','cDC2', 'gdT','pDC']}

[27]:

cell_types

[27]:

array(['ASDC', 'CD14 monocyte', 'CD16 monocyte', 'CD56bright NK cell',
       'CD56dim NK cell', 'CD8aa', 'DN T cell', 'Effector B cell',
       'Erythrocyte', 'ILC', 'Intermediate monocyte', 'MAIT',
       'Memory B cell', 'Memory CD4 T cell', 'Memory CD8 T cell',
       'Naive B cell', 'Naive CD4 T cell', 'Naive CD8 T cell',
       'Plasma cell', 'Platelet', 'Progenitor cell',
       'Proliferating NK cell', 'Proliferating T cell',
       'Transitional B cell', 'Treg', 'cDC1', 'cDC2', 'gdT', 'pDC'],
      dtype=object)

[28]:

np.unique(adata.obs['CellTypes_predicted'])

[28]:

array(['CD14 monocyte', 'CD16 monocyte', 'CD56bright NK cell',
       'CD56dim NK cell', 'CD8aa', 'DN T cell', 'Effector B cell', 'ILC',
       'Intermediate monocyte', 'MAIT', 'Memory B cell',
       'Memory CD4 T cell', 'Memory CD8 T cell', 'Naive B cell',
       'Naive CD4 T cell', 'Naive CD8 T cell', 'Plasma cell', 'Platelet',
       'Progenitor cell', 'Proliferating NK cell', 'Transitional B cell',
       'Treg', 'cDC1', 'cDC2', 'gdT', 'pDC'], dtype=object)

[31]:

adata.obs['CellTypes_predicted'] = adata.obs['CellTypes_predicted'].astype('category')

[32]:

adata.obs['CellTypes_predicted'] = adata.obs['CellTypes_predicted'].cat.add_categories(
    [cat for cat in category_dict["Human_PBMC"]
     if cat not in adata.obs['CellTypes_predicted'].cat.categories]
)

adata.obs['CellTypes_predicted'] = adata.obs['CellTypes_predicted'].cat.reorder_categories(
    category_dict["Human_PBMC"]
)
adata

[32]:

AnnData object with n_obs × n_vars = 7218 × 36601
    obs: 'Sample', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'scrublet_score', 'predicted_doublets', 'CellTypes_predicted', 'CellTypes_predicted_nlog10pvals', 'CellTypes_predicted_raw', 'CellTypes_predicted_smoothed', 'CellTypes_predicted_smoothed_confidence', 'CellTypes_predicted_confidence_smoothed'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'ribo', 'infog_var', 'highly_variable'
    uns: 'Sample_colors', 'log1p', 'neighbors', 'umap'
    obsm: 'X_svd', 'X_umap', 'CellTypes_predicted_score', 'CellTypes_predicted_nlog10pvals'
    layers: 'infog', 'log1p', 'raw'
    obsp: 'connectivities', 'distances'

[33]:

sc.pl.umap(adata,
           color=['CellTypes_predicted'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=8,
           legend_fontoutline=1,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_67_0.png

Clustering#

[34]:

%%time
sc.tl.leiden(adata,resolution=2.5,key_added='Leiden')

<timed eval>:1: FutureWarning: In the future, the default backend for leiden will be igraph instead of leidenalg.

 To achieve the future defaults please pass: flavor="igraph" and n_iterations=2.  directed must also be False to work with igraph's implementation.

CPU times: user 4.01 s, sys: 16.3 ms, total: 4.02 s
Wall time: 4.05 s

[35]:

sc.pl.umap(adata,
           color=['CellTypes_predicted','Leiden'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=12,
           legend_fontoutline=2,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_70_0.png

Identify marker genes with COSG#

[36]:

%%time
n_gene=30
cosg.cosg(adata,
          key_added='cosg',
          use_raw=False,
          layer='infog',
          mu=100,
          expressed_pct=0.1,
          remove_lowly_expressed=True,
          n_genes_user=n_gene,
          groupby='Leiden')

CPU times: user 915 ms, sys: 540 ms, total: 1.46 s
Wall time: 1.47 s

[37]:

sc.tl.dendrogram(adata,groupby='Leiden',use_rep='X_svd')
df_tmp=pd.DataFrame(adata.uns['cosg']['names'][:3,]).T
df_tmp=df_tmp.reindex(adata.uns['dendrogram_'+'Leiden']['categories_ordered'])
marker_genes_list={idx: list(row.values) for idx, row in df_tmp.iterrows()}
marker_genes_list = {k: v for k, v in marker_genes_list.items() if not any(isinstance(x, float) for x in v)}

sc.pl.dotplot(adata,
              marker_genes_list,
              groupby='Leiden',
              layer='infog',
              dendrogram=True,
              swap_axes=False,
              standard_scale='var',
              cmap='Spectral_r')

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_73_0.png

[38]:

marker_gene=pd.DataFrame(adata.uns['cosg']['names'])

Removal of low-quality cells#

Here, we identify and remove some low quality clusters based on cell quality metrics and marker gene sets.

[39]:

piaso.pl.plot_features_violin(adata,
                              ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
                              groupby='Leiden')

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_76_0.png

Some of the cells seems to have higher percentage of mitochondrial and ribosomal gene expression, indicating that they are low quality.

[40]:

adata = adata[adata.obs['pct_counts_mt']<20].copy()

[41]:

adata = adata[adata.obs['pct_counts_ribo']<5].copy()

[42]:

sc.pl.umap(adata,
           color=['n_genes_by_counts', 'total_counts','pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
           cmap='Spectral_r',
           palette=piaso.pl.color.d_color1,
           ncols=2,
           size=10,
           frameon=False)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_80_0.png

[61]:

sc.pl.umap(adata,
           color=['Leiden'],
           groups=['29'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=12,
           legend_fontoutline=2,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_81_0.png

Checking potential low quality clusters#

[65]:

cluster_check = '20'
marker_gene[cluster_check].values

[65]:

array(['RPL19', 'RPL32', 'RPL13A', 'RPL28', 'RPS25', 'RPS8', 'RPLP2',
       'RPL14', 'RPS12', 'RPS27', 'MKI67', 'RPL22', 'RPL23A', 'RPS15A',
       'RPL35', 'RPL13', 'RPS18', 'RPL3', 'RPL17', 'RPS3', 'EEF1A1',
       'RPS20', 'RPL12', 'RPS3A', 'RPL18A', 'RPLP0', 'RPS23', 'RPL39',
       'RPS6', 'RPL31'], dtype=object)

[63]:

sc.pl.umap(adata,
           color=['IGHA1', 'SDC1', 'IGLV2-11', 'JCHAIN', 'IGHG4', 'TNFRSF17',
       'IGLV2-14', 'GLDC', 'IGHV3-15', 'TXNDC5', 'IGHG2', 'MIXL1', 'MZB1'],
           cmap=piaso.pl.color.c_color1,
           palette=piaso.pl.color.d_color1,
           ncols=3,
           size=10,
           frameon=False)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_84_0.png

[98]:

sc.pl.dotplot(adata,
              marker_gene[cluster_check].values[:30],
              groupby='Leiden',
              dendrogram=False,
              swap_axes=True,
              standard_scale='var',
              cmap='Spectral_r')

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_85_0.png

[64]:

sc.pl.umap(adata,
           color=['Leiden'],
           groups=['17','20','27','28'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=12,
           legend_fontoutline=2,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_86_0.png

[45]:

piaso.pl.plot_embeddings_split(adata,
                               color='CellTypes_predicted',
                               layer=None,
                               splitby='Sample',
                               size=25,
                               frameon=False,)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_87_0.png

[66]:

adata=adata[~adata.obs['Leiden'].isin(['17','20','27','28'])].copy()
adata

[66]:

AnnData object with n_obs × n_vars = 6591 × 36601
    obs: 'Sample', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'scrublet_score', 'predicted_doublets', 'CellTypes_predicted', 'CellTypes_predicted_nlog10pvals', 'CellTypes_predicted_raw', 'CellTypes_predicted_smoothed', 'CellTypes_predicted_smoothed_confidence', 'CellTypes_predicted_confidence_smoothed', 'Leiden'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'ribo', 'infog_var', 'highly_variable'
    uns: 'Sample_colors', 'log1p', 'neighbors', 'umap', 'CellTypes_predicted_colors', 'Leiden', 'Leiden_colors', 'cosg', 'dendrogram_Leiden'
    obsm: 'X_svd', 'X_umap', 'CellTypes_predicted_score', 'CellTypes_predicted_nlog10pvals'
    layers: 'infog', 'log1p', 'raw'
    obsp: 'connectivities', 'distances'

2. INFOG Normalization#

It is important to re-run INFOG normalization and dimensionality reduction after filtering out cells, to accurately capture the current cell population.

[67]:

%%time
piaso.tl.infog(adata,
               copy=False,
               inplace=False,
               n_top_genes=3000,
               key_added='infog',
               key_added_highly_variable_gene='highly_variable',
               layer='raw')

The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 685 ms, sys: 4.9 s, total: 5.59 s
Wall time: 5.61 s

Dimensionality reduction#

[68]:

%%time
piaso.tl.runSVD(adata,
                use_highly_variable=True,
                n_components=30,
                random_state=10,
                key_added='X_svd',
                layer='infog')

CPU times: user 4.72 s, sys: 0 ns, total: 4.72 s
Wall time: 656 ms

[69]:

%%time
sc.pp.neighbors(adata,
                use_rep='X_svd',
                n_neighbors=15,
                random_state=10,
                knn=True,
                method="umap")

sc.tl.umap(adata)

CPU times: user 25.4 s, sys: 11.8 ms, total: 25.4 s
Wall time: 21 s

Predict cell types based on marker gene sets#

[70]:

%%time
piaso.tl.predictCellTypeByMarker(adata,
                                 marker_gene_set=marker_gene_dict,
                                 score_method='piaso',
                                 use_rep='X_svd',
                                 use_score=False,
                                 key_added = 'CellTypes_predicted',
                                 smooth_prediction=True,
                                 inplace=True)

Calculating gene set scores using piaso method...

Scoring gene sets: 100%|██████████| 29/29 [00:17<00:00,  1.69set/s]

Predicting cell types based on marker gene p-values...
Smoothing cell type predictions...
Smoothing cell type predictions from 'CellTypes_predicted_raw' using 7-nearest neighbors
Smoothed predictions stored in adata.obs['CellTypes_predicted_smoothed']
Confidence scores stored in adata.obs['CellTypes_predicted_smoothed_confidence']
Modified 1402 cell labels (21.27% of total)
Cell type prediction completed. Results saved to:
  - adata.obs['CellTypes_predicted']: predicted cell types
  - adata.obsm['CellTypes_predicted_score']: full score matrix
  - adata.obsm['CellTypes_predicted_nlog10pvals']: full -log10(p-value) matrix
  - adata.obs['CellTypes_predicted_nlog10pvals']: maximum -log10(p-values)
  - adata.obs['CellTypes_predicted_raw']: original unsmoothed predictions
  - adata.obs['CellTypes_predicted_confidence_smoothed']: smoothing confidence scores
CPU times: user 10.3 s, sys: 103 ms, total: 10.4 s
Wall time: 24.6 s

[71]:

sc.pl.umap(adata,
           color=['CellTypes_predicted'],
           groups=['CD8aa'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=8,
           legend_fontoutline=1,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_96_0.png

[72]:

sc.pl.umap(adata,
           color=['CellTypes_predicted'],
           # groups=['CD8aa'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=8,
           legend_fontoutline=1,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_97_0.png

Identify marker genes with COSG#

[73]:

%%time
n_gene=30
cosg.cosg(adata,
          key_added='cosg',
          use_raw=False,
          layer='infog',
          mu=100,
          expressed_pct=0.1,
          remove_lowly_expressed=True,
          n_genes_user=n_gene,
          groupby='CellTypes_predicted')

CPU times: user 723 ms, sys: 98.9 ms, total: 822 ms
Wall time: 825 ms

[74]:

marker_gene=pd.DataFrame(adata.uns['cosg']['names'])
marker_gene

[74]:

	ASDC	CD8aa	CD14 monocyte	CD16 monocyte	CD56bright NK cell	CD56dim NK cell	DN T cell	Effector B cell	Erythrocyte	ILC	...	Naive CD4 T cell	Naive CD8 T cell	Plasma cell	Proliferating NK cell	Transitional B cell	Treg	cDC1	cDC2	gdT	pDC
0	IGLV5-52	AC005944.1	DYSF	AC104809.2	XCL1	BNC2	AC004490.1	SOX5	LINC01241	LHX6	...	ANKRD55	AL590006.1	IGHA1	AL096677.2	RAPGEF5	FOXP3	PLS3-AS1	FCER1A	TRDC	LINC01478
1	AC012313.2	AL031590.1	CRISPLD2	CDKN1C	SPTSSB	MYOM2	AP001360.2	PTCHD1-AS	AL139023.1	AL353689.3	...	EDAR	NRCAM	IGLV7-43	IGKV1-9	TTC28	LINC02694	STRA6	CLEC10A	TRGC1	AC097375.1
2	EPB41L4B	TNIP3	VCAN	GPR20	CABLES1	LINGO2	AL162427.1	FCRL5	AL021937.4	AC005520.3	...	FHIT	NELL2	JCHAIN	IGHV3-66	MYBL2	HACD1	AC090907.1	ENHO	TRGV9	CUX2
3	AC145207.1	ZSCAN26	PADI4	LYPD2	TNFRSF11A	SPON2	PGAM4	CLECL1	AC105213.1	AL109955.2	...	PLCL1	LINC02446	TNFRSF17	AC027279.2	GPM6A	AC013652.1	DCLK3	CD1E	KLRC1	LRRC26
4	SLC24A5	PDGFB	NRG1	PPP1R17	PPP1R9A	MIR181A2HG	ALDH3B2	AL592429.2	CCL26	AL662844.3	...	KANK1	GLIS3	IGHV3-15	DUSP13	AL031595.3	FANK1	AC108169.1	CD1C	CCL5	AC023590.1
5	LINC00664	AL121839.2	FCAR	AC020651.2	ZMAT4	GRIK4	AC007029.1	GPM6A	AC074011.1	AC007370.1	...	TSHZ2	CA6	TXNDC5	TCF23	HS3ST1	RTKN2	ANKRD65	ST18	A2M	EPHB1
6	MOBP	AMIGO1	LUCAT1	CROCC2	SERPINE1	SORBS2	AC010255.1	TNFRSF13B	AC026765.3	AC007998.4	...	MAL	CD8B	IGKV3-15	TRHR	P2RX5	DUSP4	AL118508.1	NCKAP5	PZP	LILRA4
7	AL138999.2	ZSCAN2	AL163541.1	VMO1	KLRC2	FGFBP2	LINC01865	NIPSNAP3B	C12orf56	LRRC66	...	LEF1	FCGBP	MIXL1	FAM215A	AC060234.3	CTLA4	AC026391.1	PKIB	GNLY	PTPRS
8	ZSCAN23	AL359706.1	CXCL8	CKB	AC002429.2	CD160	AL121757.2	MS4A1	SPATA41	AC084082.1	...	ITGA6	LRRN3	IGLC2	AL133297.1	JUP	STAM	PRSS48	MRC1	IL12RB2	PTCRA
9	AL445123.2	STYK1	MCEMP1	LINC02345	RAMP1	LINC01505	TEK	ZBTB20-AS5	CMTM2	PCAT6	...	MLLT3	PDE3B	IGHV3-30	LINC01005	BCL7A	IL2RA	ALOX15	LRP1B	S100B	LAMP5
10	AC002347.1	ZNF681	LIN7A	LINC02085	IGFBP4	RHOBTB3	AC116407.1	EML6	AP001005.3	LINC02815	...	FAM153A	THEMIS	MZB1	AC006960.3	EFCAB11	CDHR3	CYP24A1	CLIC2	LINC00612	KCNA5
11	STRA8	RNFT2	CSF3R	HES4	STYK1	SH2D1B	AC020907.1	ANKRD33B	AC006116.8	NOL4L-DT	...	LINC01891	APBA2	SDC1	AC022517.1	CORO2B	ZC2HC1A	ENPP1	IL1R2	KLRG1	CLEC4C
12	AP001094.2	METTL18	NLRP12	FMNL2	NCAM1	RNF165	AC027130.1	IL7	FXYD6	AC116158.2	...	PRKCA	SNTG2	GPRC5D	PKMYT1	AL365272.1	AC011997.1	DNAAF3	HTR7	TGFBR3	AL513493.1
13	AC112512.1	TMEM116	WLS	CEACAM3	IL12RB2	LINC00298	AC090125.1	OSBPL10	TNS1	TBC1D8-AS1	...	AK5	CACHD1	GLDC	IGLC1	PAOX	IKZF2	CFB	DTNA	DUSP2	SCT
14	NPAS1	BCAT2	PRICKLE1	SMIM25	KLRC3	PRF1	AC018437.3	CD82	PPFIA3	OTX2-AS1	...	MAN1C1	LEF1	IGHV3-33	TERT	TCL1A	MCF2L2	ARHGEF26-AS1	FLT3	CYB561	FAM160A1-DT
15	GATA4	VIPR1-AS1	SPOCK1	NEURL1	ROR1	KLRF1	GFRA3	AL355076.2	AC005332.3	NGB	...	FAM153B	PSMA1	PLAAT2	PAGE2B	AL139020.1	CCDC141	AC107918.5	ADGRG6	SH3BGRL3	LINC01724
16	C20orf144	ZNF268	GLT1D1	CLEC4F	EFHC2	IGFBP7	LINC01339	ANGPTL1	AL589935.1	CCDC40	...	SLC22A23	SH3RF3	IGHV3-23	IGLV2-11	MCF2L	TTN	LINC01845	SH3BP4	AC010175.1	AC018643.1
17	EXOC3L2	LYSMD4	CYP1B1	PAPSS2	XCL2	NMUR1	AC009435.1	FCRL2	AC008649.2	AL161757.4	...	TRABD2A	LEF1-AS1	IGKC	CDC45	TCL6	F5	AL122008.4	CST3	A2M-AS1	LINC01226
18	VWC2	KLRC4	ACSL1	TCF7L2	CCNJL	SLCO4C1	AC015631.1	RALGPS2	C17orf64	AF233439.1	...	IL6ST	AC002460.2	DERL3	IGHV5-51	IGHD	TOX	ADORA3	SLAMF8	KLRD1	AC011893.1
19	CLEC18A	TNFRSF9	VCAN-AS1	C1QA	B4GALT6	LDB2	CA10	FCRLA	AC011092.3	MIA	...	EPHX2	EPHA1-AS1	ITGA8	AC100827.4	CR2	IATPR	CTSV	DANT2	RAB9A	AR
20	LINC02033	STK36	LINC00937	KNDC1	TNFRSF18	PRSS23	AL139042.1	GEN1	AC069549.1	STXBP5L	...	AC139720.1	OXNAD1	ZNF215	IGHGP	PPAN	DCP1B	CERS6-AS1	CYP2S1	NKG7	PHEX
21	AC025263.1	LINC01118	PROK2	ICAM4	GNLY	AKR1C3	AC092844.1	ACP5	LRRC42	HCRTR1	...	CAMK4	CCR7	AC136431.1	AC012676.3	CD79B	AC093865.1	CLEC9A	NEGR1	C1orf21	COL26A1
22	AC091180.2	C21orf62-AS1	TREM1	SFTPD	KLRC1	PDGFD	IMPG1	CD79B	SLC29A1	AC096570.1	...	CCR7	SCML1	IGF1	ANLN	DPF3	CYTOR	AC108449.2	HLA-DRB1	GZMH	LINC00865
23	NRXN2	KLHL25	EREG	LST1	ATP8B4	CHST2	C18orf63	CD79A	TPRN	SRR	...	IGF1R	CD8A	APOBEC3B	AC003043.1	KMO	DUSP16	BBOX1	LHFPL6	GZMA	NRP1
24	SIGLEC15	SLC35C1	AC020916.1	FCGR3A	NCALD	IL2RB	AURKC	EBF1	ZBED5-AS1	AC091057.6	...	NDFIP1	PRKAR1B	IGLV2-14	AL359091.5	IGHM	PHF19	AC093158.1	ITGB5	GZMM	CYYR1
25	AL138694.1	DHRS13	PLAUR	PDE8B	HOPX	NCR1	GPR176	DENND5B	LINC01010	AL359504.2	...	AL109930.1	MAML2	AC009570.2	MELK	CD22	NIBAN1	AC092661.1	HLA-DPB1	IL32	ARPP21
26	AC008764.6	MED12	CD36	IFITM3	CD7	GFOD1	AC087564.1	TLR9	BRI3BP	HTR2C	...	KLHL29	LRRC7	DCC	GINS2	PCDH9	COX10-AS1	LDLRAD2	FBLN2	GZMK	FAM160A1
27	AL031709.1	TRPC1	TNFAIP6	FAM20A	IL2RB	CLIC3	PRSS51	ZDHHC21	PDXP	AL035658.1	...	LINC01550	NR3C2	BMP6	APOBEC3B-AS1	CEP89	LINC00426	IAPP	HLA-DPA1	LYAR	AC104819.3
28	PCBP2-OT1	TULP2	CYP1B1-AS1	MS4A7	NCR1	TTC38	LINC00970	TLR10	ZNF653	MRVI1	...	CHRM3-AS2	SERINC5	NT5DC2	SPC25	ZNF318	LDLRAD4	AC133961.1	ZNF366	HCST	SRPX
29	CPT1C	AC137810.1	RBM47	SIGLEC10	CTSW	ADGRG1	LINC01090	AC090541.1	LRRC37A	AC036176.1	...	TCF7	ABLIM1	ITM2C	AC015878.1	FCRLA	CASK	CADM1	EMP1	TRGC2	LINC00996

30 rows × 26 columns

We can use a dendrogram dot plot to visualize the expression of the top three marker genes of each predicted cell type.

[75]:

sc.tl.dendrogram(adata,groupby='CellTypes_predicted',use_rep='X_svd')
df_tmp=pd.DataFrame(adata.uns['cosg']['names'][:3,]).T
df_tmp=df_tmp.reindex(adata.uns['dendrogram_'+'CellTypes_predicted']['categories_ordered'])
marker_genes_list={idx: list(row.values) for idx, row in df_tmp.iterrows()}
marker_genes_list = {k: v for k, v in marker_genes_list.items() if not any(isinstance(x, float) for x in v)}

sc.pl.dotplot(adata,
              marker_genes_list,
              groupby='CellTypes_predicted',
              layer='infog',
              dendrogram=True,
              swap_axes=False,
              standard_scale='var',
              cmap='Spectral_r')

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_102_0.png

Marker genes for CD8aa monocyte#

[78]:

cluster_check = 'CD8aa'
marker_gene[cluster_check].values

[78]:

array(['AC005944.1', 'AL031590.1', 'TNIP3', 'ZSCAN26', 'PDGFB',
       'AL121839.2', 'AMIGO1', 'ZSCAN2', 'AL359706.1', 'STYK1', 'ZNF681',
       'RNFT2', 'METTL18', 'TMEM116', 'BCAT2', 'VIPR1-AS1', 'ZNF268',
       'LYSMD4', 'KLRC4', 'TNFRSF9', 'STK36', 'LINC01118', 'C21orf62-AS1',
       'KLHL25', 'SLC35C1', 'DHRS13', 'MED12', 'TRPC1', 'TULP2',
       'AC137810.1'], dtype=object)

[79]:

sc.pl.umap(adata,
           color=['RNF26', 'BCKDK', 'STYK1', 'ZNF268', 'RNFT2', 'BICRA-AS1',
       'ZNF510', 'AMIGO1', 'MIER3', 'AC013437.1', 'SLC24A1', 'LAGE3'],
           cmap=piaso.pl.color.c_color1,
           palette=piaso.pl.color.d_color1,
           ncols=3,
           size=10,
           frameon=False)

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_105_0.png

[80]:

sc.pl.dotplot(adata,
              marker_gene[cluster_check].values[:30],
              groupby='CellTypes_predicted',
              dendrogram=False,
              swap_axes=True,
              standard_scale='var',
              cmap='Spectral_r')

../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_106_0.png

Save the data#

[138]:

np.unique(adata.obs['CellTypes_predicted'])

[138]:

array(['ASDC', 'CD14 monocyte', 'CD16 monocyte', 'CD56bright NK cell',
       'CD56dim NK cell', 'CD8aa', 'DN T cell', 'Effector B cell',
       'Erythrocyte', 'ILC', 'Intermediate monocyte', 'MAIT',
       'Memory B cell', 'Memory CD4 T cell', 'Memory CD8 T cell',
       'Naive B cell', 'Naive CD4 T cell', 'Naive CD8 T cell',
       'Plasma cell', 'Proliferating NK cell', 'Transitional B cell',
       'Treg', 'cDC1', 'cDC2', 'gdT', 'pDC'], dtype=object)

[76]:

adata.write(save_dir + '/' + prefix + '_QC.h5ad')

[5]:

adata = sc.read(save_dir + '/' + prefix + '_QC.h5ad')

[77]:

marker_gene.to_csv(save_dir + '/' + prefix + '_CellType_markerGenes.csv')

[81]:

!/home/vas744/Software/gdrive files upload --parent 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0 /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_QC.h5ad

Uploading /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_QC.h5ad
File successfully uploaded
Id: 1bpWCqhadqw29Av9RsqAJqWW7kmEpQOHQ
Name: HumanPBMCs_Multiome_RNA_QC.h5ad
Mime: application/octet-stream
Size: 318.5 MB
Created: 2026-01-20 14:02:15
Modified: 2026-01-20 14:02:15
MD5: bdd76ad91aa88845d0791327f8675972
Shared: True
Parents: 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0
ViewUrl: https://drive.google.com/file/d/1bpWCqhadqw29Av9RsqAJqWW7kmEpQOHQ/view?usp=drivesdk

[82]:

!/home/vas744/Software/gdrive files upload --parent 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0 /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_raw_QC.h5ad

Uploading /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_raw_QC.h5ad
File successfully uploaded
Id: 1DaeasGmgNIBuppSkqT_jiRMDCouxEuiN
Name: HumanPBMCs_Multiome_RNA_raw_QC.h5ad
Mime: application/octet-stream
Size: 407.7 MB
Created: 2026-01-20 14:02:26
Modified: 2026-01-20 14:02:26
MD5: 97852a95d1b331836930c376a5aed3e5
Shared: True
Parents: 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0
ViewUrl: https://drive.google.com/file/d/1DaeasGmgNIBuppSkqT_jiRMDCouxEuiN/view?usp=drivesdk

[83]:

!/home/vas744/Software/gdrive files upload --parent 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0 /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_CellType_markerGenes.csv

Uploading /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_CellType_markerGenes.csv
File successfully uploaded
Id: 1BLabNMEkVlHfbooFe55jD9d1yrcl_66K
Name: HumanPBMCs_Multiome_RNA_CellType_markerGenes.csv
Mime: text/csv
Size: 6.1 KB
Created: 2026-01-20 14:02:28
Modified: 2026-01-20 14:02:28
MD5: 1ac8ebc9c6b336588015cb2be1ed9432
Shared: True
Parents: 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0
ViewUrl: https://drive.google.com/file/d/1BLabNMEkVlHfbooFe55jD9d1yrcl_66K/view?usp=drivesdk

[ ]:

Table of Contents

Pre-processing CellRanger outputs (multiple samples, Human PBMCs)#

Download the data#

Load the data#

Quality Control#

Filter cells based on gene counts#

Calculate cell quality metrics#

Doublet prediction#

log1p Normalization#

INFOG Normalization#

Dimensionality reduction#

Save the data#

Filtering cells#

Annotation#

Preprocessing cell type-specific marker gene sets#

1. INFOG Normalization#

Dimensionality reduction#

Predict cell types based on marker gene sets#

Clustering#

Identify marker genes with COSG#

Removal of low-quality cells#

Checking potential low quality clusters#

2. INFOG Normalization#

Dimensionality reduction#

Predict cell types based on marker gene sets#

Identify marker genes with COSG#

Marker genes for CD8aa monocyte#

Save the data#