Pre-processing CellRanger outputs (multiple samples, Human PBMCs)#

This notebook demonstrates the pipeline for processing scRNA-seq data, covering metric computation, normalization, clustering, cell type prediction, and the identification/removal of doublets and low-quality clusters. The final output is pre-processed, annotated scRNA-seq data.

[2]:
import piaso
import cosg
import numpy as np
import pandas as pd
import scanpy as sc
import logging
from sklearn.preprocessing import StandardScaler
import warnings
import os
import anndata as ad

from matplotlib import rcParams
sc.settings.set_figure_params(scanpy=True, dpi=80, dpi_save=300, frameon=True, vector_friendly=False, fontsize=14)
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42
/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
[3]:
data_dir = "/n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023"
save_dir = "/n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2"
prefix = "HumanPBMCs_Multiome_RNA"

Download the data#

Human PBMC snMultiome data (SAN1, SAN2) from: De Rop, F.V., Hulselmans, G., Flerin, C. et al. Systematic benchmarking of single-cell ATAC-sequencing protocols. Nat Biotechnol 42, 916–926 (2024). https://doi.org/10.1038/s41587-023-01881-x

Also available via https://drive.google.com/file/d/1g1V-VTy2qHK7PJIbW-t9hRFzHCqFCe4T/view?usp=drive_link

[4]:
# !/home/vas744/Software/gdrive files download --destination /n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023 1g1V-VTy2qHK7PJIbW-t9hRFzHCqFCe4T
[10]:
!/home/vas744/Software/gdrive files download --destination /n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023/SAN2 1q5mXAE6YWVlzTJLSG0qUgnATitml0gzX
Downloading filtered_feature_bc_matrix.h5
Successfully downloaded filtered_feature_bc_matrix.h5
[9]:
!/home/vas744/Software/gdrive files download --destination /n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023/SAN1 1Bjr81oJWto-06ak14IXB804xcLypPbRj
Downloading filtered_feature_bc_matrix.h5
Successfully downloaded filtered_feature_bc_matrix.h5

Load the data#

[27]:
data_path='/n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023'
samples = [entry.name for entry in os.scandir(data_path) if entry.is_dir()]
print(samples)
['SAN1', 'SAN2']
[28]:
def load_multiple_anndata(
    data_path,
    samples,
    key_added):
    anndata_list = []
    for sample in samples:
        try:
            # Load the data
            file_name = os.listdir(data_path+'/'+sample+'/RNA')[0]
            print(file_name)
            h5_path=data_path+'/'+sample+'/RNA/'+file_name
            if os.path.exists(h5_path):
                adata = sc.read_10x_h5(h5_path)
                print('Loading the h5 file')

            else:
                adata = sc.read_10x_mtx(data_path+'/'+sample, cache=True)
            adata.var_names_make_unique()
            adata.obs_names_make_unique()
            adata.obs[key_added]=sample
            anndata_list.append(adata)
            print('Sample loaded: ', sample)
        except Exception as e:
            print(f"Error loading {data_path}: {e}")

    return anndata_list
[29]:
adata_list=load_multiple_anndata(data_path=data_path, samples=samples, key_added='Sample')
filtered_feature_bc_matrix.h5
/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
Loading the h5 file
Sample loaded:  SAN1
filtered_feature_bc_matrix.h5
Loading the h5 file
Sample loaded:  SAN2
/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
[30]:
adata_list
[30]:
[AnnData object with n_obs × n_vars = 3545 × 36601
     obs: 'Sample'
     var: 'gene_ids', 'feature_types', 'genome', 'interval',
 AnnData object with n_obs × n_vars = 4360 × 36601
     obs: 'Sample'
     var: 'gene_ids', 'feature_types', 'genome', 'interval']
[33]:
adata=ad.concat(adata_list, join='outer',index_unique="-")
adata
[33]:
AnnData object with n_obs × n_vars = 7905 × 36601
    obs: 'Sample'
[34]:
adata.X.data
[34]:
array([ 1.,  1.,  1., ..., 43., 18., 56.],
      shape=(12509350,), dtype=float32)

Create the ‘raw’ counts layer based on the counts data from the adata object.

[35]:
adata.layers['raw'] = adata.X.copy()
adata
[35]:
AnnData object with n_obs × n_vars = 7905 × 36601
    obs: 'Sample'
    layers: 'raw'
[36]:
adata.var
[36]:
MIR1302-2HG
FAM138A
OR4F5
AL627309.1
AL627309.3
...
AC141272.1
AC023491.2
AC007325.1
AC007325.4
AC007325.2

36601 rows × 0 columns

Quality Control#

Filter cells based on gene counts#

Filter out cells with fewer than 200 detected genes.

[37]:
sc.pp.filter_cells(adata, min_genes=200)

Calculate cell quality metrics#

We identify mitochondrial and ribosomal protein genes and compute their proportion in each cell’s total read count. A high proportion of these reads often indicates low-quality cells. Note: In human data, the gene names have uppercase letters (MT-, RPS-, RPL-). For mouse data, these gene names will start with Mt-, Rps-, Rpl-

[38]:
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
[39]:
adata.var['ribo'] = adata.var_names.str.startswith('RPS','RPL')
sc.pp.calculate_qc_metrics(adata, qc_vars=['ribo'], percent_top=None, log1p=False, inplace=True)
/tmp/ipykernel_2699886/3834852883.py:1: FutureWarning: Allowing a non-bool 'na' in obj.str.startswith is deprecated and will raise in a future version.
  adata.var['ribo'] = adata.var_names.str.startswith('RPS','RPL')
[40]:
piaso.pl.plot_features_violin(adata,
                              ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo'],
                              width_single=2.0)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_24_0.png

Doublet prediction#

Next, we compute the scrublet score to identify and predict potential doublets.

[41]:
experiments=np.unique(adata.obs['Sample'])
adata.obs['scrublet_score']=np.repeat(0,adata.n_obs)
adata.obs['predicted_doublets']=np.repeat(False,adata.n_obs)
[42]:
import scrublet as scr

for experiment in experiments:
    print(experiment)
    adatai=adata[adata.obs['Sample']==experiment]

    scrub = scr.Scrublet(adatai.X.todense(),random_state=10)
    doublet_scores, predicted_doublets = scrub.scrub_doublets()

    adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets

    adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
SAN1
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.15
Detected doublet rate = 15.9%
Estimated detectable doublet fraction = 77.1%
Overall doublet rate:
        Expected   = 10.0%
        Estimated  = 20.6%
Elapsed time: 3.1 seconds
SAN2
/tmp/ipykernel_2699886/3464195482.py:10: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets
/tmp/ipykernel_2699886/3464195482.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets
/tmp/ipykernel_2699886/3464195482.py:12: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
/tmp/ipykernel_2699886/3464195482.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
/tmp/ipykernel_2699886/3464195482.py:12: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[0.14470842 0.05152225 0.06938776 ... 0.09004739 0.03024575 0.04338395]' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
  adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.14
Detected doublet rate = 17.1%
Estimated detectable doublet fraction = 79.9%
Overall doublet rate:
        Expected   = 10.0%
        Estimated  = 21.5%
Elapsed time: 5.5 seconds
/tmp/ipykernel_2699886/3464195482.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets
[43]:
piaso.pl.plot_features_violin(adata,
                              ['scrublet_score'],
                              groupby='Sample',
                              width_single=2)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_28_0.png

log1p Normalization#

[44]:
%%time
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

adata.layers['log1p']=adata.X.copy()
CPU times: user 198 ms, sys: 26.5 ms, total: 225 ms
Wall time: 124 ms

INFOG Normalization#

We use PIASO’s INFOG to normalize the data and identify a highly variable set of genes.

[45]:
%%time
piaso.tl.infog(adata,
               copy=False,
               inplace=False,
               n_top_genes=3000,
               key_added='infog',
               key_added_highly_variable_gene='highly_variable',
               layer='raw')
The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 590 ms, sys: 199 ms, total: 789 ms
Wall time: 792 ms

Dimensionality reduction#

[46]:
%%time
piaso.tl.runSVD(adata,
                use_highly_variable=True,
                n_components=30,
                random_state=10,
                key_added='X_svd',
                layer='infog')
CPU times: user 4.07 s, sys: 6.2 ms, total: 4.07 s
Wall time: 493 ms
[47]:
%%time
sc.pp.neighbors(adata,
                use_rep='X_svd',
                n_neighbors=15,
                random_state=10,
                knn=True,
                method="umap")

sc.tl.umap(adata)
CPU times: user 27.5 s, sys: 92.7 ms, total: 27.6 s
Wall time: 24.2 s
[48]:
sc.pl.umap(adata,
           color=['n_genes_by_counts', 'total_counts','pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
           cmap='Spectral_r',
           palette=piaso.pl.color.d_color1,
           ncols=2,
           size=10,
           frameon=False)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_36_0.png

Save the data#

[13]:
save_dir+'/'+prefix+'_raw_QC.h5ad'
[13]:
'/n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_raw_QC.h5ad'
[50]:
adata.write(save_dir+'/'+prefix+'_raw_QC.h5ad')
[14]:
# adata = sc.read(save_dir+'/'+prefix+'_raw_QC.h5ad')

Filtering cells#

Outlier cells of these thresholds are likely low quality cells, so we filter them out before further downstream analysis.

[15]:
adata=adata[adata.obs['n_genes_by_counts']>500].copy()
[16]:
adata=adata[adata.obs['n_genes_by_counts']<10000].copy()

Remove doublets based on a scrublet score threshold.

[17]:
adata = adata[adata.obs['scrublet_score']<0.3].copy()
adata
[17]:
AnnData object with n_obs × n_vars = 7218 × 36601
    obs: 'Sample', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'scrublet_score', 'predicted_doublets'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'ribo', 'infog_var', 'highly_variable'
    uns: 'Sample_colors', 'log1p', 'neighbors', 'umap'
    obsm: 'X_svd', 'X_umap'
    layers: 'infog', 'log1p', 'raw'
    obsp: 'connectivities', 'distances'
[18]:
sc.pl.umap(adata,
           color=['n_genes_by_counts', 'total_counts','pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
           cmap='Spectral_r',
           palette=piaso.pl.color.d_color1,
           ncols=2,
           size=10,
           frameon=False)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_46_0.png
[19]:
piaso.pl.plot_embeddings_split(adata,
                               color='n_genes_by_counts',
                               cmap=piaso.pl.color.c_color1,
                               palette=piaso.pl.color.d_color1,
                               layer=None,
                               splitby='Sample',
                               size=25,
                               frameon=False,)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_47_0.png

Annotation#

We used the precomputed Human PBMC marker gene sets from the AllenHumanImmuneHealthAtlas_L2 study in PIASOmarkerDB. The original dataset comes from https://www.nature.com/articles/s41586-025-09686-5.

[20]:
marker_gene_df = pd.read_csv("/n/scratch/users/v/vas744/Data/Public/10x_Human_PBMC/PIASOmarkerDB_AllenHumanImmuneHealthAtlas_L2_251219.csv")
marker_gene_df
[20]:
Cell_Type Gene Specificity_Score Study_Publication Species Tissue Condition
0 Platelet GP9 16.455748 AllenHumanImmuneHealthAtlas_L2 Human blood normal
1 Erythrocyte ALAS2 16.422475 AllenHumanImmuneHealthAtlas_L2 Human blood normal
2 Platelet CMTM5 16.400534 AllenHumanImmuneHealthAtlas_L2 Human blood normal
3 Platelet TMEM40 16.373334 AllenHumanImmuneHealthAtlas_L2 Human blood normal
4 Erythrocyte AHSP 16.338337 AllenHumanImmuneHealthAtlas_L2 Human blood normal
... ... ... ... ... ... ... ...
1445 Memory CD4 T cell FXYD7 1.783520 AllenHumanImmuneHealthAtlas_L2 Human blood normal
1446 Memory CD4 T cell CRIP1 1.779265 AllenHumanImmuneHealthAtlas_L2 Human blood normal
1447 Memory CD4 T cell CCDC167 1.775660 AllenHumanImmuneHealthAtlas_L2 Human blood normal
1448 Memory CD4 T cell CD6 1.775110 AllenHumanImmuneHealthAtlas_L2 Human blood normal
1449 Memory CD4 T cell TMEM238 1.769377 AllenHumanImmuneHealthAtlas_L2 Human blood normal

1450 rows × 7 columns

Preprocessing cell type-specific marker gene sets#

PIASO’s predictCellTypeByMarker expects the marker_gene_list in DataFrame or dict format. So here, we will create a dictionary with each cell type and their respective marker gene lists.

[21]:
marker_gene_dict = {}
cell_types = np.unique(marker_gene_df['Cell_Type'])
for cell_type in cell_types:
    marker_gene_dict[cell_type] = list(marker_gene_df[marker_gene_df['Cell_Type']==cell_type]['Gene'])

Reordering cell type categories

1. INFOG Normalization#

[22]:
%%time
piaso.tl.infog(adata,
               copy=False,
               inplace=False,
               n_top_genes=3000,
               key_added='infog',
               key_added_highly_variable_gene='highly_variable',
               layer='raw')
The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 747 ms, sys: 28.8 s, total: 29.5 s
Wall time: 29.6 s

Dimensionality reduction#

[23]:
%%time
piaso.tl.runSVD(adata,
                use_highly_variable=True,
                n_components=30,
                random_state=10,
                key_added='X_svd',
                layer='infog')
CPU times: user 5.36 s, sys: 134 ms, total: 5.5 s
Wall time: 894 ms
[24]:
%%time
sc.pp.neighbors(adata,
                use_rep='X_svd',
                n_neighbors=15,
                random_state=10,
                knn=True,
                method="umap")

sc.tl.umap(adata)
CPU times: user 39.3 s, sys: 308 ms, total: 39.6 s
Wall time: 36.3 s

Predict cell types based on marker gene sets#

Here, we use the use_score = False to make the predictions based on p-values, but you can also use the gene set scores with use_score = True.

[25]:
%%time
piaso.tl.predictCellTypeByMarker(adata,
                                 marker_gene_set=marker_gene_dict,
                                 score_method='piaso',
                                 use_rep='X_svd',
                                 use_score=False,
                                 smooth_prediction=True,
                                 inplace=True)
Calculating gene set scores using piaso method...
Scoring gene sets: 100%|██████████| 29/29 [00:16<00:00,  1.75set/s]
Predicting cell types based on marker gene p-values...
Smoothing cell type predictions...
Smoothing cell type predictions from 'CellTypes_predicted_raw' using 7-nearest neighbors
Smoothed predictions stored in adata.obs['CellTypes_predicted_smoothed']
Confidence scores stored in adata.obs['CellTypes_predicted_smoothed_confidence']
Modified 1632 cell labels (22.61% of total)
Cell type prediction completed. Results saved to:
  - adata.obs['CellTypes_predicted']: predicted cell types
  - adata.obsm['CellTypes_predicted_score']: full score matrix
  - adata.obsm['CellTypes_predicted_nlog10pvals']: full -log10(p-value) matrix
  - adata.obs['CellTypes_predicted_nlog10pvals']: maximum -log10(p-values)
  - adata.obs['CellTypes_predicted_raw']: original unsmoothed predictions
  - adata.obs['CellTypes_predicted_confidence_smoothed']: smoothing confidence scores
CPU times: user 6.85 s, sys: 146 ms, total: 6.99 s
Wall time: 24.6 s
[26]:
category_dict = {"Human_PBMC": ['ASDC', 'CD14 monocyte', 'CD16 monocyte', 'CD56bright NK cell',
                                'CD56dim NK cell', 'CD8aa', 'DN T cell', 'Effector B cell',
                                'Erythrocyte', 'ILC', 'Intermediate monocyte', 'MAIT', 'Memory B cell',
                                'Memory CD4 T cell', 'Memory CD8 T cell', 'Naive B cell',
                                'Naive CD4 T cell', 'Naive CD8 T cell', 'Plasma cell', 'Platelet',
                                'Progenitor cell','Proliferating NK cell', 'Proliferating T cell',
                                'Transitional B cell', 'Treg', 'cDC1','cDC2', 'gdT','pDC']}
[27]:
cell_types
[27]:
array(['ASDC', 'CD14 monocyte', 'CD16 monocyte', 'CD56bright NK cell',
       'CD56dim NK cell', 'CD8aa', 'DN T cell', 'Effector B cell',
       'Erythrocyte', 'ILC', 'Intermediate monocyte', 'MAIT',
       'Memory B cell', 'Memory CD4 T cell', 'Memory CD8 T cell',
       'Naive B cell', 'Naive CD4 T cell', 'Naive CD8 T cell',
       'Plasma cell', 'Platelet', 'Progenitor cell',
       'Proliferating NK cell', 'Proliferating T cell',
       'Transitional B cell', 'Treg', 'cDC1', 'cDC2', 'gdT', 'pDC'],
      dtype=object)
[28]:
np.unique(adata.obs['CellTypes_predicted'])
[28]:
array(['CD14 monocyte', 'CD16 monocyte', 'CD56bright NK cell',
       'CD56dim NK cell', 'CD8aa', 'DN T cell', 'Effector B cell', 'ILC',
       'Intermediate monocyte', 'MAIT', 'Memory B cell',
       'Memory CD4 T cell', 'Memory CD8 T cell', 'Naive B cell',
       'Naive CD4 T cell', 'Naive CD8 T cell', 'Plasma cell', 'Platelet',
       'Progenitor cell', 'Proliferating NK cell', 'Transitional B cell',
       'Treg', 'cDC1', 'cDC2', 'gdT', 'pDC'], dtype=object)
[31]:
adata.obs['CellTypes_predicted'] = adata.obs['CellTypes_predicted'].astype('category')
[32]:
adata.obs['CellTypes_predicted'] = adata.obs['CellTypes_predicted'].cat.add_categories(
    [cat for cat in category_dict["Human_PBMC"]
     if cat not in adata.obs['CellTypes_predicted'].cat.categories]
)

adata.obs['CellTypes_predicted'] = adata.obs['CellTypes_predicted'].cat.reorder_categories(
    category_dict["Human_PBMC"]
)
adata
[32]:
AnnData object with n_obs × n_vars = 7218 × 36601
    obs: 'Sample', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'scrublet_score', 'predicted_doublets', 'CellTypes_predicted', 'CellTypes_predicted_nlog10pvals', 'CellTypes_predicted_raw', 'CellTypes_predicted_smoothed', 'CellTypes_predicted_smoothed_confidence', 'CellTypes_predicted_confidence_smoothed'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'ribo', 'infog_var', 'highly_variable'
    uns: 'Sample_colors', 'log1p', 'neighbors', 'umap'
    obsm: 'X_svd', 'X_umap', 'CellTypes_predicted_score', 'CellTypes_predicted_nlog10pvals'
    layers: 'infog', 'log1p', 'raw'
    obsp: 'connectivities', 'distances'
[33]:
sc.pl.umap(adata,
           color=['CellTypes_predicted'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=8,
           legend_fontoutline=1,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_67_0.png

Clustering#

[34]:
%%time
sc.tl.leiden(adata,resolution=2.5,key_added='Leiden')
<timed eval>:1: FutureWarning: In the future, the default backend for leiden will be igraph instead of leidenalg.

 To achieve the future defaults please pass: flavor="igraph" and n_iterations=2.  directed must also be False to work with igraph's implementation.
CPU times: user 4.01 s, sys: 16.3 ms, total: 4.02 s
Wall time: 4.05 s
[35]:
sc.pl.umap(adata,
           color=['CellTypes_predicted','Leiden'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=12,
           legend_fontoutline=2,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_70_0.png

Identify marker genes with COSG#

[36]:
%%time
n_gene=30
cosg.cosg(adata,
          key_added='cosg',
          use_raw=False,
          layer='infog',
          mu=100,
          expressed_pct=0.1,
          remove_lowly_expressed=True,
          n_genes_user=n_gene,
          groupby='Leiden')
CPU times: user 915 ms, sys: 540 ms, total: 1.46 s
Wall time: 1.47 s
[37]:
sc.tl.dendrogram(adata,groupby='Leiden',use_rep='X_svd')
df_tmp=pd.DataFrame(adata.uns['cosg']['names'][:3,]).T
df_tmp=df_tmp.reindex(adata.uns['dendrogram_'+'Leiden']['categories_ordered'])
marker_genes_list={idx: list(row.values) for idx, row in df_tmp.iterrows()}
marker_genes_list = {k: v for k, v in marker_genes_list.items() if not any(isinstance(x, float) for x in v)}

sc.pl.dotplot(adata,
              marker_genes_list,
              groupby='Leiden',
              layer='infog',
              dendrogram=True,
              swap_axes=False,
              standard_scale='var',
              cmap='Spectral_r')
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_73_0.png
[38]:
marker_gene=pd.DataFrame(adata.uns['cosg']['names'])

Removal of low-quality cells#

Here, we identify and remove some low quality clusters based on cell quality metrics and marker gene sets.

[39]:
piaso.pl.plot_features_violin(adata,
                              ['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
                              groupby='Leiden')
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_76_0.png

Some of the cells seems to have higher percentage of mitochondrial and ribosomal gene expression, indicating that they are low quality.

[40]:
adata = adata[adata.obs['pct_counts_mt']<20].copy()
[41]:
adata = adata[adata.obs['pct_counts_ribo']<5].copy()
[42]:
sc.pl.umap(adata,
           color=['n_genes_by_counts', 'total_counts','pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
           cmap='Spectral_r',
           palette=piaso.pl.color.d_color1,
           ncols=2,
           size=10,
           frameon=False)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_80_0.png
[61]:
sc.pl.umap(adata,
           color=['Leiden'],
           groups=['29'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=12,
           legend_fontoutline=2,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_81_0.png

Checking potential low quality clusters#

[65]:
cluster_check = '20'
marker_gene[cluster_check].values
[65]:
array(['RPL19', 'RPL32', 'RPL13A', 'RPL28', 'RPS25', 'RPS8', 'RPLP2',
       'RPL14', 'RPS12', 'RPS27', 'MKI67', 'RPL22', 'RPL23A', 'RPS15A',
       'RPL35', 'RPL13', 'RPS18', 'RPL3', 'RPL17', 'RPS3', 'EEF1A1',
       'RPS20', 'RPL12', 'RPS3A', 'RPL18A', 'RPLP0', 'RPS23', 'RPL39',
       'RPS6', 'RPL31'], dtype=object)
[63]:
sc.pl.umap(adata,
           color=['IGHA1', 'SDC1', 'IGLV2-11', 'JCHAIN', 'IGHG4', 'TNFRSF17',
       'IGLV2-14', 'GLDC', 'IGHV3-15', 'TXNDC5', 'IGHG2', 'MIXL1', 'MZB1'],
           cmap=piaso.pl.color.c_color1,
           palette=piaso.pl.color.d_color1,
           ncols=3,
           size=10,
           frameon=False)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_84_0.png
[98]:
sc.pl.dotplot(adata,
              marker_gene[cluster_check].values[:30],
              groupby='Leiden',
              dendrogram=False,
              swap_axes=True,
              standard_scale='var',
              cmap='Spectral_r')
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_85_0.png
[64]:
sc.pl.umap(adata,
           color=['Leiden'],
           groups=['17','20','27','28'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=12,
           legend_fontoutline=2,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_86_0.png
[45]:
piaso.pl.plot_embeddings_split(adata,
                               color='CellTypes_predicted',
                               layer=None,
                               splitby='Sample',
                               size=25,
                               frameon=False,)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_87_0.png
[66]:
adata=adata[~adata.obs['Leiden'].isin(['17','20','27','28'])].copy()
adata
[66]:
AnnData object with n_obs × n_vars = 6591 × 36601
    obs: 'Sample', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'scrublet_score', 'predicted_doublets', 'CellTypes_predicted', 'CellTypes_predicted_nlog10pvals', 'CellTypes_predicted_raw', 'CellTypes_predicted_smoothed', 'CellTypes_predicted_smoothed_confidence', 'CellTypes_predicted_confidence_smoothed', 'Leiden'
    var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'ribo', 'infog_var', 'highly_variable'
    uns: 'Sample_colors', 'log1p', 'neighbors', 'umap', 'CellTypes_predicted_colors', 'Leiden', 'Leiden_colors', 'cosg', 'dendrogram_Leiden'
    obsm: 'X_svd', 'X_umap', 'CellTypes_predicted_score', 'CellTypes_predicted_nlog10pvals'
    layers: 'infog', 'log1p', 'raw'
    obsp: 'connectivities', 'distances'

2. INFOG Normalization#

It is important to re-run INFOG normalization and dimensionality reduction after filtering out cells, to accurately capture the current cell population.

[67]:
%%time
piaso.tl.infog(adata,
               copy=False,
               inplace=False,
               n_top_genes=3000,
               key_added='infog',
               key_added_highly_variable_gene='highly_variable',
               layer='raw')
The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 685 ms, sys: 4.9 s, total: 5.59 s
Wall time: 5.61 s

Dimensionality reduction#

[68]:
%%time
piaso.tl.runSVD(adata,
                use_highly_variable=True,
                n_components=30,
                random_state=10,
                key_added='X_svd',
                layer='infog')
CPU times: user 4.72 s, sys: 0 ns, total: 4.72 s
Wall time: 656 ms
[69]:
%%time
sc.pp.neighbors(adata,
                use_rep='X_svd',
                n_neighbors=15,
                random_state=10,
                knn=True,
                method="umap")

sc.tl.umap(adata)
CPU times: user 25.4 s, sys: 11.8 ms, total: 25.4 s
Wall time: 21 s

Predict cell types based on marker gene sets#

[70]:
%%time
piaso.tl.predictCellTypeByMarker(adata,
                                 marker_gene_set=marker_gene_dict,
                                 score_method='piaso',
                                 use_rep='X_svd',
                                 use_score=False,
                                 key_added = 'CellTypes_predicted',
                                 smooth_prediction=True,
                                 inplace=True)
Calculating gene set scores using piaso method...
Scoring gene sets: 100%|██████████| 29/29 [00:17<00:00,  1.69set/s]
Predicting cell types based on marker gene p-values...
Smoothing cell type predictions...
Smoothing cell type predictions from 'CellTypes_predicted_raw' using 7-nearest neighbors
Smoothed predictions stored in adata.obs['CellTypes_predicted_smoothed']
Confidence scores stored in adata.obs['CellTypes_predicted_smoothed_confidence']
Modified 1402 cell labels (21.27% of total)
Cell type prediction completed. Results saved to:
  - adata.obs['CellTypes_predicted']: predicted cell types
  - adata.obsm['CellTypes_predicted_score']: full score matrix
  - adata.obsm['CellTypes_predicted_nlog10pvals']: full -log10(p-value) matrix
  - adata.obs['CellTypes_predicted_nlog10pvals']: maximum -log10(p-values)
  - adata.obs['CellTypes_predicted_raw']: original unsmoothed predictions
  - adata.obs['CellTypes_predicted_confidence_smoothed']: smoothing confidence scores
CPU times: user 10.3 s, sys: 103 ms, total: 10.4 s
Wall time: 24.6 s
[71]:
sc.pl.umap(adata,
           color=['CellTypes_predicted'],
           groups=['CD8aa'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=8,
           legend_fontoutline=1,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_96_0.png
[72]:
sc.pl.umap(adata,
           color=['CellTypes_predicted'],
           # groups=['CD8aa'],
           palette=piaso.pl.color.d_color20,
           legend_fontsize=8,
           legend_fontoutline=1,
           # legend_loc='on data',
           ncols=1,
           size=10,
           frameon=False)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_97_0.png

Identify marker genes with COSG#

[73]:
%%time
n_gene=30
cosg.cosg(adata,
          key_added='cosg',
          use_raw=False,
          layer='infog',
          mu=100,
          expressed_pct=0.1,
          remove_lowly_expressed=True,
          n_genes_user=n_gene,
          groupby='CellTypes_predicted')
CPU times: user 723 ms, sys: 98.9 ms, total: 822 ms
Wall time: 825 ms
[74]:
marker_gene=pd.DataFrame(adata.uns['cosg']['names'])
marker_gene
[74]:
ASDC CD8aa CD14 monocyte CD16 monocyte CD56bright NK cell CD56dim NK cell DN T cell Effector B cell Erythrocyte ILC ... Naive CD4 T cell Naive CD8 T cell Plasma cell Proliferating NK cell Transitional B cell Treg cDC1 cDC2 gdT pDC
0 IGLV5-52 AC005944.1 DYSF AC104809.2 XCL1 BNC2 AC004490.1 SOX5 LINC01241 LHX6 ... ANKRD55 AL590006.1 IGHA1 AL096677.2 RAPGEF5 FOXP3 PLS3-AS1 FCER1A TRDC LINC01478
1 AC012313.2 AL031590.1 CRISPLD2 CDKN1C SPTSSB MYOM2 AP001360.2 PTCHD1-AS AL139023.1 AL353689.3 ... EDAR NRCAM IGLV7-43 IGKV1-9 TTC28 LINC02694 STRA6 CLEC10A TRGC1 AC097375.1
2 EPB41L4B TNIP3 VCAN GPR20 CABLES1 LINGO2 AL162427.1 FCRL5 AL021937.4 AC005520.3 ... FHIT NELL2 JCHAIN IGHV3-66 MYBL2 HACD1 AC090907.1 ENHO TRGV9 CUX2
3 AC145207.1 ZSCAN26 PADI4 LYPD2 TNFRSF11A SPON2 PGAM4 CLECL1 AC105213.1 AL109955.2 ... PLCL1 LINC02446 TNFRSF17 AC027279.2 GPM6A AC013652.1 DCLK3 CD1E KLRC1 LRRC26
4 SLC24A5 PDGFB NRG1 PPP1R17 PPP1R9A MIR181A2HG ALDH3B2 AL592429.2 CCL26 AL662844.3 ... KANK1 GLIS3 IGHV3-15 DUSP13 AL031595.3 FANK1 AC108169.1 CD1C CCL5 AC023590.1
5 LINC00664 AL121839.2 FCAR AC020651.2 ZMAT4 GRIK4 AC007029.1 GPM6A AC074011.1 AC007370.1 ... TSHZ2 CA6 TXNDC5 TCF23 HS3ST1 RTKN2 ANKRD65 ST18 A2M EPHB1
6 MOBP AMIGO1 LUCAT1 CROCC2 SERPINE1 SORBS2 AC010255.1 TNFRSF13B AC026765.3 AC007998.4 ... MAL CD8B IGKV3-15 TRHR P2RX5 DUSP4 AL118508.1 NCKAP5 PZP LILRA4
7 AL138999.2 ZSCAN2 AL163541.1 VMO1 KLRC2 FGFBP2 LINC01865 NIPSNAP3B C12orf56 LRRC66 ... LEF1 FCGBP MIXL1 FAM215A AC060234.3 CTLA4 AC026391.1 PKIB GNLY PTPRS
8 ZSCAN23 AL359706.1 CXCL8 CKB AC002429.2 CD160 AL121757.2 MS4A1 SPATA41 AC084082.1 ... ITGA6 LRRN3 IGLC2 AL133297.1 JUP STAM PRSS48 MRC1 IL12RB2 PTCRA
9 AL445123.2 STYK1 MCEMP1 LINC02345 RAMP1 LINC01505 TEK ZBTB20-AS5 CMTM2 PCAT6 ... MLLT3 PDE3B IGHV3-30 LINC01005 BCL7A IL2RA ALOX15 LRP1B S100B LAMP5
10 AC002347.1 ZNF681 LIN7A LINC02085 IGFBP4 RHOBTB3 AC116407.1 EML6 AP001005.3 LINC02815 ... FAM153A THEMIS MZB1 AC006960.3 EFCAB11 CDHR3 CYP24A1 CLIC2 LINC00612 KCNA5
11 STRA8 RNFT2 CSF3R HES4 STYK1 SH2D1B AC020907.1 ANKRD33B AC006116.8 NOL4L-DT ... LINC01891 APBA2 SDC1 AC022517.1 CORO2B ZC2HC1A ENPP1 IL1R2 KLRG1 CLEC4C
12 AP001094.2 METTL18 NLRP12 FMNL2 NCAM1 RNF165 AC027130.1 IL7 FXYD6 AC116158.2 ... PRKCA SNTG2 GPRC5D PKMYT1 AL365272.1 AC011997.1 DNAAF3 HTR7 TGFBR3 AL513493.1
13 AC112512.1 TMEM116 WLS CEACAM3 IL12RB2 LINC00298 AC090125.1 OSBPL10 TNS1 TBC1D8-AS1 ... AK5 CACHD1 GLDC IGLC1 PAOX IKZF2 CFB DTNA DUSP2 SCT
14 NPAS1 BCAT2 PRICKLE1 SMIM25 KLRC3 PRF1 AC018437.3 CD82 PPFIA3 OTX2-AS1 ... MAN1C1 LEF1 IGHV3-33 TERT TCL1A MCF2L2 ARHGEF26-AS1 FLT3 CYB561 FAM160A1-DT
15 GATA4 VIPR1-AS1 SPOCK1 NEURL1 ROR1 KLRF1 GFRA3 AL355076.2 AC005332.3 NGB ... FAM153B PSMA1 PLAAT2 PAGE2B AL139020.1 CCDC141 AC107918.5 ADGRG6 SH3BGRL3 LINC01724
16 C20orf144 ZNF268 GLT1D1 CLEC4F EFHC2 IGFBP7 LINC01339 ANGPTL1 AL589935.1 CCDC40 ... SLC22A23 SH3RF3 IGHV3-23 IGLV2-11 MCF2L TTN LINC01845 SH3BP4 AC010175.1 AC018643.1
17 EXOC3L2 LYSMD4 CYP1B1 PAPSS2 XCL2 NMUR1 AC009435.1 FCRL2 AC008649.2 AL161757.4 ... TRABD2A LEF1-AS1 IGKC CDC45 TCL6 F5 AL122008.4 CST3 A2M-AS1 LINC01226
18 VWC2 KLRC4 ACSL1 TCF7L2 CCNJL SLCO4C1 AC015631.1 RALGPS2 C17orf64 AF233439.1 ... IL6ST AC002460.2 DERL3 IGHV5-51 IGHD TOX ADORA3 SLAMF8 KLRD1 AC011893.1
19 CLEC18A TNFRSF9 VCAN-AS1 C1QA B4GALT6 LDB2 CA10 FCRLA AC011092.3 MIA ... EPHX2 EPHA1-AS1 ITGA8 AC100827.4 CR2 IATPR CTSV DANT2 RAB9A AR
20 LINC02033 STK36 LINC00937 KNDC1 TNFRSF18 PRSS23 AL139042.1 GEN1 AC069549.1 STXBP5L ... AC139720.1 OXNAD1 ZNF215 IGHGP PPAN DCP1B CERS6-AS1 CYP2S1 NKG7 PHEX
21 AC025263.1 LINC01118 PROK2 ICAM4 GNLY AKR1C3 AC092844.1 ACP5 LRRC42 HCRTR1 ... CAMK4 CCR7 AC136431.1 AC012676.3 CD79B AC093865.1 CLEC9A NEGR1 C1orf21 COL26A1
22 AC091180.2 C21orf62-AS1 TREM1 SFTPD KLRC1 PDGFD IMPG1 CD79B SLC29A1 AC096570.1 ... CCR7 SCML1 IGF1 ANLN DPF3 CYTOR AC108449.2 HLA-DRB1 GZMH LINC00865
23 NRXN2 KLHL25 EREG LST1 ATP8B4 CHST2 C18orf63 CD79A TPRN SRR ... IGF1R CD8A APOBEC3B AC003043.1 KMO DUSP16 BBOX1 LHFPL6 GZMA NRP1
24 SIGLEC15 SLC35C1 AC020916.1 FCGR3A NCALD IL2RB AURKC EBF1 ZBED5-AS1 AC091057.6 ... NDFIP1 PRKAR1B IGLV2-14 AL359091.5 IGHM PHF19 AC093158.1 ITGB5 GZMM CYYR1
25 AL138694.1 DHRS13 PLAUR PDE8B HOPX NCR1 GPR176 DENND5B LINC01010 AL359504.2 ... AL109930.1 MAML2 AC009570.2 MELK CD22 NIBAN1 AC092661.1 HLA-DPB1 IL32 ARPP21
26 AC008764.6 MED12 CD36 IFITM3 CD7 GFOD1 AC087564.1 TLR9 BRI3BP HTR2C ... KLHL29 LRRC7 DCC GINS2 PCDH9 COX10-AS1 LDLRAD2 FBLN2 GZMK FAM160A1
27 AL031709.1 TRPC1 TNFAIP6 FAM20A IL2RB CLIC3 PRSS51 ZDHHC21 PDXP AL035658.1 ... LINC01550 NR3C2 BMP6 APOBEC3B-AS1 CEP89 LINC00426 IAPP HLA-DPA1 LYAR AC104819.3
28 PCBP2-OT1 TULP2 CYP1B1-AS1 MS4A7 NCR1 TTC38 LINC00970 TLR10 ZNF653 MRVI1 ... CHRM3-AS2 SERINC5 NT5DC2 SPC25 ZNF318 LDLRAD4 AC133961.1 ZNF366 HCST SRPX
29 CPT1C AC137810.1 RBM47 SIGLEC10 CTSW ADGRG1 LINC01090 AC090541.1 LRRC37A AC036176.1 ... TCF7 ABLIM1 ITM2C AC015878.1 FCRLA CASK CADM1 EMP1 TRGC2 LINC00996

30 rows × 26 columns

We can use a dendrogram dot plot to visualize the expression of the top three marker genes of each predicted cell type.

[75]:
sc.tl.dendrogram(adata,groupby='CellTypes_predicted',use_rep='X_svd')
df_tmp=pd.DataFrame(adata.uns['cosg']['names'][:3,]).T
df_tmp=df_tmp.reindex(adata.uns['dendrogram_'+'CellTypes_predicted']['categories_ordered'])
marker_genes_list={idx: list(row.values) for idx, row in df_tmp.iterrows()}
marker_genes_list = {k: v for k, v in marker_genes_list.items() if not any(isinstance(x, float) for x in v)}

sc.pl.dotplot(adata,
              marker_genes_list,
              groupby='CellTypes_predicted',
              layer='infog',
              dendrogram=True,
              swap_axes=False,
              standard_scale='var',
              cmap='Spectral_r')
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_102_0.png

Marker genes for CD8aa monocyte#

[78]:
cluster_check = 'CD8aa'
marker_gene[cluster_check].values
[78]:
array(['AC005944.1', 'AL031590.1', 'TNIP3', 'ZSCAN26', 'PDGFB',
       'AL121839.2', 'AMIGO1', 'ZSCAN2', 'AL359706.1', 'STYK1', 'ZNF681',
       'RNFT2', 'METTL18', 'TMEM116', 'BCAT2', 'VIPR1-AS1', 'ZNF268',
       'LYSMD4', 'KLRC4', 'TNFRSF9', 'STK36', 'LINC01118', 'C21orf62-AS1',
       'KLHL25', 'SLC35C1', 'DHRS13', 'MED12', 'TRPC1', 'TULP2',
       'AC137810.1'], dtype=object)
[79]:
sc.pl.umap(adata,
           color=['RNF26', 'BCKDK', 'STYK1', 'ZNF268', 'RNFT2', 'BICRA-AS1',
       'ZNF510', 'AMIGO1', 'MIER3', 'AC013437.1', 'SLC24A1', 'LAGE3'],
           cmap=piaso.pl.color.c_color1,
           palette=piaso.pl.color.d_color1,
           ncols=3,
           size=10,
           frameon=False)
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_105_0.png
[80]:
sc.pl.dotplot(adata,
              marker_gene[cluster_check].values[:30],
              groupby='CellTypes_predicted',
              dendrogram=False,
              swap_axes=True,
              standard_scale='var',
              cmap='Spectral_r')
../_images/notebooks_05_scRNAseqProcessingTutorial_SAN1SAN2_106_0.png

Save the data#

[138]:
np.unique(adata.obs['CellTypes_predicted'])
[138]:
array(['ASDC', 'CD14 monocyte', 'CD16 monocyte', 'CD56bright NK cell',
       'CD56dim NK cell', 'CD8aa', 'DN T cell', 'Effector B cell',
       'Erythrocyte', 'ILC', 'Intermediate monocyte', 'MAIT',
       'Memory B cell', 'Memory CD4 T cell', 'Memory CD8 T cell',
       'Naive B cell', 'Naive CD4 T cell', 'Naive CD8 T cell',
       'Plasma cell', 'Proliferating NK cell', 'Transitional B cell',
       'Treg', 'cDC1', 'cDC2', 'gdT', 'pDC'], dtype=object)
[76]:
adata.write(save_dir + '/' + prefix + '_QC.h5ad')
[5]:
adata = sc.read(save_dir + '/' + prefix + '_QC.h5ad')
[77]:
marker_gene.to_csv(save_dir + '/' + prefix + '_CellType_markerGenes.csv')
[81]:
!/home/vas744/Software/gdrive files upload --parent 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0 /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_QC.h5ad
Uploading /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_QC.h5ad
File successfully uploaded
Id: 1bpWCqhadqw29Av9RsqAJqWW7kmEpQOHQ
Name: HumanPBMCs_Multiome_RNA_QC.h5ad
Mime: application/octet-stream
Size: 318.5 MB
Created: 2026-01-20 14:02:15
Modified: 2026-01-20 14:02:15
MD5: bdd76ad91aa88845d0791327f8675972
Shared: True
Parents: 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0
ViewUrl: https://drive.google.com/file/d/1bpWCqhadqw29Av9RsqAJqWW7kmEpQOHQ/view?usp=drivesdk
[82]:
!/home/vas744/Software/gdrive files upload --parent 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0 /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_raw_QC.h5ad
Uploading /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_raw_QC.h5ad
File successfully uploaded
Id: 1DaeasGmgNIBuppSkqT_jiRMDCouxEuiN
Name: HumanPBMCs_Multiome_RNA_raw_QC.h5ad
Mime: application/octet-stream
Size: 407.7 MB
Created: 2026-01-20 14:02:26
Modified: 2026-01-20 14:02:26
MD5: 97852a95d1b331836930c376a5aed3e5
Shared: True
Parents: 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0
ViewUrl: https://drive.google.com/file/d/1DaeasGmgNIBuppSkqT_jiRMDCouxEuiN/view?usp=drivesdk
[83]:
!/home/vas744/Software/gdrive files upload --parent 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0 /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_CellType_markerGenes.csv
Uploading /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_CellType_markerGenes.csv
File successfully uploaded
Id: 1BLabNMEkVlHfbooFe55jD9d1yrcl_66K
Name: HumanPBMCs_Multiome_RNA_CellType_markerGenes.csv
Mime: text/csv
Size: 6.1 KB
Created: 2026-01-20 14:02:28
Modified: 2026-01-20 14:02:28
MD5: 1ac8ebc9c6b336588015cb2be1ed9432
Shared: True
Parents: 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0
ViewUrl: https://drive.google.com/file/d/1BLabNMEkVlHfbooFe55jD9d1yrcl_66K/view?usp=drivesdk
[ ]: