Pre-processing CellRanger outputs (multiple samples, Human PBMCs)#
This notebook demonstrates the pipeline for processing scRNA-seq data, covering metric computation, normalization, clustering, cell type prediction, and the identification/removal of doublets and low-quality clusters. The final output is pre-processed, annotated scRNA-seq data.
[2]:
import piaso
import cosg
import numpy as np
import pandas as pd
import scanpy as sc
import logging
from sklearn.preprocessing import StandardScaler
import warnings
import os
import anndata as ad
from matplotlib import rcParams
sc.settings.set_figure_params(scanpy=True, dpi=80, dpi_save=300, frameon=True, vector_friendly=False, fontsize=14)
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['ps.fonttype'] = 42
/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
[3]:
data_dir = "/n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023"
save_dir = "/n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2"
prefix = "HumanPBMCs_Multiome_RNA"
Download the data#
Human PBMC snMultiome data (SAN1, SAN2) from: De Rop, F.V., Hulselmans, G., Flerin, C. et al. Systematic benchmarking of single-cell ATAC-sequencing protocols. Nat Biotechnol 42, 916–926 (2024). https://doi.org/10.1038/s41587-023-01881-x
Also available via https://drive.google.com/file/d/1g1V-VTy2qHK7PJIbW-t9hRFzHCqFCe4T/view?usp=drive_link
[4]:
# !/home/vas744/Software/gdrive files download --destination /n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023 1g1V-VTy2qHK7PJIbW-t9hRFzHCqFCe4T
[10]:
!/home/vas744/Software/gdrive files download --destination /n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023/SAN2 1q5mXAE6YWVlzTJLSG0qUgnATitml0gzX
Downloading filtered_feature_bc_matrix.h5
Successfully downloaded filtered_feature_bc_matrix.h5
[9]:
!/home/vas744/Software/gdrive files download --destination /n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023/SAN1 1Bjr81oJWto-06ak14IXB804xcLypPbRj
Downloading filtered_feature_bc_matrix.h5
Successfully downloaded filtered_feature_bc_matrix.h5
Load the data#
[27]:
data_path='/n/scratch/users/v/vas744/Data/Public/PBMCMultiomeRop2023'
samples = [entry.name for entry in os.scandir(data_path) if entry.is_dir()]
print(samples)
['SAN1', 'SAN2']
[28]:
def load_multiple_anndata(
data_path,
samples,
key_added):
anndata_list = []
for sample in samples:
try:
# Load the data
file_name = os.listdir(data_path+'/'+sample+'/RNA')[0]
print(file_name)
h5_path=data_path+'/'+sample+'/RNA/'+file_name
if os.path.exists(h5_path):
adata = sc.read_10x_h5(h5_path)
print('Loading the h5 file')
else:
adata = sc.read_10x_mtx(data_path+'/'+sample, cache=True)
adata.var_names_make_unique()
adata.obs_names_make_unique()
adata.obs[key_added]=sample
anndata_list.append(adata)
print('Sample loaded: ', sample)
except Exception as e:
print(f"Error loading {data_path}: {e}")
return anndata_list
[29]:
adata_list=load_multiple_anndata(data_path=data_path, samples=samples, key_added='Sample')
filtered_feature_bc_matrix.h5
/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
Loading the h5 file
Sample loaded: SAN1
filtered_feature_bc_matrix.h5
Loading the h5 file
Sample loaded: SAN2
/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
/n/data1/hms/neurobio/fishell/vallaris/.conda/envs/piaso_env/lib/python3.10/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
[30]:
adata_list
[30]:
[AnnData object with n_obs × n_vars = 3545 × 36601
obs: 'Sample'
var: 'gene_ids', 'feature_types', 'genome', 'interval',
AnnData object with n_obs × n_vars = 4360 × 36601
obs: 'Sample'
var: 'gene_ids', 'feature_types', 'genome', 'interval']
[33]:
adata=ad.concat(adata_list, join='outer',index_unique="-")
adata
[33]:
AnnData object with n_obs × n_vars = 7905 × 36601
obs: 'Sample'
[34]:
adata.X.data
[34]:
array([ 1., 1., 1., ..., 43., 18., 56.],
shape=(12509350,), dtype=float32)
Create the ‘raw’ counts layer based on the counts data from the adata object.
[35]:
adata.layers['raw'] = adata.X.copy()
adata
[35]:
AnnData object with n_obs × n_vars = 7905 × 36601
obs: 'Sample'
layers: 'raw'
[36]:
adata.var
[36]:
| MIR1302-2HG |
|---|
| FAM138A |
| OR4F5 |
| AL627309.1 |
| AL627309.3 |
| ... |
| AC141272.1 |
| AC023491.2 |
| AC007325.1 |
| AC007325.4 |
| AC007325.2 |
36601 rows × 0 columns
Quality Control#
Filter cells based on gene counts#
Filter out cells with fewer than 200 detected genes.
[37]:
sc.pp.filter_cells(adata, min_genes=200)
Calculate cell quality metrics#
We identify mitochondrial and ribosomal protein genes and compute their proportion in each cell’s total read count. A high proportion of these reads often indicates low-quality cells. Note: In human data, the gene names have uppercase letters (MT-, RPS-, RPL-). For mouse data, these gene names will start with Mt-, Rps-, Rpl-
[38]:
adata.var['mt'] = adata.var_names.str.startswith('MT-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
[39]:
adata.var['ribo'] = adata.var_names.str.startswith('RPS','RPL')
sc.pp.calculate_qc_metrics(adata, qc_vars=['ribo'], percent_top=None, log1p=False, inplace=True)
/tmp/ipykernel_2699886/3834852883.py:1: FutureWarning: Allowing a non-bool 'na' in obj.str.startswith is deprecated and will raise in a future version.
adata.var['ribo'] = adata.var_names.str.startswith('RPS','RPL')
[40]:
piaso.pl.plot_features_violin(adata,
['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo'],
width_single=2.0)
Doublet prediction#
Next, we compute the scrublet score to identify and predict potential doublets.
[41]:
experiments=np.unique(adata.obs['Sample'])
adata.obs['scrublet_score']=np.repeat(0,adata.n_obs)
adata.obs['predicted_doublets']=np.repeat(False,adata.n_obs)
[42]:
import scrublet as scr
for experiment in experiments:
print(experiment)
adatai=adata[adata.obs['Sample']==experiment]
scrub = scr.Scrublet(adatai.X.todense(),random_state=10)
doublet_scores, predicted_doublets = scrub.scrub_doublets()
adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets
adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
SAN1
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.15
Detected doublet rate = 15.9%
Estimated detectable doublet fraction = 77.1%
Overall doublet rate:
Expected = 10.0%
Estimated = 20.6%
Elapsed time: 3.1 seconds
SAN2
/tmp/ipykernel_2699886/3464195482.py:10: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:
df["col"][row_indexer] = value
Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets
/tmp/ipykernel_2699886/3464195482.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets
/tmp/ipykernel_2699886/3464195482.py:12: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:
df["col"][row_indexer] = value
Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
/tmp/ipykernel_2699886/3464195482.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
/tmp/ipykernel_2699886/3464195482.py:12: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas. Value '[0.14470842 0.05152225 0.06938776 ... 0.09004739 0.03024575 0.04338395]' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
adata.obs['scrublet_score'][adatai.obs_names]=doublet_scores
Preprocessing...
Simulating doublets...
Embedding transcriptomes using PCA...
Calculating doublet scores...
Automatically set threshold at doublet score = 0.14
Detected doublet rate = 17.1%
Estimated detectable doublet fraction = 79.9%
Overall doublet rate:
Expected = 10.0%
Estimated = 21.5%
Elapsed time: 5.5 seconds
/tmp/ipykernel_2699886/3464195482.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
adata.obs['predicted_doublets'][adatai.obs_names]=predicted_doublets
[43]:
piaso.pl.plot_features_violin(adata,
['scrublet_score'],
groupby='Sample',
width_single=2)
log1p Normalization#
[44]:
%%time
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.layers['log1p']=adata.X.copy()
CPU times: user 198 ms, sys: 26.5 ms, total: 225 ms
Wall time: 124 ms
INFOG Normalization#
We use PIASO’s INFOG to normalize the data and identify a highly variable set of genes.
[45]:
%%time
piaso.tl.infog(adata,
copy=False,
inplace=False,
n_top_genes=3000,
key_added='infog',
key_added_highly_variable_gene='highly_variable',
layer='raw')
The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 590 ms, sys: 199 ms, total: 789 ms
Wall time: 792 ms
Dimensionality reduction#
[46]:
%%time
piaso.tl.runSVD(adata,
use_highly_variable=True,
n_components=30,
random_state=10,
key_added='X_svd',
layer='infog')
CPU times: user 4.07 s, sys: 6.2 ms, total: 4.07 s
Wall time: 493 ms
[47]:
%%time
sc.pp.neighbors(adata,
use_rep='X_svd',
n_neighbors=15,
random_state=10,
knn=True,
method="umap")
sc.tl.umap(adata)
CPU times: user 27.5 s, sys: 92.7 ms, total: 27.6 s
Wall time: 24.2 s
[48]:
sc.pl.umap(adata,
color=['n_genes_by_counts', 'total_counts','pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
cmap='Spectral_r',
palette=piaso.pl.color.d_color1,
ncols=2,
size=10,
frameon=False)
Save the data#
[13]:
save_dir+'/'+prefix+'_raw_QC.h5ad'
[13]:
'/n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_raw_QC.h5ad'
[50]:
adata.write(save_dir+'/'+prefix+'_raw_QC.h5ad')
[14]:
# adata = sc.read(save_dir+'/'+prefix+'_raw_QC.h5ad')
Filtering cells#
Outlier cells of these thresholds are likely low quality cells, so we filter them out before further downstream analysis.
[15]:
adata=adata[adata.obs['n_genes_by_counts']>500].copy()
[16]:
adata=adata[adata.obs['n_genes_by_counts']<10000].copy()
Remove doublets based on a scrublet score threshold.
[17]:
adata = adata[adata.obs['scrublet_score']<0.3].copy()
adata
[17]:
AnnData object with n_obs × n_vars = 7218 × 36601
obs: 'Sample', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'scrublet_score', 'predicted_doublets'
var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'ribo', 'infog_var', 'highly_variable'
uns: 'Sample_colors', 'log1p', 'neighbors', 'umap'
obsm: 'X_svd', 'X_umap'
layers: 'infog', 'log1p', 'raw'
obsp: 'connectivities', 'distances'
[18]:
sc.pl.umap(adata,
color=['n_genes_by_counts', 'total_counts','pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
cmap='Spectral_r',
palette=piaso.pl.color.d_color1,
ncols=2,
size=10,
frameon=False)
[19]:
piaso.pl.plot_embeddings_split(adata,
color='n_genes_by_counts',
cmap=piaso.pl.color.c_color1,
palette=piaso.pl.color.d_color1,
layer=None,
splitby='Sample',
size=25,
frameon=False,)
Annotation#
We used the precomputed Human PBMC marker gene sets from the AllenHumanImmuneHealthAtlas_L2 study in PIASOmarkerDB. The original dataset comes from https://www.nature.com/articles/s41586-025-09686-5.
[20]:
marker_gene_df = pd.read_csv("/n/scratch/users/v/vas744/Data/Public/10x_Human_PBMC/PIASOmarkerDB_AllenHumanImmuneHealthAtlas_L2_251219.csv")
marker_gene_df
[20]:
| Cell_Type | Gene | Specificity_Score | Study_Publication | Species | Tissue | Condition | |
|---|---|---|---|---|---|---|---|
| 0 | Platelet | GP9 | 16.455748 | AllenHumanImmuneHealthAtlas_L2 | Human | blood | normal |
| 1 | Erythrocyte | ALAS2 | 16.422475 | AllenHumanImmuneHealthAtlas_L2 | Human | blood | normal |
| 2 | Platelet | CMTM5 | 16.400534 | AllenHumanImmuneHealthAtlas_L2 | Human | blood | normal |
| 3 | Platelet | TMEM40 | 16.373334 | AllenHumanImmuneHealthAtlas_L2 | Human | blood | normal |
| 4 | Erythrocyte | AHSP | 16.338337 | AllenHumanImmuneHealthAtlas_L2 | Human | blood | normal |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1445 | Memory CD4 T cell | FXYD7 | 1.783520 | AllenHumanImmuneHealthAtlas_L2 | Human | blood | normal |
| 1446 | Memory CD4 T cell | CRIP1 | 1.779265 | AllenHumanImmuneHealthAtlas_L2 | Human | blood | normal |
| 1447 | Memory CD4 T cell | CCDC167 | 1.775660 | AllenHumanImmuneHealthAtlas_L2 | Human | blood | normal |
| 1448 | Memory CD4 T cell | CD6 | 1.775110 | AllenHumanImmuneHealthAtlas_L2 | Human | blood | normal |
| 1449 | Memory CD4 T cell | TMEM238 | 1.769377 | AllenHumanImmuneHealthAtlas_L2 | Human | blood | normal |
1450 rows × 7 columns
Preprocessing cell type-specific marker gene sets#
PIASO’s predictCellTypeByMarker expects the marker_gene_list in DataFrame or dict format. So here, we will create a dictionary with each cell type and their respective marker gene lists.
[21]:
marker_gene_dict = {}
cell_types = np.unique(marker_gene_df['Cell_Type'])
for cell_type in cell_types:
marker_gene_dict[cell_type] = list(marker_gene_df[marker_gene_df['Cell_Type']==cell_type]['Gene'])
Reordering cell type categories
1. INFOG Normalization#
[22]:
%%time
piaso.tl.infog(adata,
copy=False,
inplace=False,
n_top_genes=3000,
key_added='infog',
key_added_highly_variable_gene='highly_variable',
layer='raw')
The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 747 ms, sys: 28.8 s, total: 29.5 s
Wall time: 29.6 s
Dimensionality reduction#
[23]:
%%time
piaso.tl.runSVD(adata,
use_highly_variable=True,
n_components=30,
random_state=10,
key_added='X_svd',
layer='infog')
CPU times: user 5.36 s, sys: 134 ms, total: 5.5 s
Wall time: 894 ms
[24]:
%%time
sc.pp.neighbors(adata,
use_rep='X_svd',
n_neighbors=15,
random_state=10,
knn=True,
method="umap")
sc.tl.umap(adata)
CPU times: user 39.3 s, sys: 308 ms, total: 39.6 s
Wall time: 36.3 s
Predict cell types based on marker gene sets#
Here, we use the use_score = False to make the predictions based on p-values, but you can also use the gene set scores with use_score = True.
[25]:
%%time
piaso.tl.predictCellTypeByMarker(adata,
marker_gene_set=marker_gene_dict,
score_method='piaso',
use_rep='X_svd',
use_score=False,
smooth_prediction=True,
inplace=True)
Calculating gene set scores using piaso method...
Scoring gene sets: 100%|██████████| 29/29 [00:16<00:00, 1.75set/s]
Predicting cell types based on marker gene p-values...
Smoothing cell type predictions...
Smoothing cell type predictions from 'CellTypes_predicted_raw' using 7-nearest neighbors
Smoothed predictions stored in adata.obs['CellTypes_predicted_smoothed']
Confidence scores stored in adata.obs['CellTypes_predicted_smoothed_confidence']
Modified 1632 cell labels (22.61% of total)
Cell type prediction completed. Results saved to:
- adata.obs['CellTypes_predicted']: predicted cell types
- adata.obsm['CellTypes_predicted_score']: full score matrix
- adata.obsm['CellTypes_predicted_nlog10pvals']: full -log10(p-value) matrix
- adata.obs['CellTypes_predicted_nlog10pvals']: maximum -log10(p-values)
- adata.obs['CellTypes_predicted_raw']: original unsmoothed predictions
- adata.obs['CellTypes_predicted_confidence_smoothed']: smoothing confidence scores
CPU times: user 6.85 s, sys: 146 ms, total: 6.99 s
Wall time: 24.6 s
[26]:
category_dict = {"Human_PBMC": ['ASDC', 'CD14 monocyte', 'CD16 monocyte', 'CD56bright NK cell',
'CD56dim NK cell', 'CD8aa', 'DN T cell', 'Effector B cell',
'Erythrocyte', 'ILC', 'Intermediate monocyte', 'MAIT', 'Memory B cell',
'Memory CD4 T cell', 'Memory CD8 T cell', 'Naive B cell',
'Naive CD4 T cell', 'Naive CD8 T cell', 'Plasma cell', 'Platelet',
'Progenitor cell','Proliferating NK cell', 'Proliferating T cell',
'Transitional B cell', 'Treg', 'cDC1','cDC2', 'gdT','pDC']}
[27]:
cell_types
[27]:
array(['ASDC', 'CD14 monocyte', 'CD16 monocyte', 'CD56bright NK cell',
'CD56dim NK cell', 'CD8aa', 'DN T cell', 'Effector B cell',
'Erythrocyte', 'ILC', 'Intermediate monocyte', 'MAIT',
'Memory B cell', 'Memory CD4 T cell', 'Memory CD8 T cell',
'Naive B cell', 'Naive CD4 T cell', 'Naive CD8 T cell',
'Plasma cell', 'Platelet', 'Progenitor cell',
'Proliferating NK cell', 'Proliferating T cell',
'Transitional B cell', 'Treg', 'cDC1', 'cDC2', 'gdT', 'pDC'],
dtype=object)
[28]:
np.unique(adata.obs['CellTypes_predicted'])
[28]:
array(['CD14 monocyte', 'CD16 monocyte', 'CD56bright NK cell',
'CD56dim NK cell', 'CD8aa', 'DN T cell', 'Effector B cell', 'ILC',
'Intermediate monocyte', 'MAIT', 'Memory B cell',
'Memory CD4 T cell', 'Memory CD8 T cell', 'Naive B cell',
'Naive CD4 T cell', 'Naive CD8 T cell', 'Plasma cell', 'Platelet',
'Progenitor cell', 'Proliferating NK cell', 'Transitional B cell',
'Treg', 'cDC1', 'cDC2', 'gdT', 'pDC'], dtype=object)
[31]:
adata.obs['CellTypes_predicted'] = adata.obs['CellTypes_predicted'].astype('category')
[32]:
adata.obs['CellTypes_predicted'] = adata.obs['CellTypes_predicted'].cat.add_categories(
[cat for cat in category_dict["Human_PBMC"]
if cat not in adata.obs['CellTypes_predicted'].cat.categories]
)
adata.obs['CellTypes_predicted'] = adata.obs['CellTypes_predicted'].cat.reorder_categories(
category_dict["Human_PBMC"]
)
adata
[32]:
AnnData object with n_obs × n_vars = 7218 × 36601
obs: 'Sample', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'scrublet_score', 'predicted_doublets', 'CellTypes_predicted', 'CellTypes_predicted_nlog10pvals', 'CellTypes_predicted_raw', 'CellTypes_predicted_smoothed', 'CellTypes_predicted_smoothed_confidence', 'CellTypes_predicted_confidence_smoothed'
var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'ribo', 'infog_var', 'highly_variable'
uns: 'Sample_colors', 'log1p', 'neighbors', 'umap'
obsm: 'X_svd', 'X_umap', 'CellTypes_predicted_score', 'CellTypes_predicted_nlog10pvals'
layers: 'infog', 'log1p', 'raw'
obsp: 'connectivities', 'distances'
[33]:
sc.pl.umap(adata,
color=['CellTypes_predicted'],
palette=piaso.pl.color.d_color20,
legend_fontsize=8,
legend_fontoutline=1,
# legend_loc='on data',
ncols=1,
size=10,
frameon=False)
Clustering#
[34]:
%%time
sc.tl.leiden(adata,resolution=2.5,key_added='Leiden')
<timed eval>:1: FutureWarning: In the future, the default backend for leiden will be igraph instead of leidenalg.
To achieve the future defaults please pass: flavor="igraph" and n_iterations=2. directed must also be False to work with igraph's implementation.
CPU times: user 4.01 s, sys: 16.3 ms, total: 4.02 s
Wall time: 4.05 s
[35]:
sc.pl.umap(adata,
color=['CellTypes_predicted','Leiden'],
palette=piaso.pl.color.d_color20,
legend_fontsize=12,
legend_fontoutline=2,
# legend_loc='on data',
ncols=1,
size=10,
frameon=False)
Identify marker genes with COSG#
[36]:
%%time
n_gene=30
cosg.cosg(adata,
key_added='cosg',
use_raw=False,
layer='infog',
mu=100,
expressed_pct=0.1,
remove_lowly_expressed=True,
n_genes_user=n_gene,
groupby='Leiden')
CPU times: user 915 ms, sys: 540 ms, total: 1.46 s
Wall time: 1.47 s
[37]:
sc.tl.dendrogram(adata,groupby='Leiden',use_rep='X_svd')
df_tmp=pd.DataFrame(adata.uns['cosg']['names'][:3,]).T
df_tmp=df_tmp.reindex(adata.uns['dendrogram_'+'Leiden']['categories_ordered'])
marker_genes_list={idx: list(row.values) for idx, row in df_tmp.iterrows()}
marker_genes_list = {k: v for k, v in marker_genes_list.items() if not any(isinstance(x, float) for x in v)}
sc.pl.dotplot(adata,
marker_genes_list,
groupby='Leiden',
layer='infog',
dendrogram=True,
swap_axes=False,
standard_scale='var',
cmap='Spectral_r')
[38]:
marker_gene=pd.DataFrame(adata.uns['cosg']['names'])
Removal of low-quality cells#
Here, we identify and remove some low quality clusters based on cell quality metrics and marker gene sets.
[39]:
piaso.pl.plot_features_violin(adata,
['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
groupby='Leiden')
Some of the cells seems to have higher percentage of mitochondrial and ribosomal gene expression, indicating that they are low quality.
[40]:
adata = adata[adata.obs['pct_counts_mt']<20].copy()
[41]:
adata = adata[adata.obs['pct_counts_ribo']<5].copy()
[42]:
sc.pl.umap(adata,
color=['n_genes_by_counts', 'total_counts','pct_counts_mt','pct_counts_ribo', 'scrublet_score'],
cmap='Spectral_r',
palette=piaso.pl.color.d_color1,
ncols=2,
size=10,
frameon=False)
[61]:
sc.pl.umap(adata,
color=['Leiden'],
groups=['29'],
palette=piaso.pl.color.d_color20,
legend_fontsize=12,
legend_fontoutline=2,
# legend_loc='on data',
ncols=1,
size=10,
frameon=False)
Checking potential low quality clusters#
[65]:
cluster_check = '20'
marker_gene[cluster_check].values
[65]:
array(['RPL19', 'RPL32', 'RPL13A', 'RPL28', 'RPS25', 'RPS8', 'RPLP2',
'RPL14', 'RPS12', 'RPS27', 'MKI67', 'RPL22', 'RPL23A', 'RPS15A',
'RPL35', 'RPL13', 'RPS18', 'RPL3', 'RPL17', 'RPS3', 'EEF1A1',
'RPS20', 'RPL12', 'RPS3A', 'RPL18A', 'RPLP0', 'RPS23', 'RPL39',
'RPS6', 'RPL31'], dtype=object)
[63]:
sc.pl.umap(adata,
color=['IGHA1', 'SDC1', 'IGLV2-11', 'JCHAIN', 'IGHG4', 'TNFRSF17',
'IGLV2-14', 'GLDC', 'IGHV3-15', 'TXNDC5', 'IGHG2', 'MIXL1', 'MZB1'],
cmap=piaso.pl.color.c_color1,
palette=piaso.pl.color.d_color1,
ncols=3,
size=10,
frameon=False)
[98]:
sc.pl.dotplot(adata,
marker_gene[cluster_check].values[:30],
groupby='Leiden',
dendrogram=False,
swap_axes=True,
standard_scale='var',
cmap='Spectral_r')
[64]:
sc.pl.umap(adata,
color=['Leiden'],
groups=['17','20','27','28'],
palette=piaso.pl.color.d_color20,
legend_fontsize=12,
legend_fontoutline=2,
# legend_loc='on data',
ncols=1,
size=10,
frameon=False)
[45]:
piaso.pl.plot_embeddings_split(adata,
color='CellTypes_predicted',
layer=None,
splitby='Sample',
size=25,
frameon=False,)
[66]:
adata=adata[~adata.obs['Leiden'].isin(['17','20','27','28'])].copy()
adata
[66]:
AnnData object with n_obs × n_vars = 6591 × 36601
obs: 'Sample', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'scrublet_score', 'predicted_doublets', 'CellTypes_predicted', 'CellTypes_predicted_nlog10pvals', 'CellTypes_predicted_raw', 'CellTypes_predicted_smoothed', 'CellTypes_predicted_smoothed_confidence', 'CellTypes_predicted_confidence_smoothed', 'Leiden'
var: 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'ribo', 'infog_var', 'highly_variable'
uns: 'Sample_colors', 'log1p', 'neighbors', 'umap', 'CellTypes_predicted_colors', 'Leiden', 'Leiden_colors', 'cosg', 'dendrogram_Leiden'
obsm: 'X_svd', 'X_umap', 'CellTypes_predicted_score', 'CellTypes_predicted_nlog10pvals'
layers: 'infog', 'log1p', 'raw'
obsp: 'connectivities', 'distances'
2. INFOG Normalization#
It is important to re-run INFOG normalization and dimensionality reduction after filtering out cells, to accurately capture the current cell population.
[67]:
%%time
piaso.tl.infog(adata,
copy=False,
inplace=False,
n_top_genes=3000,
key_added='infog',
key_added_highly_variable_gene='highly_variable',
layer='raw')
The normalized data is saved as `infog` in `adata.layers`.
The highly variable genes are saved as `highly_variable` in `adata.var`.
Finished INFOG normalization.
CPU times: user 685 ms, sys: 4.9 s, total: 5.59 s
Wall time: 5.61 s
Dimensionality reduction#
[68]:
%%time
piaso.tl.runSVD(adata,
use_highly_variable=True,
n_components=30,
random_state=10,
key_added='X_svd',
layer='infog')
CPU times: user 4.72 s, sys: 0 ns, total: 4.72 s
Wall time: 656 ms
[69]:
%%time
sc.pp.neighbors(adata,
use_rep='X_svd',
n_neighbors=15,
random_state=10,
knn=True,
method="umap")
sc.tl.umap(adata)
CPU times: user 25.4 s, sys: 11.8 ms, total: 25.4 s
Wall time: 21 s
Predict cell types based on marker gene sets#
[70]:
%%time
piaso.tl.predictCellTypeByMarker(adata,
marker_gene_set=marker_gene_dict,
score_method='piaso',
use_rep='X_svd',
use_score=False,
key_added = 'CellTypes_predicted',
smooth_prediction=True,
inplace=True)
Calculating gene set scores using piaso method...
Scoring gene sets: 100%|██████████| 29/29 [00:17<00:00, 1.69set/s]
Predicting cell types based on marker gene p-values...
Smoothing cell type predictions...
Smoothing cell type predictions from 'CellTypes_predicted_raw' using 7-nearest neighbors
Smoothed predictions stored in adata.obs['CellTypes_predicted_smoothed']
Confidence scores stored in adata.obs['CellTypes_predicted_smoothed_confidence']
Modified 1402 cell labels (21.27% of total)
Cell type prediction completed. Results saved to:
- adata.obs['CellTypes_predicted']: predicted cell types
- adata.obsm['CellTypes_predicted_score']: full score matrix
- adata.obsm['CellTypes_predicted_nlog10pvals']: full -log10(p-value) matrix
- adata.obs['CellTypes_predicted_nlog10pvals']: maximum -log10(p-values)
- adata.obs['CellTypes_predicted_raw']: original unsmoothed predictions
- adata.obs['CellTypes_predicted_confidence_smoothed']: smoothing confidence scores
CPU times: user 10.3 s, sys: 103 ms, total: 10.4 s
Wall time: 24.6 s
[71]:
sc.pl.umap(adata,
color=['CellTypes_predicted'],
groups=['CD8aa'],
palette=piaso.pl.color.d_color20,
legend_fontsize=8,
legend_fontoutline=1,
# legend_loc='on data',
ncols=1,
size=10,
frameon=False)
[72]:
sc.pl.umap(adata,
color=['CellTypes_predicted'],
# groups=['CD8aa'],
palette=piaso.pl.color.d_color20,
legend_fontsize=8,
legend_fontoutline=1,
# legend_loc='on data',
ncols=1,
size=10,
frameon=False)
Identify marker genes with COSG#
[73]:
%%time
n_gene=30
cosg.cosg(adata,
key_added='cosg',
use_raw=False,
layer='infog',
mu=100,
expressed_pct=0.1,
remove_lowly_expressed=True,
n_genes_user=n_gene,
groupby='CellTypes_predicted')
CPU times: user 723 ms, sys: 98.9 ms, total: 822 ms
Wall time: 825 ms
[74]:
marker_gene=pd.DataFrame(adata.uns['cosg']['names'])
marker_gene
[74]:
| ASDC | CD8aa | CD14 monocyte | CD16 monocyte | CD56bright NK cell | CD56dim NK cell | DN T cell | Effector B cell | Erythrocyte | ILC | ... | Naive CD4 T cell | Naive CD8 T cell | Plasma cell | Proliferating NK cell | Transitional B cell | Treg | cDC1 | cDC2 | gdT | pDC | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IGLV5-52 | AC005944.1 | DYSF | AC104809.2 | XCL1 | BNC2 | AC004490.1 | SOX5 | LINC01241 | LHX6 | ... | ANKRD55 | AL590006.1 | IGHA1 | AL096677.2 | RAPGEF5 | FOXP3 | PLS3-AS1 | FCER1A | TRDC | LINC01478 |
| 1 | AC012313.2 | AL031590.1 | CRISPLD2 | CDKN1C | SPTSSB | MYOM2 | AP001360.2 | PTCHD1-AS | AL139023.1 | AL353689.3 | ... | EDAR | NRCAM | IGLV7-43 | IGKV1-9 | TTC28 | LINC02694 | STRA6 | CLEC10A | TRGC1 | AC097375.1 |
| 2 | EPB41L4B | TNIP3 | VCAN | GPR20 | CABLES1 | LINGO2 | AL162427.1 | FCRL5 | AL021937.4 | AC005520.3 | ... | FHIT | NELL2 | JCHAIN | IGHV3-66 | MYBL2 | HACD1 | AC090907.1 | ENHO | TRGV9 | CUX2 |
| 3 | AC145207.1 | ZSCAN26 | PADI4 | LYPD2 | TNFRSF11A | SPON2 | PGAM4 | CLECL1 | AC105213.1 | AL109955.2 | ... | PLCL1 | LINC02446 | TNFRSF17 | AC027279.2 | GPM6A | AC013652.1 | DCLK3 | CD1E | KLRC1 | LRRC26 |
| 4 | SLC24A5 | PDGFB | NRG1 | PPP1R17 | PPP1R9A | MIR181A2HG | ALDH3B2 | AL592429.2 | CCL26 | AL662844.3 | ... | KANK1 | GLIS3 | IGHV3-15 | DUSP13 | AL031595.3 | FANK1 | AC108169.1 | CD1C | CCL5 | AC023590.1 |
| 5 | LINC00664 | AL121839.2 | FCAR | AC020651.2 | ZMAT4 | GRIK4 | AC007029.1 | GPM6A | AC074011.1 | AC007370.1 | ... | TSHZ2 | CA6 | TXNDC5 | TCF23 | HS3ST1 | RTKN2 | ANKRD65 | ST18 | A2M | EPHB1 |
| 6 | MOBP | AMIGO1 | LUCAT1 | CROCC2 | SERPINE1 | SORBS2 | AC010255.1 | TNFRSF13B | AC026765.3 | AC007998.4 | ... | MAL | CD8B | IGKV3-15 | TRHR | P2RX5 | DUSP4 | AL118508.1 | NCKAP5 | PZP | LILRA4 |
| 7 | AL138999.2 | ZSCAN2 | AL163541.1 | VMO1 | KLRC2 | FGFBP2 | LINC01865 | NIPSNAP3B | C12orf56 | LRRC66 | ... | LEF1 | FCGBP | MIXL1 | FAM215A | AC060234.3 | CTLA4 | AC026391.1 | PKIB | GNLY | PTPRS |
| 8 | ZSCAN23 | AL359706.1 | CXCL8 | CKB | AC002429.2 | CD160 | AL121757.2 | MS4A1 | SPATA41 | AC084082.1 | ... | ITGA6 | LRRN3 | IGLC2 | AL133297.1 | JUP | STAM | PRSS48 | MRC1 | IL12RB2 | PTCRA |
| 9 | AL445123.2 | STYK1 | MCEMP1 | LINC02345 | RAMP1 | LINC01505 | TEK | ZBTB20-AS5 | CMTM2 | PCAT6 | ... | MLLT3 | PDE3B | IGHV3-30 | LINC01005 | BCL7A | IL2RA | ALOX15 | LRP1B | S100B | LAMP5 |
| 10 | AC002347.1 | ZNF681 | LIN7A | LINC02085 | IGFBP4 | RHOBTB3 | AC116407.1 | EML6 | AP001005.3 | LINC02815 | ... | FAM153A | THEMIS | MZB1 | AC006960.3 | EFCAB11 | CDHR3 | CYP24A1 | CLIC2 | LINC00612 | KCNA5 |
| 11 | STRA8 | RNFT2 | CSF3R | HES4 | STYK1 | SH2D1B | AC020907.1 | ANKRD33B | AC006116.8 | NOL4L-DT | ... | LINC01891 | APBA2 | SDC1 | AC022517.1 | CORO2B | ZC2HC1A | ENPP1 | IL1R2 | KLRG1 | CLEC4C |
| 12 | AP001094.2 | METTL18 | NLRP12 | FMNL2 | NCAM1 | RNF165 | AC027130.1 | IL7 | FXYD6 | AC116158.2 | ... | PRKCA | SNTG2 | GPRC5D | PKMYT1 | AL365272.1 | AC011997.1 | DNAAF3 | HTR7 | TGFBR3 | AL513493.1 |
| 13 | AC112512.1 | TMEM116 | WLS | CEACAM3 | IL12RB2 | LINC00298 | AC090125.1 | OSBPL10 | TNS1 | TBC1D8-AS1 | ... | AK5 | CACHD1 | GLDC | IGLC1 | PAOX | IKZF2 | CFB | DTNA | DUSP2 | SCT |
| 14 | NPAS1 | BCAT2 | PRICKLE1 | SMIM25 | KLRC3 | PRF1 | AC018437.3 | CD82 | PPFIA3 | OTX2-AS1 | ... | MAN1C1 | LEF1 | IGHV3-33 | TERT | TCL1A | MCF2L2 | ARHGEF26-AS1 | FLT3 | CYB561 | FAM160A1-DT |
| 15 | GATA4 | VIPR1-AS1 | SPOCK1 | NEURL1 | ROR1 | KLRF1 | GFRA3 | AL355076.2 | AC005332.3 | NGB | ... | FAM153B | PSMA1 | PLAAT2 | PAGE2B | AL139020.1 | CCDC141 | AC107918.5 | ADGRG6 | SH3BGRL3 | LINC01724 |
| 16 | C20orf144 | ZNF268 | GLT1D1 | CLEC4F | EFHC2 | IGFBP7 | LINC01339 | ANGPTL1 | AL589935.1 | CCDC40 | ... | SLC22A23 | SH3RF3 | IGHV3-23 | IGLV2-11 | MCF2L | TTN | LINC01845 | SH3BP4 | AC010175.1 | AC018643.1 |
| 17 | EXOC3L2 | LYSMD4 | CYP1B1 | PAPSS2 | XCL2 | NMUR1 | AC009435.1 | FCRL2 | AC008649.2 | AL161757.4 | ... | TRABD2A | LEF1-AS1 | IGKC | CDC45 | TCL6 | F5 | AL122008.4 | CST3 | A2M-AS1 | LINC01226 |
| 18 | VWC2 | KLRC4 | ACSL1 | TCF7L2 | CCNJL | SLCO4C1 | AC015631.1 | RALGPS2 | C17orf64 | AF233439.1 | ... | IL6ST | AC002460.2 | DERL3 | IGHV5-51 | IGHD | TOX | ADORA3 | SLAMF8 | KLRD1 | AC011893.1 |
| 19 | CLEC18A | TNFRSF9 | VCAN-AS1 | C1QA | B4GALT6 | LDB2 | CA10 | FCRLA | AC011092.3 | MIA | ... | EPHX2 | EPHA1-AS1 | ITGA8 | AC100827.4 | CR2 | IATPR | CTSV | DANT2 | RAB9A | AR |
| 20 | LINC02033 | STK36 | LINC00937 | KNDC1 | TNFRSF18 | PRSS23 | AL139042.1 | GEN1 | AC069549.1 | STXBP5L | ... | AC139720.1 | OXNAD1 | ZNF215 | IGHGP | PPAN | DCP1B | CERS6-AS1 | CYP2S1 | NKG7 | PHEX |
| 21 | AC025263.1 | LINC01118 | PROK2 | ICAM4 | GNLY | AKR1C3 | AC092844.1 | ACP5 | LRRC42 | HCRTR1 | ... | CAMK4 | CCR7 | AC136431.1 | AC012676.3 | CD79B | AC093865.1 | CLEC9A | NEGR1 | C1orf21 | COL26A1 |
| 22 | AC091180.2 | C21orf62-AS1 | TREM1 | SFTPD | KLRC1 | PDGFD | IMPG1 | CD79B | SLC29A1 | AC096570.1 | ... | CCR7 | SCML1 | IGF1 | ANLN | DPF3 | CYTOR | AC108449.2 | HLA-DRB1 | GZMH | LINC00865 |
| 23 | NRXN2 | KLHL25 | EREG | LST1 | ATP8B4 | CHST2 | C18orf63 | CD79A | TPRN | SRR | ... | IGF1R | CD8A | APOBEC3B | AC003043.1 | KMO | DUSP16 | BBOX1 | LHFPL6 | GZMA | NRP1 |
| 24 | SIGLEC15 | SLC35C1 | AC020916.1 | FCGR3A | NCALD | IL2RB | AURKC | EBF1 | ZBED5-AS1 | AC091057.6 | ... | NDFIP1 | PRKAR1B | IGLV2-14 | AL359091.5 | IGHM | PHF19 | AC093158.1 | ITGB5 | GZMM | CYYR1 |
| 25 | AL138694.1 | DHRS13 | PLAUR | PDE8B | HOPX | NCR1 | GPR176 | DENND5B | LINC01010 | AL359504.2 | ... | AL109930.1 | MAML2 | AC009570.2 | MELK | CD22 | NIBAN1 | AC092661.1 | HLA-DPB1 | IL32 | ARPP21 |
| 26 | AC008764.6 | MED12 | CD36 | IFITM3 | CD7 | GFOD1 | AC087564.1 | TLR9 | BRI3BP | HTR2C | ... | KLHL29 | LRRC7 | DCC | GINS2 | PCDH9 | COX10-AS1 | LDLRAD2 | FBLN2 | GZMK | FAM160A1 |
| 27 | AL031709.1 | TRPC1 | TNFAIP6 | FAM20A | IL2RB | CLIC3 | PRSS51 | ZDHHC21 | PDXP | AL035658.1 | ... | LINC01550 | NR3C2 | BMP6 | APOBEC3B-AS1 | CEP89 | LINC00426 | IAPP | HLA-DPA1 | LYAR | AC104819.3 |
| 28 | PCBP2-OT1 | TULP2 | CYP1B1-AS1 | MS4A7 | NCR1 | TTC38 | LINC00970 | TLR10 | ZNF653 | MRVI1 | ... | CHRM3-AS2 | SERINC5 | NT5DC2 | SPC25 | ZNF318 | LDLRAD4 | AC133961.1 | ZNF366 | HCST | SRPX |
| 29 | CPT1C | AC137810.1 | RBM47 | SIGLEC10 | CTSW | ADGRG1 | LINC01090 | AC090541.1 | LRRC37A | AC036176.1 | ... | TCF7 | ABLIM1 | ITM2C | AC015878.1 | FCRLA | CASK | CADM1 | EMP1 | TRGC2 | LINC00996 |
30 rows × 26 columns
We can use a dendrogram dot plot to visualize the expression of the top three marker genes of each predicted cell type.
[75]:
sc.tl.dendrogram(adata,groupby='CellTypes_predicted',use_rep='X_svd')
df_tmp=pd.DataFrame(adata.uns['cosg']['names'][:3,]).T
df_tmp=df_tmp.reindex(adata.uns['dendrogram_'+'CellTypes_predicted']['categories_ordered'])
marker_genes_list={idx: list(row.values) for idx, row in df_tmp.iterrows()}
marker_genes_list = {k: v for k, v in marker_genes_list.items() if not any(isinstance(x, float) for x in v)}
sc.pl.dotplot(adata,
marker_genes_list,
groupby='CellTypes_predicted',
layer='infog',
dendrogram=True,
swap_axes=False,
standard_scale='var',
cmap='Spectral_r')
Marker genes for CD8aa monocyte#
[78]:
cluster_check = 'CD8aa'
marker_gene[cluster_check].values
[78]:
array(['AC005944.1', 'AL031590.1', 'TNIP3', 'ZSCAN26', 'PDGFB',
'AL121839.2', 'AMIGO1', 'ZSCAN2', 'AL359706.1', 'STYK1', 'ZNF681',
'RNFT2', 'METTL18', 'TMEM116', 'BCAT2', 'VIPR1-AS1', 'ZNF268',
'LYSMD4', 'KLRC4', 'TNFRSF9', 'STK36', 'LINC01118', 'C21orf62-AS1',
'KLHL25', 'SLC35C1', 'DHRS13', 'MED12', 'TRPC1', 'TULP2',
'AC137810.1'], dtype=object)
[79]:
sc.pl.umap(adata,
color=['RNF26', 'BCKDK', 'STYK1', 'ZNF268', 'RNFT2', 'BICRA-AS1',
'ZNF510', 'AMIGO1', 'MIER3', 'AC013437.1', 'SLC24A1', 'LAGE3'],
cmap=piaso.pl.color.c_color1,
palette=piaso.pl.color.d_color1,
ncols=3,
size=10,
frameon=False)
[80]:
sc.pl.dotplot(adata,
marker_gene[cluster_check].values[:30],
groupby='CellTypes_predicted',
dendrogram=False,
swap_axes=True,
standard_scale='var',
cmap='Spectral_r')
Save the data#
[138]:
np.unique(adata.obs['CellTypes_predicted'])
[138]:
array(['ASDC', 'CD14 monocyte', 'CD16 monocyte', 'CD56bright NK cell',
'CD56dim NK cell', 'CD8aa', 'DN T cell', 'Effector B cell',
'Erythrocyte', 'ILC', 'Intermediate monocyte', 'MAIT',
'Memory B cell', 'Memory CD4 T cell', 'Memory CD8 T cell',
'Naive B cell', 'Naive CD4 T cell', 'Naive CD8 T cell',
'Plasma cell', 'Proliferating NK cell', 'Transitional B cell',
'Treg', 'cDC1', 'cDC2', 'gdT', 'pDC'], dtype=object)
[76]:
adata.write(save_dir + '/' + prefix + '_QC.h5ad')
[5]:
adata = sc.read(save_dir + '/' + prefix + '_QC.h5ad')
[77]:
marker_gene.to_csv(save_dir + '/' + prefix + '_CellType_markerGenes.csv')
[81]:
!/home/vas744/Software/gdrive files upload --parent 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0 /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_QC.h5ad
Uploading /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_QC.h5ad
File successfully uploaded
Id: 1bpWCqhadqw29Av9RsqAJqWW7kmEpQOHQ
Name: HumanPBMCs_Multiome_RNA_QC.h5ad
Mime: application/octet-stream
Size: 318.5 MB
Created: 2026-01-20 14:02:15
Modified: 2026-01-20 14:02:15
MD5: bdd76ad91aa88845d0791327f8675972
Shared: True
Parents: 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0
ViewUrl: https://drive.google.com/file/d/1bpWCqhadqw29Av9RsqAJqWW7kmEpQOHQ/view?usp=drivesdk
[82]:
!/home/vas744/Software/gdrive files upload --parent 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0 /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_raw_QC.h5ad
Uploading /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_raw_QC.h5ad
File successfully uploaded
Id: 1DaeasGmgNIBuppSkqT_jiRMDCouxEuiN
Name: HumanPBMCs_Multiome_RNA_raw_QC.h5ad
Mime: application/octet-stream
Size: 407.7 MB
Created: 2026-01-20 14:02:26
Modified: 2026-01-20 14:02:26
MD5: 97852a95d1b331836930c376a5aed3e5
Shared: True
Parents: 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0
ViewUrl: https://drive.google.com/file/d/1DaeasGmgNIBuppSkqT_jiRMDCouxEuiN/view?usp=drivesdk
[83]:
!/home/vas744/Software/gdrive files upload --parent 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0 /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_CellType_markerGenes.csv
Uploading /n/scratch/users/v/vas744/Results/single-cell/Methods/DataProcessing/PBMCMultiomeRop2023_SAN1_SAN2/HumanPBMCs_Multiome_RNA_CellType_markerGenes.csv
File successfully uploaded
Id: 1BLabNMEkVlHfbooFe55jD9d1yrcl_66K
Name: HumanPBMCs_Multiome_RNA_CellType_markerGenes.csv
Mime: text/csv
Size: 6.1 KB
Created: 2026-01-20 14:02:28
Modified: 2026-01-20 14:02:28
MD5: 1ac8ebc9c6b336588015cb2be1ed9432
Shared: True
Parents: 1QpSnNAtb516HZ4jEooTVvjIoo_d8mOh0
ViewUrl: https://drive.google.com/file/d/1BLabNMEkVlHfbooFe55jD9d1yrcl_66K/view?usp=drivesdk
[ ]: