Single-cell Enrichment Analysis#

Single-cell RNA-seq (scRNA-seq) yields many molecular readouts that are hard to interpret by themselves. One common approach to summarize this data is to compute enrichment scores using predefined gene sets based on prior biological knowledge.

In this notebook, we demonstrate how to use decoupler to infer transcription factor (TF) and pathway enrichment scores from a scRNA-seq human dataset.

The dataset includes 3k peripheral blood mononuclear cells (PBMCs) from a healthy donor. It publicly available at the 10x Genomics webpage.

Note

When your data consist of different conditions with multiple samples (patients, mice, etc.), it is recommended to use pseudobulk profiles instead

Loading Packages#

import scanpy as sc
import decoupler as dc

sc.set_figure_params(figsize=(3, 3), frameon=False)

Loading The Dataset#

adata = dc.ds.pbmc3k()
adata

AnnData object with n_obs × n_vars = 2638 × 13714
    obs: 'n_genes', 'percent_mito', 'n_counts', 'celltype', 'leiden'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    obsp: 'distances', 'connectivities'

The obtained anndata.AnnData consist of log-normalized transcript counts for ~3k cells with measurements for ~15k genes.

The cell metadata stored in anndata.AnnData.obs can be inspected.

adata.obs

	n_genes	percent_mito	n_counts	celltype	leiden
index
AAACATACAACCAC-1	781	0.030178	2419.0	CD4 T cells	Clust:0
AAACATTGAGCTAC-1	1352	0.037936	4903.0	B cells	Clust:2
AAACATTGATCAGC-1	1131	0.008897	3147.0	CD4 T cells	Clust:0
AAACCGTGCTTCCG-1	960	0.017431	2639.0	CD14+ Monocytes	Clust:1
AAACCGTGTATGCG-1	522	0.012245	980.0	NK cells	Clust:4
...	...	...	...	...	...
TTTCGAACTCTCAT-1	1155	0.021104	3459.0	CD14+ Monocytes	Clust:1
TTTCTACTGAGGCA-1	1227	0.009294	3443.0	B cells	Clust:2
TTTCTACTTCCTCG-1	622	0.021971	1684.0	B cells	Clust:2
TTTGCATGAGAGGC-1	454	0.020548	1022.0	B cells	Clust:2
TTTGCATGCCTCAC-1	724	0.008065	1984.0	CD4 T cells	Clust:0

2638 rows × 5 columns

And visualized.

sc.pl.umap(
    adata=adata,
    color="leiden",
)

../../_images/8e2a8cd1d9354d28ee61918e266401d6602ba969076d4901511a40f48362d0c2.png

Enrichment analysis#

Enrichment analysis tests whether a specific set of omics features is “overrepresented” or “coordinated” in the measured data compared to a background distribution. These sets are predefined based on existing biological knowledge and may vary depending on the omics technology used.

Enrichment analysis requires the use of an enrichment method, and several options are available. In the original manuscript of decoupler [BiMVSB+22], we benchmarked multiple methods and found that the univariate linear model (ulm) outperformed the others. Therefore, we will use ulm in this vignette.

The scores from decoupler.mt.ulm() should be interpreted such that larger magnitudes indicate greater significance, while the sign reflects whether the features in the set are overrepresented (positive) or underrepresented (negative) compared to the background.

Cell type scoring based on marker genes#

In single-cell analysis, there is no measured readout that directly informs about the cell type of each individual cell. To assign cell type labels, cells are first projected into a shared embedded space. Through clustering, groups of cells with similar transcriptional profiles are then identified, and the expression of known cell type-specific markers are used to annotate the cells. These markers, identified and annotated in previous studies, are collected and curated in various reference resources. When multiple marker genes are available, enrichment methods can be used to score marker sets within cells, providing guidance for annotating the resulting clusters.

PanglaoDB#

PanglaoDB [FGBjorkegren19] is a database that compiles and annotates cell type markers across multiple tissues. It can be conveniently accessed through the OmniPath module of decoupler [TureiKSR16].

Note

Other cell marker databases are avaliable, such as CellTypist, which can be accessed with the same function decoupler.op.resource

markers = dc.op.resource("PanglaoDB", organism="human")
markers

	genesymbol	canonical_marker	cell_type	germ_layer	human	human_sensitivity	human_specificity	mouse	mouse_sensitivity	mouse_specificity	ncbi_tax_id	organ	ubiquitiousness

0	A0A068BD53	True	B cells	Mesoderm	False	0.000000	0.000000	True	0.391473	0.002740	10090	Immune system	0.007
1	A0A068BGV1	True	Dendritic cells	Mesoderm	False	0.000000	0.000000	True	0.142857	0.000193	10090	Immune system	0.000
2	A0A068BGW6	True	B cells	Mesoderm	False	0.000000	0.000000	True	0.391473	0.002740	10090	Immune system	0.007
3	A0A087WRN7	False	Salivary mucous cells	Ectoderm	False	NaN	NaN	True	NaN	NaN	10090	Oral cavity	0.001
4	A0A0F6QAG2	True	NK cells	Mesoderm	False	0.000000	0.000000	True	0.372340	0.000129	10090	Immune system	0.002
...	...	...	...	...	...	...	...	...	...	...	...	...	...
8456	ZNF521	False	Adipocyte progenitor cells	Mesoderm	True	0.000000	0.010338	False	0.000000	0.000000	9606	Connective tissue	0.002
8457	ZNRF4	False	Germ cells	Mesoderm	True	0.284360	0.000000	True	0.000000	0.000000	9606	Reproductive	0.010
8458	ZPBP2	True	Germ cells	Mesoderm	True	0.436019	0.000000	True	0.000000	0.000000	9606	Reproductive	0.008
8459	ZRSR2P1	False	Neurons	Ectoderm	True	0.000000	0.000000	True	0.000000	0.000000	9606	Brain	0.005
8460	ZSCAN10	False	Embryonic stem cells	Epiblast	True	0.000000	0.000313	True	0.000000	0.000000	9606	Zygote	0.000

8461 rows × 13 columns

Since the dataset consists of human cells and high-quality markers are preferred, the markers can be filtered to retain only those labeled as canonical and specific to human.

We also format the network of gene sets into the decoupler format: source, target

# Filter by canonical_marker and human
markers = markers[
    markers["human"].astype(bool)
    & markers["canonical_marker"].astype(bool)
    & (markers["human_sensitivity"].astype(float) > 0.5)
]

# Remove duplicated entries
markers = markers[~markers.duplicated(["cell_type", "genesymbol"])]

# Format
markers = markers.rename(columns={"cell_type": "source", "genesymbol": "target"})
markers = markers[["source", "target"]]
markers

	source	target

50	Pulmonary alveolar type II cells	ABCA3
122	Enterocytes	ACSL5
133	Smooth muscle cells	ACTA2
144	Myoepithelial cells	ACTG2
145	Smooth muscle cells	ACTG2
...	...	...
8341	Endothelial cells	VWF
8350	Luminal epithelial cells	WFDC2
8351	Ductal cells	WFDC2
8380	Podocytes	WT1
8447	Ependymal cells	ZMYND10

691 rows × 2 columns

In this tutorial, these markers will be used, but any gene collection could be applied, including custom gene sets.

Scoring#

We can easily compute cell type enrichment scores by running the ulm method.

dc.mt.ulm(data=adata, net=markers, tmin=3)

The obtained scores (score_ulm) and adjusted p-values (padj_ulm) were stored in the adata.obsm keys.

adata.obsm["score_ulm"]

	Acinar cells	Adipocytes	Alpha cells	B cells	B cells naive	Dendritic cells	Ductal cells	Endothelial cells	Enterocytes	Ependymal cells	...	Pancreatic stellate cells	Pericytes	Plasma cells	Plasmacytoid dendritic cells	Platelets	Podocytes	Proximal tubule cells	Pulmonary alveolar type II cells	Smooth muscle cells	T cells
AAACATACAACCAC-1	-0.574614	0.880015	1.501143	4.869608	3.822776	7.993952	-0.406252	-0.908959	-0.454221	-0.406252	...	-0.351811	1.501143	3.070459	0.325694	4.707824	-0.574614	-0.574614	-0.497593	-0.454221	13.437334
AAACATTGAGCTAC-1	-0.752534	-0.266765	0.907750	10.639314	12.676054	14.592025	2.193505	0.266856	0.217661	-0.532039	...	-0.460741	2.193505	6.653792	1.056778	3.699963	-0.752534	-0.752534	-0.651663	-0.594861	3.827107
AAACATTGATCAGC-1	-0.717204	0.775482	1.195078	3.379488	6.287648	7.469071	-0.507061	-0.018701	-0.566934	-0.507061	...	-0.439111	2.715337	1.806525	-0.318529	6.199206	-0.717204	0.042273	-0.621069	-0.566934	15.211887
AAACCGTGCTTCCG-1	0.614963	2.143531	0.671311	8.120865	4.465337	26.901402	1.338112	0.568580	-0.523795	0.671311	...	-0.405698	0.671311	4.233656	0.420052	7.887647	0.143440	-0.662630	-0.573810	-0.523795	5.473293
AAACCGTGTATGCG-1	-0.498060	-0.556890	1.495823	2.115894	0.170480	15.537294	-0.352129	0.039016	-0.393707	-0.352129	...	-0.304941	-0.352129	0.703835	3.160296	6.316497	-0.498060	-0.498060	-0.431300	-0.393707	3.873352
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
TTTCGAACTCTCAT-1	1.131347	1.491715	1.096136	7.689370	3.972790	26.218950	2.529991	1.203758	-0.569493	-0.509350	...	-0.441093	2.702240	4.672744	1.096436	5.488147	-0.720441	-0.720441	-0.623872	-0.569493	5.171710
TTTCTACTGAGGCA-1	-0.745205	-0.833229	1.081317	9.558244	5.570785	12.581891	2.096218	-0.270738	-0.589067	-0.526857	...	-0.456254	2.984098	13.259767	2.046399	5.558045	-0.745205	-0.745205	-0.645316	-0.589067	2.361846
TTTCTACTTCCTCG-1	-0.525565	-0.587644	1.020571	15.879073	15.877780	17.799659	-0.371575	-0.831368	-0.415449	-0.371575	...	-0.321781	-0.371575	8.865448	1.326769	4.018225	-0.525565	-0.525565	-0.455119	-0.415449	4.885653
TTTGCATGAGAGGC-1	-0.459498	-0.513773	-0.324865	20.431676	15.533753	19.685476	-0.324865	-0.726856	-0.363224	-0.324865	...	-0.281331	-0.324865	8.988557	1.406361	4.553272	-0.459498	-0.459498	-0.397907	-0.363224	5.776182
TTTGCATGCCTCAC-1	-0.560244	-0.626420	-0.396092	2.206213	4.451617	9.228275	1.670463	0.621853	-0.442862	-0.396092	...	-0.343013	0.907680	-0.594251	0.732566	3.389654	-0.560244	-0.560244	-0.485149	-0.442862	12.191711

2638 rows × 36 columns

To visualize the obtained scores, we can re-use many of scanpy’s plotting functions. First though, we need to extract them from the adata object.

score = dc.pp.get_obsm(adata, key="score_ulm")
score

AnnData object with n_obs × n_vars = 2638 × 36
    obs: 'n_genes', 'percent_mito', 'n_counts', 'celltype', 'leiden'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'leiden_colors'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr', 'score_ulm', 'padj_ulm'

decoupler.pp.get_obsm() returns a new anndata.AnnData object which holds the obtained activities in its anndata.AnnData.X attribute.

sc.pl.umap(score, color=["NK cells", "leiden"], cmap="RdBu_r")
sc.pl.violin(score, keys=["NK cells"], groupby="leiden", rotation=90)

../../_images/17abd1ba39caeecd0e8e0210a9147af498973e923d34f77ce37c2c0685034352.png

../../_images/c41020982d242c1f79bc8900ad82d8baa376de2033b9d67f8754180dd192ead9.png

The highlighted cells appear to be enriched for NK cell marker genes, with most of them belonging to cluster 4 or 3.

With decoupler, the top predicted cell types per cluster can also be identified using the function decoupler.tl.rankby_group(). This function identifies “marker” cell types per cluster using the same statistical tests available in scanpy.tl.rank_genes_groups().

df = dc.tl.rankby_group(adata=score, groupby="leiden", reference="rest", method="t-test_overestim_var")
df = df[df["stat"] > 0]
df

	group	reference	name	stat	meanchange	pval	padj
1	Clust:0	rest	T cells	38.206923	4.718293	5.308180e-247	9.554725e-246
22	Clust:0	rest	Neurons	6.885109	0.195677	7.446037e-12	1.165467e-11
23	Clust:0	rest	Erythroid-like and erythroid precursor cells	6.733126	0.211885	2.096498e-11	3.144747e-11
27	Clust:0	rest	Oligodendrocytes	3.551632	0.077581	3.905947e-04	5.021931e-04
28	Clust:0	rest	Pericytes	3.514299	0.194079	4.497034e-04	5.582525e-04
...	...	...	...	...	...	...	...
282	Clust:7	rest	Gamma delta T cells	0.438223	0.729249	6.656753e-01	7.730423e-01
283	Clust:7	rest	Alpha cells	0.297565	0.133760	7.682714e-01	8.285775e-01
285	Clust:7	rest	Neurons	0.278680	0.067593	7.825454e-01	8.285775e-01
286	Clust:7	rest	Luminal epithelial cells	0.220853	0.123104	8.268214e-01	8.504448e-01
287	Clust:7	rest	Fibroblasts	0.173151	0.101706	8.639469e-01	8.639469e-01

133 rows × 7 columns

The top 3 predicted cell types per cluster can then be extracted.

n_ctypes = 3
ctypes_dict = df.groupby("group").head(n_ctypes).groupby("group")["name"].apply(lambda x: list(x)).to_dict()
ctypes_dict

{'Clust:0': ['T cells',
  'Neurons',
  'Erythroid-like and erythroid precursor cells'],
 'Clust:1': ['Neutrophils', 'Monocytes', 'Dendritic cells'],
 'Clust:2': ['B cells', 'B cells naive', 'Plasma cells'],
 'Clust:3': ['NK cells', 'Gamma delta T cells', 'T cells'],
 'Clust:4': ['Gamma delta T cells',
  'NK cells',
  'Plasmacytoid dendritic cells'],
 'Clust:5': ['Macrophages', 'Dendritic cells', 'Monocytes'],
 'Clust:6': ['Dendritic cells', 'Luminal epithelial cells', 'Ductal cells'],
 'Clust:7': ['Platelets', 'Endothelial cells', 'Hepatic stellate cells']}

The obtained top predicted cell types can be visualized.

sc.pl.matrixplot(
    adata=score,
    var_names=ctypes_dict,
    groupby="leiden",
    dendrogram=True,
    standard_scale="var",
    colorbar_title="Z-scaled scores",
    cmap="RdBu_r",
)

WARNING: dendrogram data not found (using key=dendrogram_leiden). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.

../../_images/0b9972d5a6adf6e45da3030f0bc8d146388be80ebf9cc787dd9d3bb4cbb9271c.png

The plot reveals that cluster 7 corresponds to Platelets, cluster 4 appears to be NK cells, clusters 0 and 3 may represent T-cells, cluster 2 is likely a form of B cells, and clusters 6, 5, and 1 belong to the myeloid lineage.

Individual cell types can be examined by plotting their distributions.sc.pl.violin(acts, keys=[‘T cells’, ‘B cells’, ‘Platelets’, ‘Monocytes’, ‘NK cells’], groupby=’leiden’)

sc.pl.violin(
    adata=score,
    keys=["T cells", "B cells", "Platelets", "Monocytes", "NK cells"],
    groupby="leiden",
    rotation=90,
)

../../_images/7e609c5f39b142079f3705177fa9e33c37aa25d6b2aafdcc2b39b719ad542dee.png

Final annotation should be performed manually based on the interpretation of enrichment results. However, an automatic prediction can be generated by assigning the top predicted cell type to each cluster. Although this approach does not require tissue-specific expertise, it may lead to inaccuracies. Nevertheless, it can serve as a useful starting point.

dict_ann = df[df["stat"] > 0].groupby("group").head(1).set_index("group")["name"].to_dict()
dict_ann

{'Clust:0': 'T cells',
 'Clust:2': 'B cells',
 'Clust:1': 'Neutrophils',
 'Clust:4': 'Gamma delta T cells',
 'Clust:3': 'NK cells',
 'Clust:5': 'Macrophages',
 'Clust:6': 'Dendritic cells',
 'Clust:7': 'Platelets'}

Once the top cell type has been selected for each cluster, annotation can be performed.

# Update cats
adata.obs["leiden"] = adata.obs["leiden"].cat.rename_categories(dict_ann)

In this case, the predicted annotations can also be compared to the known reference annotations

sc.pl.umap(
    adata=adata,
    color=["celltype", "leiden"],
    ncols=1,
)

../../_images/d57dd02db560acecb64ae85b631d467c6597b06867266db86536bd4e0f85ca44.png

Compared to the original annotation from the scanpy tutorial [WAT18], the results are largely similar, though some discrepancies are present, highlighting the limitations of automatic annotation.

While automatic methods are effective for identifying general lineages, they may be less reliable when distinguishing specific cell type subsets.

Therefore, cell annotation should always be reviewed by a domain expert familiar with the tissue of interest. This section only illustrates how to generate an initial draft annotation. The remainder of the tutorial will use the reference annotations.

Transcription factor scoring from gene regulatory networks#

Transcription factors (TFs) are genes that, once translated into proteins, bind to DNA and regulate the expression of other genes by either promoting or inhibiting their transcription. Gene Regulatory Networks (GRNs) capture these TF-gene interactions and can be constructed from prior knowledge or inferred from omics data. The fundamental unit of a GRN is a TF and its associated target genes, collectively known as a regulon. Each regulon functions as a gene set in enrichment analysis.

Although TFs are measured in transcriptomic data, their transcript levels often do not reflect their actual activity in a given cell. Instead, scoring TFs through enrichment analysis based on the expression of their target genes provides a more accurate representation of their regulatory activity [BiMWMullerD+23].

CollecTRI network#

CollecTRI is a comprehensive resource containing a curated collection of TFs and their transcriptional targets compiled from 12 different resources [MullerDTV+23]. This collection provides an increased coverage of transcription factors and a superior performance in identifying perturbed TFs compared to other literature based GRNs such as DoRothEA [GAHI+19]. Similar to DoRothEA, interactions are weighted by their mode of regulation (activation or inhibition).

In this tutorial we will use the human version but other organisms are available. We can use decoupler to retrieve it from the OmniPath server [TureiKSR16].

Note

In this tutorial we use the network CollecTRI, but we could use any other GRN coming from an inference method such as CellOracle, pySCENIC or SCENIC+.

collectri = dc.op.collectri(organism="human")
collectri

	source	target	weight	resources	references	sign_decision
0	MYC	TERT	1.0	DoRothEA-A;ExTRI;HTRI;NTNU.Curated;Pavlidis202...	10022128;10491298;10606235;10637317;10723141;1...	PMID
1	SPI1	BGLAP	1.0	ExTRI	10022617	default activation
2	SMAD3	JUN	1.0	ExTRI;NTNU.Curated;TFactS;TRRUST	10022869;12374795	PMID
3	SMAD4	JUN	1.0	ExTRI;NTNU.Curated;TFactS;TRRUST	10022869;12374795	PMID
4	STAT5A	IL2	1.0	ExTRI	10022878;11435608;17182565;17911616;22854263;2...	default activation
...	...	...	...	...	...	...
42985	NFKB	hsa-miR-143-3p	1.0	ExTRI	19472311	default activation
42986	AP1	hsa-miR-206	1.0	ExTRI;GEREDB;NTNU.Curated	19721712	PMID
42987	NFKB	hsa-miR-21	1.0	ExTRI	20813833;22387281	default activation
42988	NFKB	hsa-miR-224-5p	1.0	ExTRI	23474441;23988648	default activation
42989	AP1	hsa-miR-144-3p	1.0	ExTRI	23546882	default activation

42990 rows × 6 columns

Scoring#

TF scores can be easily computed by running the ulm method.

dc.mt.ulm(data=adata, net=collectri)

Scores can be then extracted.

score = dc.pp.get_obsm(adata=adata, key="score_ulm")
score

AnnData object with n_obs × n_vars = 2638 × 606
    obs: 'n_genes', 'percent_mito', 'n_counts', 'celltype', 'leiden'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'leiden_colors', 'dendrogram_leiden', 'celltype_colors'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr', 'score_ulm', 'padj_ulm'

And visualized.

tf = "PAX5"
sc.pl.umap(score, color=[tf, "celltype"], cmap="RdBu_r", vcenter=0, title=[f"{tf} score", "celltype"])
sc.pl.violin(score, keys=[tf], groupby="celltype", rotation=90, ylabel=f"{tf} score")

../../_images/c1a15dc75a203e942aa61da90005796cc536335f08a9a7c1d5f20155dbc53bd8.png

../../_images/53e70cda9e66b41e69e008c4278ce63d63df0e2c33dda3763a451a2f39d6c9e4.png

Here, the inferred activity of PAX5 across cells is shown, with particularly high activity in B cells. Notably, PAX5 is a well-established TF essential for B cell identity and function [CSDB07].

As previously noted, TF expression alone is not a reliable proxy for its activity. For instance, although PAX5 is known to be active in B cells, its expression is detected in only a limited number of cells in this dataset. This highlights the importance of using TF enrichment scores.

sc.pl.umap(adata, color=[tf, "celltype"], title=[f"{tf} expression", "celltype"])
sc.pl.violin(adata, keys=[tf], groupby="celltype", rotation=90, ylabel=f"{tf} expression")

../../_images/7192d50853c46d43d04ebb18910a4299725b9ec0b3dab27c7ec6e4bbc5ab7fd5.png

../../_images/fdeff2b0467ba464b017e248842ffce0221a399f27739cf2023e9a7f02d3d0a3.png

Next, marker TFs for each cell type can be identified.

df = dc.tl.rankby_group(adata=score, groupby="celltype", reference="rest", method="t-test_overestim_var")
df = df[df["stat"] > 0]
df

	group	reference	name	stat	meanchange	pval	padj
9	CD4 T cells	rest	ZBTB4	33.614919	2.688046	1.054929e-195	6.392872e-194
11	CD4 T cells	rest	MYC	31.367690	2.156232	1.369677e-171	6.916871e-170
28	CD4 T cells	rest	ZBED1	24.925535	1.379326	9.114879e-121	1.904695e-119
29	CD4 T cells	rest	KLF3	24.540906	1.130052	3.574529e-118	7.220548e-117
43	CD4 T cells	rest	TRERF1	21.797367	0.628027	1.835785e-95	2.528377e-94
...	...	...	...	...	...	...	...
4840	Megakaryocytes	rest	SIM2	0.018855	0.005540	9.850910e-01	9.966029e-01
4841	Megakaryocytes	rest	TRERF1	0.012646	0.002931	9.900187e-01	9.974627e-01
4843	Megakaryocytes	rest	DLX4	0.008465	0.001910	9.933062e-01	9.974627e-01
4844	Megakaryocytes	rest	ZNF350	0.007056	0.001330	9.944218e-01	9.974627e-01
4846	Megakaryocytes	rest	SOX11	0.004654	0.000548	9.963256e-01	9.974627e-01

2427 rows × 7 columns

The top 3 TF markers per cell type can then be extracted.

n_markers = 3
source_markers = (
    df.groupby("group")
    .head(n_markers)
    .drop_duplicates("name")
    .groupby("group")["name"]
    .apply(lambda x: list(x))
    .to_dict()
)
source_markers

{'B cells': ['RFXANK', 'RFX5', 'RFXAP'],
 'CD14+ Monocytes': ['EHF', 'ONECUT1', 'ELF3'],
 'CD4 T cells': ['ZBTB4', 'MYC', 'ZBED1'],
 'CD8 T cells': ['KLF13', 'RELB', 'NFKB2'],
 'Dendritic cells': [],
 'FCGR3A+ Monocytes': ['PPARD', 'SIN3A', 'SPIC'],
 'Megakaryocytes': ['MEIS1', 'PKNOX1', 'PBX2'],
 'NK cells': ['ZGLP1', 'CEBPZ', 'ZNF395']}

The obtained markers can be plotted

sc.pl.matrixplot(
    adata=score,
    var_names=source_markers,
    groupby="celltype",
    dendrogram=True,
    standard_scale="var",
    colorbar_title="Z-scaled scores",
    cmap="RdBu_r",
)

WARNING: dendrogram data not found (using key=dendrogram_celltype). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.

../../_images/c33d1c9b5cdbd8b6db1eec8642fdc46660614908b806529b0965a5c28a7f83b3.png

Individual TFs can be examined by plotting their score distributions.

sc.pl.violin(score, keys=["EBF1"], groupby="celltype", rotation=90)

../../_images/3e9f665106c7cb5af07d01f36313840fa506da3240d37bdf444db62b9ff34a2c.png

Here we can observe the TF activities for EBF1, which is a known marker TF for B cells.

We can also plot collectri to see relevant TFs and target genes for B cells:

gex_bcells = adata[adata.obs["celltype"] == "B cells"].to_df().mean(0).to_frame().T
score_bcells = score[score.obs["celltype"] == "B cells"].to_df().mean(0).to_frame().T

dc.pl.network(
    data=gex_bcells,
    score=score_bcells,
    net=collectri,
    sources=["PAX5", "EBF1", "RFX5"],
    targets=20,
    size_node=10,
    figsize=(5, 5),
    s_cmap="Reds",
)

../../_images/746458b1c97a8f3b34bf11d6fd4496e08ea43607cd23caf1f59ded83543a3619.png

Pathway Scoring#

The same approach used for TF scoring can also be applied to pathways. Numerous databases provide curated pathway gene sets, with one of the most well-known being MSigDB, which includes several collections [LSP+11]. These and many other resources can be accessed using the function decoupler.op.resource(). To view the list of available databases, use decoupler.op.show_resources().

PROGENy Pathway Genes#

PROGENy is a comprehensive resource that provides a curated collection of pathways and their target genes, along with weights for each interaction [SKK+18].

Below is a brief description of each pathway:

Androgen: involved in the growth and development of the male reproductive organs
EGFR: regulates growth, survival, migration, apoptosis, proliferation, and differentiation in mammalian cells
Estrogen: promotes the growth and development of the female reproductive organs
Hypoxia: promotes angiogenesis and metabolic reprogramming when O2 levels are low
JAK-STAT: involved in immunity, cell division, cell death, and tumor formation
MAPK: integrates external signals and promotes cell growth and proliferation
NFkB: regulates immune response, cytokine production and cell survival
p53: regulates cell cycle, apoptosis, DNA repair and tumor suppression
PI3K: promotes growth and proliferation
TGFb: involved in development, homeostasis, and repair of most tissues
TNFa: mediates haematopoiesis, immune surveillance, tumour regression and protection from infection
Trail: induces apoptosis
VEGF: mediates angiogenesis, vascular permeability, and cell migration
WNT: regulates organ morphogenesis during development and tissue repair

This is how to access to it.

progeny = dc.op.progeny(organism="human")
progeny

	source	target	weight	padj
0	Androgen	TMPRSS2	11.490631	2.384806e-47
1	Androgen	NKX3-1	10.622551	2.205102e-44
2	Androgen	MBOAT2	10.472733	4.632376e-44
3	Androgen	KLK2	10.176186	1.944410e-40
4	Androgen	SARG	11.386852	2.790210e-40
...	...	...	...	...
62456	p53	ENPP2	2.771405	4.993215e-02
62457	p53	ARRDC4	3.494328	4.996747e-02
62458	p53	MYO1B	-1.148057	4.997905e-02
62459	p53	CTSC	-1.784693	4.998864e-02
62460	p53	NAA50	-1.435013	4.998884e-02

62461 rows × 4 columns

Scoring#

Pathway scores can be easily computed by running the ulm method.

dc.mt.ulm(data=adata, net=progeny)

Scores can be then extracted.

score = dc.pp.get_obsm(adata=adata, key="score_ulm")
score

AnnData object with n_obs × n_vars = 2638 × 14
    obs: 'n_genes', 'percent_mito', 'n_counts', 'celltype', 'leiden'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'leiden_colors', 'dendrogram_leiden', 'celltype_colors', 'dendrogram_celltype'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr', 'score_ulm', 'padj_ulm'

And visualized.

sc.pl.umap(score, color=["Trail", "celltype"], cmap="RdBu_r", vcenter=0)
sc.pl.violin(score, keys=["Trail"], groupby="celltype", rotation=90)

../../_images/ad2c795645258978a441c55fc09385e221f639c9d41f38452a4b80a77f434ba4.png

../../_images/90ea305f7c39555194c170e99ef0c20bcd6a92f6568105a1a573756b7449a5a9.png

It seem that in B cells, the pathway Trail, associated with apoptosis, is more active.

Given that there are only 14 pathways, they can be directly visualized without the need for marker extraction.

sc.pl.matrixplot(
    adata=score,
    var_names=score.var_names,
    groupby="celltype",
    dendrogram=True,
    standard_scale="var",
    colorbar_title="Z-scaled scores",
    cmap="RdBu_r",
)

../../_images/550a24c60b9af1922d94b1f971008d82d91a66cb9690c84b82acec7b6ef1d695.png

In this specific example, it can be observed that TNF-α is more active in FCGR3A+ Monocytes, while PI3K activity is higher in Megakaryocytes.

sc.pl.violin(score, keys=["TNFa", "PI3K"], groupby="celltype", rotation=90)

../../_images/5f476f2fef55402b9abc45548e3584d2a343774a61fed8883708b795cdb2e6d9.png

Hallmark gene sets#

Hallmark gene sets are curated collections of genes that represent specific, well-defined biological states or processes. They are part of MSigDB and were developed to reduce redundancy and improve interpretability compared to older, more overlapping gene set collections [LSP+11].

A total of 50 gene sets are provided, designed to be non-redundant, concise, and biologically coherent.

This is how to access them.

hallmark = dc.op.hallmark(organism="human")
hallmark

	source	target
0	IL2_STAT5_SIGNALING	MAFF
1	COAGULATION	MAFF
2	HYPOXIA	MAFF
3	TNFA_SIGNALING_VIA_NFKB	MAFF
4	COMPLEMENT	MAFF
...	...	...
7313	PANCREAS_BETA_CELLS	STXBP1
7314	PANCREAS_BETA_CELLS	ELP4
7315	PANCREAS_BETA_CELLS	GCG
7316	PANCREAS_BETA_CELLS	PCSK2
7317	PANCREAS_BETA_CELLS	PAX6

7318 rows × 2 columns

Scoring#

Pathway scores can be easily computed by running the ulm method.

dc.mt.ulm(data=adata, net=hallmark)

Scores can be then extracted.

score = dc.pp.get_obsm(adata=adata, key="score_ulm")
score

AnnData object with n_obs × n_vars = 2638 × 50
    obs: 'n_genes', 'percent_mito', 'n_counts', 'celltype', 'leiden'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups', 'leiden_colors', 'dendrogram_leiden', 'celltype_colors', 'dendrogram_celltype'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr', 'score_ulm', 'padj_ulm'

Next, marker gene sets for each cell type can be identified.

df = dc.tl.rankby_group(adata=score, groupby="celltype", reference="rest", method="t-test_overestim_var")
df = df[df["stat"] > 0]
df

	group	reference	name	stat	meanchange	pval	padj
3	CD4 T cells	rest	MYC_TARGETS_V1	21.490565	1.819335	1.590125e-92	1.987656e-91
8	CD4 T cells	rest	MYC_TARGETS_V2	17.441147	0.553157	4.855032e-64	2.697240e-63
11	CD4 T cells	rest	WNT_BETA_CATENIN_SIGNALING	16.358834	0.366900	6.704730e-57	2.793637e-56
22	CD4 T cells	rest	P53_PATHWAY	11.461240	0.493784	1.378643e-29	2.997051e-29
24	CD4 T cells	rest	E2F_TARGETS	9.503262	0.319374	5.152361e-21	1.030472e-20
...	...	...	...	...	...	...	...
391	Megakaryocytes	rest	IL2_STAT5_SIGNALING	0.436819	0.194693	6.665103e-01	7.934646e-01
392	Megakaryocytes	rest	PI3K_AKT_MTOR_SIGNALING	0.383430	0.163172	7.043131e-01	8.189688e-01
394	Megakaryocytes	rest	INFLAMMATORY_RESPONSE	0.105198	0.044449	9.169803e-01	9.828205e-01
395	Megakaryocytes	rest	WNT_BETA_CATENIN_SIGNALING	0.085780	0.013781	9.324955e-01	9.828205e-01
397	Megakaryocytes	rest	GLYCOLYSIS	0.071561	0.027577	9.435076e-01	9.828205e-01

218 rows × 7 columns

The top 3 set markers per cell type can then be extracted.

n_markers = 3
source_markers = (
    df.groupby("group")
    .head(n_markers)
    .drop_duplicates("name")
    .groupby("group")["name"]
    .apply(lambda x: list(x))
    .to_dict()
)
source_markers

{'B cells': ['KRAS_SIGNALING_DN',
  'ALLOGRAFT_REJECTION',
  'INTERFERON_GAMMA_RESPONSE'],
 'CD14+ Monocytes': ['COMPLEMENT',
  'ANGIOGENESIS',
  'EPITHELIAL_MESENCHYMAL_TRANSITION'],
 'CD4 T cells': ['MYC_TARGETS_V1',
  'MYC_TARGETS_V2',
  'WNT_BETA_CATENIN_SIGNALING'],
 'CD8 T cells': ['ANDROGEN_RESPONSE', 'APICAL_JUNCTION'],
 'Dendritic cells': ['OXIDATIVE_PHOSPHORYLATION',
  'FATTY_ACID_METABOLISM',
  'ADIPOGENESIS'],
 'FCGR3A+ Monocytes': ['COAGULATION'],
 'Megakaryocytes': [],
 'NK cells': ['INTERFERON_ALPHA_RESPONSE']}

And plotted.

sc.pl.matrixplot(
    adata=score,
    var_names=source_markers,
    groupby="celltype",
    dendrogram=True,
    standard_scale="var",
    colorbar_title="Z-scaled scores",
    cmap="RdBu_r",
)

../../_images/ac4809ebcfd60cd662820bd69efd95b14bd39641806bff8e072c3e70463280e9.png

Single-cell Enrichment Analysis

Contents

Single-cell Enrichment Analysis#

Loading Packages#

Loading The Dataset#

Enrichment analysis#

Cell type scoring based on marker genes#

PanglaoDB#

Scoring#

Transcription factor scoring from gene regulatory networks#

CollecTRI network#

Scoring#

Pathway Scoring#

PROGENy Pathway Genes#

Scoring#

Hallmark gene sets#

Scoring#