Pseudotime Enrichment Analysis#
Analyzing omics data is not always as straightforward as comparing observations between two discrete groups, such as treatment against control. In some cases, the biological context exhibits a continuous nature, such as cell differentiation, disease progression, or developmental stages. In these scenarios, it is important to assess how features change along the continuous process and how enrichment scores that summarize these features vary across this axis.
This notebook demonstrates the use of decoupler to infer transcription factor (TF)
and pathway enrichment scores from a mouse scRNA-seq dataset.
The dataset includes 1k donwsampled cells from the erytroid lineage during mouse gastrulation [PSGG+19], a continuous process that can be characterized by the inferrence of pseudotime.
Loading Packages#
import numpy as np
import scanpy as sc
import decoupler as dc
sc.set_figure_params(figsize=(3, 3), frameon=False)
Loading The Dataset#
adata = dc.ds.erygast1k()
adata
AnnData object with n_obs × n_vars = 801 × 8124
obs: 'sample', 'stage', 'sequencing.batch', 'theiler', 'celltype'
uns: 'celltype_colors', 'log1p'
obsm: 'X_pca', 'X_umap'
The obtained anndata.AnnData consist of log-normalized transcript counts for ~1k cells
with measurements for ~8k genes.
The cell metadata stored in anndata.AnnData.obs can be inspected.
adata.obs
| sample | stage | sequencing.batch | theiler | celltype | |
|---|---|---|---|---|---|
| GTGGATTGCGTCTC | 29 | E8.5 | 3 | TS12 | Erythroid1 |
| ACTGCCTGAACGGG | 12 | E7.75 | 2 | TS11 | Blood progenitors 2 |
| TCCACTCTCTTCGC | 36 | E8.5 | 3 | TS12 | Erythroid3 |
| GCAGCCGAGTGTTG | 20 | E7.5 | 2 | TS11 | Blood progenitors 2 |
| GATCCCTGATCGTG | 37 | E8.5 | 3 | TS12 | Erythroid3 |
| ... | ... | ... | ... | ... | ... |
| GACATTCTGGAACG | 13 | E7.75 | 2 | TS11 | Blood progenitors 2 |
| ACAAATTGCGTGTA | 36 | E8.5 | 3 | TS12 | Erythroid3 |
| TGCACGCTAACGTC | 37 | E8.5 | 3 | TS12 | Erythroid3 |
| GCAGTTGAGGTTAC | 16 | E8.0 | 2 | TS12 | Erythroid1 |
| ACGGTCCTTTTCAC | 28 | E8.25 | 2 | TS12 | Erythroid2 |
801 rows × 5 columns
And visualized.
Pseudotime inference#
Pseudotime inference estimates the progression of cells along a continuous trajectory based on their proximity in a low-dimensional space derived from molecular readouts, in this case, gene expression.
It requires a defined starting point, typically based on prior biological knowledge of the tissue under study.
In this case, the trajectory is known to originate from the “Blood Progenitors 1” cell type.
The following demonstrates how to infer pseudotime using paga [WHP+19].
Note
In some cases, pseudotime calculation is not necessary, as information about the continuous process of
interest may already be available in the metadata or inferred through alternative methods,
such as factor scoring in mofacell [RFLD+23].
root_celltype = "Blood progenitors 1"
sc.pp.neighbors(adata, use_rep="X_pca")
sc.tl.diffmap(adata)
sc.pp.neighbors(adata, use_rep="X_diffmap")
sc.tl.paga(adata, groups="celltype")
iroot = np.flatnonzero(adata.obs["celltype"] == root_celltype)[0]
adata.uns["iroot"] = iroot
sc.tl.dpt(adata)
Then, he obtained graph of cell types and the predicted pseudotime can be visualized.
sc.pl.paga(adata, color=["celltype"])
sc.pl.umap(adata, color=["celltype", "dpt_pseudotime"], legend_loc="on data")
As expected, the predicted pseudotime orders cells from “Blood progenitors 1” to “Erythroid3”.
Enrichment analysis#
Enrichment analysis tests whether a specific set of omics features is “overrepresented” or “coordinated” in the measured data compared to a background distribution. These sets are predefined based on existing biological knowledge and may vary depending on the omics technology used.
Enrichment analysis requires the use of an enrichment method, and several options are available.
In the original manuscript of decoupler [BiMVSB+22], we benchmarked multiple methods
and found that the univariate linear model (ulm) outperformed the others. Therefore, we will use
ulm in this vignette.
The scores from decoupler.mt.ulm() should be interpreted such that larger magnitudes indicate
greater significance, while the sign reflects whether the features in the set are overrepresented
(positive) or underrepresented (negative) compared to the background.
Transcription factor scoring from gene regulatory networks#
Transcription factors (TFs) are genes that, once translated into proteins, bind to DNA and regulate the expression of other genes by either promoting or inhibiting their transcription. Gene Regulatory Networks (GRNs) capture these TF-gene interactions and can be constructed from prior knowledge or inferred from omics data. The fundamental unit of a GRN is a TF and its associated target genes, collectively known as a regulon. Each regulon functions as a gene set in enrichment analysis.
Although TFs are measured in transcriptomic data, their transcript levels often do not reflect their actual activity in a given cell. Instead, scoring TFs through enrichment analysis based on the expression of their target genes provides a more accurate representation of their regulatory activity [BiMWMullerD+23].
CollecTRI network#
CollecTRI is a comprehensive resource containing a curated collection of TFs and their transcriptional targets compiled from 12 different resources [MullerDTV+23]. This collection provides an increased coverage of transcription factors and a superior performance in identifying perturbed TFs compared to other literature based GRNs such as DoRothEA [GAHI+19]. Similar to DoRothEA, interactions are weighted by their mode of regulation (activation or inhibition).
In this tutorial we will use the mouse version but other organisms are available.
We can use decoupler to retrieve it from the OmniPath server [TureiKSR16].
Note In this tutorial we use the network CollecTRI, but we could use any other GRN coming from an inference method such as CellOracle, pySCENIC or SCENIC+.
collectri = dc.op.collectri(organism="mouse")
collectri
| source | target | weight | resources | references | sign_decision | |
|---|---|---|---|---|---|---|
| 0 | Myc | Tert | 1.0 | DoRothEA-A;ExTRI;HTRI;NTNU.Curated;Pavlidis202... | 10022128;10491298;10606235;10637317;10723141;1... | PMID |
| 1 | Spi1 | Bglap3 | 1.0 | ExTRI | 10022617 | default activation |
| 2 | Spi1 | Bglap | 1.0 | ExTRI | 10022617 | default activation |
| 3 | Spi1 | Bglap2 | 1.0 | ExTRI | 10022617 | default activation |
| 4 | Smad3 | Jun | 1.0 | ExTRI;NTNU.Curated;TFactS;TRRUST | 10022869;12374795 | PMID |
| ... | ... | ... | ... | ... | ... | ... |
| 43221 | Runx1 | Lcp2 | 1.0 | DoRothEA-A | 20019798 | default activation |
| 43222 | Runx1 | Prr5l | 1.0 | DoRothEA-A | 20019798 | default activation |
| 43223 | Twist1 | Gli1 | 1.0 | DoRothEA-A | 11948912 | default activation |
| 43224 | Usf1 | Nup188 | 1.0 | DoRothEA-A | 22951020 | PMID |
| 43225 | Zfp148 | Rnls | 1.0 | DoRothEA-A | 25295465 | PMID |
43226 rows × 6 columns
Scoring#
TF scores can be easily computed by running the ulm method.
dc.mt.ulm(data=adata, net=collectri)
Scores can then be extracted as a new anndata.AnnData object.
score = dc.pp.get_obsm(adata=adata, key="score_ulm")
score
AnnData object with n_obs × n_vars = 801 × 513
obs: 'sample', 'stage', 'sequencing.batch', 'theiler', 'celltype', 'dpt_pseudotime'
uns: 'celltype_colors', 'log1p', 'stage_colors', 'neighbors', 'diffmap_evals', 'paga', 'celltype_sizes', 'iroot'
obsm: 'X_pca', 'X_umap', 'X_diffmap', 'score_ulm', 'padj_ulm'
And visualized.
tf = "Klf3"
sc.pl.umap(score, color=[tf, "celltype"], cmap="RdBu_r", vcenter=0)
sc.pl.violin(score, keys=[tf], groupby="celltype", rotation=90, ylabel=f"{tf} score")
Here, the inferred activity of Klf3 across cells is shown, with an increasing value along the trajectory. Notably, Klf3 is a well-established TF essential in erythroid lineage specification and maturation.
As previously noted, TF expression alone is not a reliable proxy for its activity. For instance, although Klf3 is known to activate during the comitement towards the erythroid lineage, its expression is detected in only a limited number of cells in this dataset. This highlights the importance of using TF enrichment scores.
sc.pl.umap(adata, color=[tf, "celltype"])
sc.pl.violin(adata, keys=[tf], groupby="celltype", rotation=90, ylabel=f"{tf} expression")
Next, TFs associated with the inferred pseudotime can be identified.
tfs = dc.tl.rankby_order(
adata=score,
order="dpt_pseudotime",
stat="dcor",
)
tfs
| name | stat | pval | padj | |
|---|---|---|---|---|
| 0 | Klf3 | 1924.545622 | 0.000000 | 0.000000 |
| 1 | Klf1 | 1802.000715 | 0.000000 | 0.000000 |
| 2 | Hoxb6 | 1720.184784 | 0.000000 | 0.000000 |
| 3 | Nfe2 | 1715.817219 | 0.000000 | 0.000000 |
| 4 | Atrx | 1562.600819 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... |
| 508 | Nr0b1 | -0.382287 | 0.648875 | 0.653975 |
| 509 | Msc | -0.464131 | 0.678723 | 0.682715 |
| 510 | Rb1 | -0.501624 | 0.692034 | 0.694742 |
| 511 | Crx | -0.658420 | 0.744866 | 0.746320 |
| 512 | Ep300 | -0.675837 | 0.750428 | 0.750428 |
513 rows × 4 columns
The top 5 TF markers associated to the pseudotime can then be extracted.
top_tfs = tfs.head(5)["name"].to_list()
top_tfs
['Klf3', 'Klf1', 'Hoxb6', 'Nfe2', 'Atrx']
And visualized.
sc.pl.umap(score, color=top_tfs, cmap="coolwarm", vmin=-5, vmax=5, ncols=1, title=[f"{t} score" for t in top_tfs])
Prior to generating more complex visualizations, cells are grouped into bins based on their ordering to facilitate plotting.
bin_tfs = dc.pp.bin_order(
adata=score,
order="dpt_pseudotime",
names=top_tfs,
label="celltype",
)
bin_tfs
| name | order | value | label | color | |
|---|---|---|---|---|---|
| 0 | Klf3 | 0.005 | -1.236535 | Blood progenitors 1 | #f9decf |
| 1 | Klf3 | 0.005 | -0.073794 | Blood progenitors 1 | #f9decf |
| 2 | Klf3 | 0.005 | -0.685060 | Blood progenitors 1 | #f9decf |
| 3 | Klf3 | 0.005 | -0.678414 | Blood progenitors 1 | #f9decf |
| 4 | Klf3 | 0.015 | -0.701368 | Blood progenitors 1 | #f9decf |
| ... | ... | ... | ... | ... | ... |
| 4000 | Atrx | 0.985 | -6.894305 | Erythroid3 | #EF4E22 |
| 4001 | Atrx | 0.985 | -4.275815 | Erythroid3 | #EF4E22 |
| 4002 | Atrx | 0.995 | -5.932188 | Erythroid3 | #EF4E22 |
| 4003 | Atrx | 0.995 | -7.659520 | Erythroid3 | #EF4E22 |
| 4004 | Atrx | 0.995 | -9.214737 | Erythroid3 | #EF4E22 |
4005 rows × 5 columns
Which can be plotted as lines.
Or as a matrix.
dc.pl.order(df=bin_tfs, mode="mat", kw_order={"vmin": -5, "vmax": +5, "cmap": "RdBu_r"}, figsize=(6, 4))
TFs relevant to progenitor cells, such as Atrx and Hoxb6, are active at the beginning of the trajectory. In contrast, TFs involved in commitment to the erythroid lineage, such as the Klf family, are initially inactive but become activated along the trajectory.
Individual TFs can be examined by plotting their score distributions.
Additionally, to better understand the obtained enrichment scores, the targets of a specified TF can be plotted along the trajectory.
dc.pl.order_targets(
adata=adata,
net=collectri,
label="celltype",
source="Atrx",
order="dpt_pseudotime",
)
The change in the enrichment score results from positive targets of Atrx decreasing in expression, while its repressed targets increase in expression along the trajectory. Consequently, the TF is active at the beginning but becomes inactive by the end.
Pathway Scoring#
The same approach used for TF scoring can also be applied to pathways. Numerous
databases provide curated pathway gene sets, with one of the most well-known being MSigDB, which
includes several collections [LSP+11].
These and many other resources can be accessed using the function decoupler.op.resource().
To view the list of available databases, use decoupler.op.show_resources().
PROGENy Pathway Genes#
PROGENy is a comprehensive resource that provides a curated collection of pathways and their target genes, along with weights for each interaction [SKK+18].
Below is a brief description of each pathway:
Androgen: involved in the growth and development of the male reproductive organs
EGFR: regulates growth, survival, migration, apoptosis, proliferation, and differentiation in mammalian cells
Estrogen: promotes the growth and development of the female reproductive organs
Hypoxia: promotes angiogenesis and metabolic reprogramming when O2 levels are low
JAK-STAT: involved in immunity, cell division, cell death, and tumor formation
MAPK: integrates external signals and promotes cell growth and proliferation
NFkB: regulates immune response, cytokine production and cell survival
p53: regulates cell cycle, apoptosis, DNA repair and tumor suppression
PI3K: promotes growth and proliferation
TGFb: involved in development, homeostasis, and repair of most tissues
TNFa: mediates haematopoiesis, immune surveillance, tumour regression and protection from infection
Trail: induces apoptosis
VEGF: mediates angiogenesis, vascular permeability, and cell migration
WNT: regulates organ morphogenesis during development and tissue repair
This is how to access to it.
progeny = dc.op.progeny(organism="mouse")
progeny
| source | target | weight | padj | |
|---|---|---|---|---|
| 0 | Androgen | Tmprss2 | 11.490631 | 2.384806e-47 |
| 1 | Androgen | Nkx3-1 | 10.622551 | 2.205102e-44 |
| 2 | Androgen | Mboat2 | 10.472733 | 4.632376e-44 |
| 3 | Androgen | Slc38a4 | 7.363805 | 1.253071e-39 |
| 4 | Androgen | Mtmr9 | 6.130646 | 2.534403e-38 |
| ... | ... | ... | ... | ... |
| 60572 | p53 | Enpp2 | 2.771405 | 4.993215e-02 |
| 60573 | p53 | Arrdc4 | 3.494328 | 4.996747e-02 |
| 60574 | p53 | Myo1b | -1.148057 | 4.997905e-02 |
| 60575 | p53 | Ctsc | -1.784693 | 4.998864e-02 |
| 60576 | p53 | Naa50 | -1.435013 | 4.998884e-02 |
60577 rows × 4 columns
Scoring#
Pathway scores can be easily computed by running the ulm method.
dc.mt.ulm(data=adata, net=progeny)
Scores can then be extracted.
score = dc.pp.get_obsm(adata=adata, key="score_ulm")
score
AnnData object with n_obs × n_vars = 801 × 14
obs: 'sample', 'stage', 'sequencing.batch', 'theiler', 'celltype', 'dpt_pseudotime'
uns: 'celltype_colors', 'log1p', 'stage_colors', 'neighbors', 'diffmap_evals', 'paga', 'celltype_sizes', 'iroot'
obsm: 'X_pca', 'X_umap', 'X_diffmap', 'score_ulm', 'padj_ulm'
And visualized.
sc.pl.umap(score, color=["PI3K", "celltype"], cmap="RdBu_r", vcenter=0)
sc.pl.violin(score, keys=["PI3K"], groupby="celltype", rotation=90)
Next, the association between pathways and the trajectory can be tested.
pws = dc.tl.rankby_order(
adata=score,
order="dpt_pseudotime",
)
pws
| name | stat | pval | padj | |
|---|---|---|---|---|
| 0 | PI3K | 332.077288 | 0.000000e+00 | 0.000000e+00 |
| 1 | EGFR | 292.234849 | 0.000000e+00 | 0.000000e+00 |
| 2 | MAPK | 271.986785 | 0.000000e+00 | 0.000000e+00 |
| 3 | VEGF | 118.975377 | 0.000000e+00 | 0.000000e+00 |
| 4 | TGFb | 102.538265 | 0.000000e+00 | 0.000000e+00 |
| 5 | Estrogen | 100.245675 | 0.000000e+00 | 0.000000e+00 |
| 6 | p53 | 84.836576 | 0.000000e+00 | 0.000000e+00 |
| 7 | WNT | 81.348290 | 0.000000e+00 | 0.000000e+00 |
| 8 | NFkB | 74.053878 | 0.000000e+00 | 0.000000e+00 |
| 9 | TNFa | 55.833994 | 0.000000e+00 | 0.000000e+00 |
| 10 | JAK-STAT | 46.853171 | 0.000000e+00 | 0.000000e+00 |
| 11 | Trail | 45.698393 | 0.000000e+00 | 0.000000e+00 |
| 12 | Hypoxia | 7.035028 | 9.980905e-13 | 1.074867e-12 |
| 13 | Androgen | 3.231894 | 6.149262e-04 | 6.149262e-04 |
And plotted.
bin_pws = dc.pp.bin_order(
adata=score,
order="dpt_pseudotime",
names=pws.head(5)["name"].to_list(),
label="celltype",
)
bin_pws
| name | order | value | label | color | |
|---|---|---|---|---|---|
| 0 | PI3K | 0.005 | 9.238632 | Blood progenitors 1 | #f9decf |
| 1 | PI3K | 0.005 | 7.656678 | Blood progenitors 1 | #f9decf |
| 2 | PI3K | 0.005 | 7.569328 | Blood progenitors 1 | #f9decf |
| 3 | PI3K | 0.005 | 6.936269 | Blood progenitors 1 | #f9decf |
| 4 | PI3K | 0.015 | 8.850726 | Blood progenitors 1 | #f9decf |
| ... | ... | ... | ... | ... | ... |
| 4000 | TGFb | 0.985 | -2.011171 | Erythroid3 | #EF4E22 |
| 4001 | TGFb | 0.985 | -3.271655 | Erythroid3 | #EF4E22 |
| 4002 | TGFb | 0.995 | -2.304044 | Erythroid3 | #EF4E22 |
| 4003 | TGFb | 0.995 | -2.470572 | Erythroid3 | #EF4E22 |
| 4004 | TGFb | 0.995 | -2.429676 | Erythroid3 | #EF4E22 |
4005 rows × 5 columns
dc.pl.order(df=bin_pws, mode="mat", kw_order={"vmin": -10, "vmax": +10, "cmap": "RdBu_r"}, figsize=(6, 3))
It seems that along the trajectory there are no big changes, just a mild decrease of PI3K activity.
Hallmark gene sets#
Hallmark gene sets are curated collections of genes that represent specific, well-defined biological states or processes. They are part of MSigDB and were developed to reduce redundancy and improve interpretability compared to older, more overlapping gene set collections [LSP+11].
A total of 50 gene sets are provided, designed to be non-redundant, concise, and biologically coherent.
This is how to access them.
hallmark = dc.op.hallmark(organism="mouse")
hallmark
| source | target | |
|---|---|---|
| 0 | IL2_STAT5_SIGNALING | Maff |
| 1 | COAGULATION | Maff |
| 2 | HYPOXIA | Maff |
| 3 | TNFA_SIGNALING_VIA_NFKB | Maff |
| 4 | COMPLEMENT | Maff |
| ... | ... | ... |
| 7606 | PANCREAS_BETA_CELLS | Stxbp1 |
| 7607 | PANCREAS_BETA_CELLS | Elp4 |
| 7608 | PANCREAS_BETA_CELLS | Gcg |
| 7609 | PANCREAS_BETA_CELLS | Pcsk2 |
| 7610 | PANCREAS_BETA_CELLS | Pax6 |
7611 rows × 2 columns
Scoring#
Pathway scores can be easily computed by running the ulm method.
dc.mt.ulm(data=adata, net=hallmark)
Scores can be then extracted.
score = dc.pp.get_obsm(adata=adata, key="score_ulm")
score
AnnData object with n_obs × n_vars = 801 × 50
obs: 'sample', 'stage', 'sequencing.batch', 'theiler', 'celltype', 'dpt_pseudotime'
uns: 'celltype_colors', 'log1p', 'stage_colors', 'neighbors', 'diffmap_evals', 'paga', 'celltype_sizes', 'iroot'
obsm: 'X_pca', 'X_umap', 'X_diffmap', 'score_ulm', 'padj_ulm'
Tested for association to the trajectory.
hlm = dc.tl.rankby_order(
adata=score,
order="dpt_pseudotime",
)
hlm.head()
| name | stat | pval | padj | |
|---|---|---|---|---|
| 0 | HEME_METABOLISM | 1490.562975 | 0.0 | 0.0 |
| 1 | SPERMATOGENESIS | 380.618295 | 0.0 | 0.0 |
| 2 | EPITHELIAL_MESENCHYMAL_TRANSITION | 307.059413 | 0.0 | 0.0 |
| 3 | APICAL_JUNCTION | 287.575912 | 0.0 | 0.0 |
| 4 | ADIPOGENESIS | 280.456456 | 0.0 | 0.0 |
And plotted.
bin_hlm = dc.pp.bin_order(
adata=score,
order="dpt_pseudotime",
names=hlm.head(5)["name"].to_list(),
label="celltype",
)
bin_hlm
| name | order | value | label | color | |
|---|---|---|---|---|---|
| 0 | HEME_METABOLISM | 0.005 | -1.584656 | Blood progenitors 1 | #f9decf |
| 1 | HEME_METABOLISM | 0.005 | -3.012170 | Blood progenitors 1 | #f9decf |
| 2 | HEME_METABOLISM | 0.005 | -1.934134 | Blood progenitors 1 | #f9decf |
| 3 | HEME_METABOLISM | 0.005 | -1.641200 | Blood progenitors 1 | #f9decf |
| 4 | HEME_METABOLISM | 0.015 | -1.550928 | Blood progenitors 1 | #f9decf |
| ... | ... | ... | ... | ... | ... |
| 4000 | ADIPOGENESIS | 0.985 | 3.631165 | Erythroid3 | #EF4E22 |
| 4001 | ADIPOGENESIS | 0.985 | 4.255289 | Erythroid3 | #EF4E22 |
| 4002 | ADIPOGENESIS | 0.995 | 3.177657 | Erythroid3 | #EF4E22 |
| 4003 | ADIPOGENESIS | 0.995 | 3.437600 | Erythroid3 | #EF4E22 |
| 4004 | ADIPOGENESIS | 0.995 | 4.256072 | Erythroid3 | #EF4E22 |
4005 rows × 5 columns