Usage#

decoupler provides various statistical methods to infer enrichment scores from omics data using prior knowledge [BiMVSB+22].

It is a package from the scverse ecosystem [VBH+23], designed for easy interoperability with anndata and scanpy [WAT18].

Enrichment analysis tests whether a specific set of omics features is “overrepresented” or “coordinated” in the measured data compared to a background distribution. These sets are predefined based on existing biological knowledge and may vary depending on the omics technology used.

Enrichment analysis can be performed at two distinct resolutions:

Observation level: Feature set enrichment scores are computed individually for each observation or sample within the dataset. This is useful for characterizing continous processes that cannot be easily splitted into groups.
Contrast level: When the dataset includes defined groups, differential testing statistics can be derived between groups and subsequently used to compute a final vector of feature set enrichment scores.

This notebook provides a brief demonstration of inferring enrichment scores at the observation level using a toy dataset. For more detailed tutorials, refer to the other vignettes.

Note

In Jupyter Notebooks and JupyterLab, documentation for a Python function can be accessed by pressing SHIFT + TAB (twice to expand) or by typing ?<function_name>.

Loading Packages#

Note

The first time decoupler is run in an environment, execution may be slow due to Numba compiling its source code. Subsequent runs are faster, as the compiled code is cached.

import scanpy as sc
import decoupler as dc

sc.set_figure_params(figsize=(3.5, 3.5), frameon=False)

Loading The Dataset#

adata, net = dc.ds.toy()

The obtained anndata.AnnData consist of log-normalized integer transcript counts for 30 cells with measurements for 20 genes.

adata

AnnData object with n_obs × n_vars = 30 × 20
    obs: 'group', 'sample'

Cells are belong to two distinct groups, “A” and “B”, which exhibit differences in gene expression.

sc.pl.heatmap(
    adata=adata,
    groupby="group",
    var_names=adata.var_names,
)
sc.pl.violin(adata, groupby="group", keys=["G01", "G05", "G13"], size=5)

../_images/3bce0ab29c361fd238bdc76edbab1dcea5c59f8303fd3d6c4870ed700a2564a3.png

../_images/1ee1aba725d96da9cf7700119fc4b8532c557d4bb90873abc115045c68b58f1a.png

Some genes are more expressed in group A (like G01), others in group B (like G05) and others there is no difference (like G13). Ideally, these differences in gene expression programs should be captured in terms of interpretable biological entities. In this example, this is achieved by summarizing gene expression into transcription factor (TF) enrichment scores.

This toy dataset also includes a simple network comprising five TFs with defined regulatory interactions, either positive or negative, with target genes (Gs).

net

	source	target	weight
0	T1	G01	1.0
1	T1	G02	1.0
2	T1	G03	0.7
3	T2	G04	1.0
4	T2	G06	-0.5
5	T2	G07	-3.0
6	T2	G08	-1.0
7	T3	G06	1.0
8	T3	G07	0.5
9	T3	G08	1.0
10	T4	G05	1.9
11	T4	G10	-1.5
12	T4	G11	-2.0
13	T4	G09	3.1
14	T5	G09	0.7
15	T5	G10	1.1
16	T5	G11	0.1

This network can also be visualized like a graph.

dc.pl.network(net, size_node=15, s_cmap="white", t_cmap="white", c_pos_w="darkgreen", c_neg_w="darkred", figsize=(5, 5))

../_images/c7c80360ede63c0caf98aff4e2eda45e72be97953259ba089a29d41272dadd75.png

Based on this network, and the observed pattern of gene expression, cells in group “A” are expected to exhibit positive activity for T1 and T2, while cells in group “B” should show high activity for T3 and T4. T5 is not expected to display activity in any of the cells.

Enrichment Methods#

decoupler contains multiple enrichment methods.

dc.mt.show()

	name	desc	stype	weight	test	limits	reference
0	aucell	AUCell	categorical	False	False	(0, 1)	https://doi.org/10.1038/nmeth.4463
1	gsea	Gene Set Enrichment Analysis (GSEA)	numerical	False	True	(-inf, inf)	https://doi.org/10.1073/pnas.0506580102
2	gsva	Gene Set Variation Analysis (GSVA)	numerical	False	False	(-1, 1)	https://doi.org/10.1186/1471-2105-14-7
3	mdt	Multivariate Decision Tree (MDT)	numerical	True	False	(0, 1)	https://doi.org/10.1093/bioadv/vbac016
4	mlm	Multivariate Linear Model (MLM)	numerical	True	True	(-inf, inf)	https://doi.org/10.1093/bioadv/vbac016
5	ora	Over Representation Analysis (ORA)	categorical	False	True	(-inf, inf)	https://doi.org/10.2307/2340521
6	udt	Univariate Decision Tree (UDT)	numerical	True	False	(0, 1)	https://doi.org/10.1093/bioadv/vbac016
7	ulm	Univariate Linear Model (ULM)	numerical	True	True	(-inf, inf)	https://doi.org/10.1093/bioadv/vbac016
8	viper	Virtual Inference of Protein-activity by Enric...	numerical	True	True	(-inf, inf)	https://doi.org/10.1038/ng.3593
9	waggr	Weighted Aggregate (WAGGR)	numerical	True	True	(-inf, inf)	https://doi.org/10.1093/bioadv/vbac016
10	zscore	Z-score (ZSCORE)	numerical	True	True	(-inf, inf)	https://doi.org/10.1038/s41467-021-21211-6

Each method infers enrichment scores using different types of statistic (categorical or numerical), some incorporate feature weights, others perform statistical testing and return p-values, and the resulting scores may vary in their numerical range.

Consequently, methods may have specific arguments, which can be viewed by running ?decoupler.mt.<method_name>.

To have a unified framework, methods inside decoupler have these shared arguments:

data: AnnData instance, DataFrame or tuple of [matrix, samples, features]
net: Dataframe in long format. Must include source and target columns, and optionally a weight column:
- source: feature set names, for example: TFs, pathways, cell types, etc.
- target: feature names, for example: genes regulated by a TF, phosphosites modified by a Kinase, etc.
- weight: (optional): strength of the source-target interaction. Will be used by enrichment methods that model feature weights. Can be any value between -∞ and +∞
tmin: Minimum number of targets per source. Sources with fewer targets will be removed
raw: Whether to use the .raw attribute of anndata.AnnData
empty: Whether to remove empty observations (rows) and features (columns)
bsize: For large datasets in sparse format, this parameter controls how many observations are processed at once. Increasing this value speeds up computation but uses more memory
verbose: Whether to display progress messages and additional execution details

Running Methods#

Individual methods#

As an example, the Gene Set Enrichment Analysis (gsea) method [LSP+11], one of the most widely used statistical approaches, will be run first.

dc.mt.gsea(data=adata, net=net, tmin=0)

The resulting enrichment scores can then be visualized.

scores = dc.pp.get_obsm(adata, key="score_gsea")
sc.pl.heatmap(
    adata=scores,
    groupby="group",
    var_names=scores.var_names,
    cmap="RdBu_r",
    vcenter=0,
)

../_images/422bf3c93cdc93721441d082a060d34940567aaffed4b4f02c67c04268dded34.png

For TFs T1 and T3, the inferred activities effectively distinguish between the two groups of cells. In contrast, T2 should exhibit negative enrichment scores in the “B” group due to its role as a repressor.

The misestimation of T2’s activity arises from the limitation of GSEA, which does not account for weights when inferring enrichment scores.

When weights are also available in the prior knowledge network, it is recommended to use methods that incorporate this information to obtain more accurate estimates. One such method is the Univariate Linear Model (ULM).

dc.mt.ulm(data=adata, net=net, tmin=0)

As before, score can be plotted per cell.

scores = dc.pp.get_obsm(adata, key="score_ulm")
sc.pl.heatmap(
    adata=scores,
    groupby="group",
    var_names=scores.var_names,
    cmap="RdBu_r",
    vcenter=0,
)

../_images/c731afff2ed97eefd7e2edb6e125a8acc1c33a570f59b716484e55eeb5fbdc07.png

As ulm incorporates weights when estimating enrichment scores, it accurately identifies the repressor T2 as inactive in the “B” group.

Multiple Methods#

decoupler enables the sequential execution of multiple methods.

For example, this line runs all available methods.

dc.mt.decouple(
    data=adata,
    net=net,
    methods="all",
    tmin=0,
)

It also can compute a consensus score based on the inferred activities across methods.

dc.mt.consensus(result=adata)

The resulting score can be visualized.

scores = dc.pp.get_obsm(adata, key="score_consensus")
sc.pl.heatmap(
    adata=scores,
    groupby="group",
    var_names=scores.var_names,
    cmap="RdBu_r",
    vcenter=0,
)

../_images/573f5db1e4706759683b41719970926f30856c73ac5ab4ecbd82fcc957c84ddc.png

The consensus score correctly shows that T1 and T2 are active in the “A” group, while T3 and T4 are active in “B”. T5 is consistently inactive across all samples, as expected.

Access to prior knowledge#

Enrichment analysis requires omics readouts and feature sets. However, feature set databases are distributed across multiple sources with varying formats and access methods, making their use more complex.

decoupler addresses this by leveraging OmniPath, a metadatabase that integrates and processes data from numerous sources [TureiKSR16]. The decoupler.op module provides a unified interface to query any database available within OmniPath.

Here is the list of available resources.

dc.op.show_resources()

	name	license
0	Adhesome	commercial
1	Almen2009	commercial
2	Baccin2019	academic
3	CORUM_Funcat	academic
4	CORUM_GO	academic
...	...	...
76	iTALK	academic
77	kinase.com	non_profit
78	scConnect	commercial
79	scConnect_complex	commercial
80	talklr	commercial

81 rows × 2 columns

As an example, SIGNOR, a database of manually curated causal relationships between human proteins and biological phenotypes, can be downloaded.

dc.op.resource("SIGNOR")

	genesymbol	pathway

0	ABL1	Cell cycle: G2/M phase transition
1	ACE	Focal segmental glomerulosclerosis
2	ACTB	Axon guidance
3	ACTN1	Axon guidance
4	ACTN1	Glutamatergic synapse
...	...	...
2943	YY1	NOTCH Signaling
2944	ZAP70	T cell activation
2945	ZAP70	P38 Signaling
2946	ZBTB16	Acute Myeloid Leukemia
2947	ZFYVE9	TGF-beta Signaling

2948 rows × 2 columns

Conclusion#

Enrichment anaysis helps summarizing high-dimensional hard-to-interpret omics features into interpretable terms such as TFs or pathways.

In the original manuscript of decoupler [BiMVSB+22], we benchmarked multiple methods and found that the univariate linear model (ulm) outperformed the others. Therefore, we will use across the other vignettes.

Additionaly, decoupler provides streamlined access to a wide range of prior knowledge databases via the OmniPath metadatabase.

Usage

Contents