Usage#
decoupler provides various statistical methods to infer enrichment scores
from omics data using prior knowledge [BiMVSB+22].
It is a package from the scverse ecosystem [VBH+23],
designed for easy interoperability with anndata and scanpy [WAT18].
Enrichment analysis tests whether a specific set of omics features is “overrepresented” or “coordinated” in the measured data compared to a background distribution. These sets are predefined based on existing biological knowledge and may vary depending on the omics technology used.
Enrichment analysis can be performed at two distinct resolutions:
Observation level: Feature set enrichment scores are computed individually for each observation or sample within the dataset. This is useful for characterizing continous processes that cannot be easily splitted into groups.
Contrast level: When the dataset includes defined groups, differential testing statistics can be derived between groups and subsequently used to compute a final vector of feature set enrichment scores.
This notebook provides a brief demonstration of inferring enrichment scores at the observation level using a toy dataset. For more detailed tutorials, refer to the other vignettes.
Note
In Jupyter Notebooks and JupyterLab, documentation for a Python function can
be accessed by pressing SHIFT + TAB (twice to expand) or by typing ?<function_name>.
Loading Packages#
Note
The first time decoupler is run in an environment, execution may be slow due to Numba compiling its source code. Subsequent runs are faster, as the compiled code is cached.
import scanpy as sc
import decoupler as dc
sc.set_figure_params(figsize=(3.5, 3.5), frameon=False)
Loading The Dataset#
adata, net = dc.ds.toy()
The obtained anndata.AnnData consist of log-normalized integer transcript counts for 30 cells
with measurements for 20 genes.
adata
AnnData object with n_obs × n_vars = 30 × 20
obs: 'group', 'sample'
Cells are belong to two distinct groups, “A” and “B”, which exhibit differences in gene expression.
sc.pl.heatmap(
adata=adata,
groupby="group",
var_names=adata.var_names,
)
sc.pl.violin(adata, groupby="group", keys=["G01", "G05", "G13"], size=5)
Some genes are more expressed in group A (like G01), others in group B (like G05) and others there is no difference (like G13). Ideally, these differences in gene expression programs should be captured in terms of interpretable biological entities. In this example, this is achieved by summarizing gene expression into transcription factor (TF) enrichment scores.
This toy dataset also includes a simple network comprising five TFs with defined regulatory interactions, either positive or negative, with target genes (Gs).
net
| source | target | weight | |
|---|---|---|---|
| 0 | T1 | G01 | 1.0 |
| 1 | T1 | G02 | 1.0 |
| 2 | T1 | G03 | 0.7 |
| 3 | T2 | G04 | 1.0 |
| 4 | T2 | G06 | -0.5 |
| 5 | T2 | G07 | -3.0 |
| 6 | T2 | G08 | -1.0 |
| 7 | T3 | G06 | 1.0 |
| 8 | T3 | G07 | 0.5 |
| 9 | T3 | G08 | 1.0 |
| 10 | T4 | G05 | 1.9 |
| 11 | T4 | G10 | -1.5 |
| 12 | T4 | G11 | -2.0 |
| 13 | T4 | G09 | 3.1 |
| 14 | T5 | G09 | 0.7 |
| 15 | T5 | G10 | 1.1 |
| 16 | T5 | G11 | 0.1 |
This network can also be visualized like a graph.
dc.pl.network(net, size_node=15, s_cmap="white", t_cmap="white", c_pos_w="darkgreen", c_neg_w="darkred", figsize=(5, 5))
Based on this network, and the observed pattern of gene expression, cells in group “A” are expected to exhibit positive activity for T1 and T2, while cells in group “B” should show high activity for T3 and T4. T5 is not expected to display activity in any of the cells.
Enrichment Methods#
decoupler contains multiple enrichment methods.
dc.mt.show()
| name | desc | stype | weight | test | limits | reference | |
|---|---|---|---|---|---|---|---|
| 0 | aucell | AUCell | categorical | False | False | (0, 1) | https://doi.org/10.1038/nmeth.4463 |
| 1 | gsea | Gene Set Enrichment Analysis (GSEA) | numerical | False | True | (-inf, inf) | https://doi.org/10.1073/pnas.0506580102 |
| 2 | gsva | Gene Set Variation Analysis (GSVA) | numerical | False | False | (-1, 1) | https://doi.org/10.1186/1471-2105-14-7 |
| 3 | mdt | Multivariate Decision Tree (MDT) | numerical | True | False | (0, 1) | https://doi.org/10.1093/bioadv/vbac016 |
| 4 | mlm | Multivariate Linear Model (MLM) | numerical | True | True | (-inf, inf) | https://doi.org/10.1093/bioadv/vbac016 |
| 5 | ora | Over Representation Analysis (ORA) | categorical | False | True | (-inf, inf) | https://doi.org/10.2307/2340521 |
| 6 | udt | Univariate Decision Tree (UDT) | numerical | True | False | (0, 1) | https://doi.org/10.1093/bioadv/vbac016 |
| 7 | ulm | Univariate Linear Model (ULM) | numerical | True | True | (-inf, inf) | https://doi.org/10.1093/bioadv/vbac016 |
| 8 | viper | Virtual Inference of Protein-activity by Enric... | numerical | True | True | (-inf, inf) | https://doi.org/10.1038/ng.3593 |
| 9 | waggr | Weighted Aggregate (WAGGR) | numerical | True | True | (-inf, inf) | https://doi.org/10.1093/bioadv/vbac016 |
| 10 | zscore | Z-score (ZSCORE) | numerical | True | True | (-inf, inf) | https://doi.org/10.1038/s41467-021-21211-6 |
Each method infers enrichment scores using different types of statistic (categorical or numerical), some incorporate feature weights, others perform statistical testing and return p-values, and the resulting scores may vary in their numerical range.
Consequently, methods may have specific arguments,
which can be viewed by running ?decoupler.mt.<method_name>.
To have a unified framework, methods inside decoupler have these shared arguments:
data: AnnData instance, DataFrame or tuple of [matrix, samples, features]net: Dataframe in long format. Must includesourceandtargetcolumns, and optionally aweightcolumn:source: feature set names, for example: TFs, pathways, cell types, etc.target: feature names, for example: genes regulated by a TF, phosphosites modified by a Kinase, etc.weight: (optional): strength of the source-target interaction. Will be used by enrichment methods that model feature weights. Can be any value between -∞ and +∞
tmin: Minimum number of targets per source. Sources with fewer targets will be removedraw: Whether to use the.rawattribute ofanndata.AnnDataempty: Whether to remove empty observations (rows) and features (columns)bsize: For large datasets in sparse format, this parameter controls how many observations are processed at once. Increasing this value speeds up computation but uses more memoryverbose: Whether to display progress messages and additional execution details
Running Methods#
Individual methods#
As an example, the Gene Set Enrichment Analysis (gsea) method [LSP+11],
one of the most widely used statistical approaches,
will be run first.
dc.mt.gsea(data=adata, net=net, tmin=0)
The resulting enrichment scores can then be visualized.
scores = dc.pp.get_obsm(adata, key="score_gsea")
sc.pl.heatmap(
adata=scores,
groupby="group",
var_names=scores.var_names,
cmap="RdBu_r",
vcenter=0,
)
For TFs T1 and T3, the inferred activities effectively distinguish between the two groups of cells. In contrast, T2 should exhibit negative enrichment scores in the “B” group due to its role as a repressor.
The misestimation of T2’s activity arises from the limitation of GSEA, which does not account for weights when inferring enrichment scores.
When weights are also available in the prior knowledge network, it is recommended to use methods that incorporate this information to obtain more accurate estimates. One such method is the Univariate Linear Model (ULM).
dc.mt.ulm(data=adata, net=net, tmin=0)
As before, score can be plotted per cell.
scores = dc.pp.get_obsm(adata, key="score_ulm")
sc.pl.heatmap(
adata=scores,
groupby="group",
var_names=scores.var_names,
cmap="RdBu_r",
vcenter=0,
)
As ulm incorporates weights when estimating enrichment scores,
it accurately identifies the repressor T2 as inactive in the “B” group.
Multiple Methods#
decoupler enables the sequential execution of multiple methods.
For example, this line runs all available methods.
dc.mt.decouple(
data=adata,
net=net,
methods="all",
tmin=0,
)
It also can compute a consensus score based on the inferred activities across methods.
dc.mt.consensus(result=adata)
The resulting score can be visualized.
scores = dc.pp.get_obsm(adata, key="score_consensus")
sc.pl.heatmap(
adata=scores,
groupby="group",
var_names=scores.var_names,
cmap="RdBu_r",
vcenter=0,
)
The consensus score correctly shows that T1 and T2 are active in the “A” group, while T3 and T4 are active in “B”. T5 is consistently inactive across all samples, as expected.
Access to prior knowledge#
Enrichment analysis requires omics readouts and feature sets. However, feature set databases are distributed across multiple sources with varying formats and access methods, making their use more complex.
decoupler addresses this by leveraging
OmniPath, a metadatabase that
integrates and processes data from numerous sources [TureiKSR16].
The decoupler.op module provides a unified interface
to query any database available within OmniPath.
Here is the list of available resources.
dc.op.show_resources()
| name | license | |
|---|---|---|
| 0 | Adhesome | commercial |
| 1 | Almen2009 | commercial |
| 2 | Baccin2019 | academic |
| 3 | CORUM_Funcat | academic |
| 4 | CORUM_GO | academic |
| ... | ... | ... |
| 76 | iTALK | academic |
| 77 | kinase.com | non_profit |
| 78 | scConnect | commercial |
| 79 | scConnect_complex | commercial |
| 80 | talklr | commercial |
81 rows × 2 columns
As an example, SIGNOR, a database of manually curated causal relationships between human proteins and biological phenotypes, can be downloaded.
dc.op.resource("SIGNOR")
| genesymbol | pathway | |
|---|---|---|
| 0 | ABL1 | Cell cycle: G2/M phase transition |
| 1 | ACE | Focal segmental glomerulosclerosis |
| 2 | ACTB | Axon guidance |
| 3 | ACTN1 | Axon guidance |
| 4 | ACTN1 | Glutamatergic synapse |
| ... | ... | ... |
| 2943 | YY1 | NOTCH Signaling |
| 2944 | ZAP70 | T cell activation |
| 2945 | ZAP70 | P38 Signaling |
| 2946 | ZBTB16 | Acute Myeloid Leukemia |
| 2947 | ZFYVE9 | TGF-beta Signaling |
2948 rows × 2 columns
Conclusion#
Enrichment anaysis helps summarizing high-dimensional hard-to-interpret omics features into interpretable terms such as TFs or pathways.
In the original manuscript of decoupler [BiMVSB+22], we benchmarked multiple methods
and found that the univariate linear model (ulm) outperformed the others. Therefore, we will use
across the other vignettes.
Additionaly, decoupler provides streamlined access to a wide range of prior
knowledge databases via the OmniPath metadatabase.