Benchmarking pipeline#

Enrichment analysis involves combining a statistical enrichment method with a prior knowledge network that defines sets of features. A key question is whether the choice of enrichment method or the prior network influences the performance of the resulting scores. When perturbation experiments are available, enrichment methods and prior networks can be benchmarked by evaluating their ability to recover known perturbed regulators.

This notebook demonstrates the use of decoupler for benchmarking enrichment methods and prior knowledge networks using transcription factor (TF) perturbation experiments.

In the first part of this vignette, a toy dataset is employed to benchmark enrichment methods using a fixed prior knowledge network. In the second part, single-gene perturbation data from the KnockTF2 database, a literature-curated resource of perturbation experiments [FSL+19], is used to evaluate various prior knowledge networks.

Evaluation metrics#

To evaluate the enrichment scores obtained from a given network-method combination, their ability to distinguish perturbed from non-perturbed TFs is assessed based on the given ground truth. For instance, TFs identified in the ground truth as positively perturbed (e.g., overexpressed, phosphorylated) are expected to exhibit positive enrichment scores, whereas negatively perturbed TFs (e.g., knockdown, knockout) should display negative enrichment scores. This classification performance can be quantified using three complementary metrics.

Area Under the Curve#

The Area Under the Curve (auc) metric includes both the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC). Each curve is generated by applying varying significance thresholds to the enrichment scores, and the corresponding area under each curve is computed. Since AUPRC values are sensitive to class imbalance, which may vary across networks due to differences in TF coverage, a modified version of AUPRC is used, in which the performance of a random classifier is set to 0.5 [SFHG+20]. Similarly, the expected AUROC for a random classifier is also 0.5.

A summary auc score can be calculated as the harmonic mean of the AUROC and AUPRC values.

Precision and Recall (F-Score)#

This score first classifies TFs into the following categories:

  • True Positive (TP): TF is perturbed, and TF has significant enrichment score in the appropiate direction (+ or -)

  • False Positive (FP): TF is not perturbed, but TF has significant enrichment score in any direction

  • False Negative (FN): TF is perturbed, but TF does not have significant enrichment score in the appropiate direction (+ or -)

Precision and recall are then calculated as follows:

Precision=TP/(TP+FP)Precision = TP / (TP + FP)

Recall=TP/(TP+FN)Recall = TP / (TP + FN)

A summary score is computed as the harmonic mean between the precision and recall values, which is traditionally referred to as the F-score (fscore).

Quantile Normalized Ranks#

This score is computed by first ranking the enrichment scores, applying quantile normalization, and then reversing the values by subtracting them from 1. As a result, the transformed scores range from 0 to 1, with higher-magnitude enrichment scores assigned values closer to 1, and lower-magnitude scores closer to 0.

The transformed scores are then grouped into two categories: TFs that belong to the ground truth (perturbed) and those that do not (non-perturbed). A one-sided p-value is calculated using the Wilcoxon rank-sum test to assess whether perturbed TFs tend to have higher quantile-normalized ranks than non-perturbed TFs. The output includes both the mean reverse-quantile-normalized rank of the perturbed TFs and the p-value from the statistical test.

A summary qrank score is then calculated as the harmonic mean of the mean reverse-quantile-normalized rank and the −log10(p-value).

Loading Packages#

import numpy as np

import decoupler as dc

Evaluation of methods#

Loading the dataset#

adata, net = dc.ds.toy_bench()
adata
AnnData object with n_obs × n_vars = 30 × 20
    obs: 'group', 'sample', 'source', 'class', 'type_p'

The obtained anndata.AnnData consist of log-normalized gene expression values for 20 genes across 30 observations.

We can inspect the cell metadata stored in anndata.AnnData.obs.

adata.obs.head(20)
group sample source class type_p
C01 A S01 [T1, T2] CA 1.0
C02 A S02 [T1, T2] CB 1.0
C03 A S01 [T1, T2] CA 1.0
C04 A S03 [T1, T2] CB 1.0
C05 A S02 [T1, T2] CA 1.0
C06 A S01 [T1, T2] CB 1.0
C07 A S01 [T1, T2] CA 1.0
C08 A S02 [T1, T2] CB 1.0
C09 A S02 [T1, T2] CA 1.0
C10 A S03 [T1, T2] CB 1.0
C11 A S01 [T1, T2] CA 1.0
C12 A S03 [T1, T2] CB 1.0
C13 A S02 [T1, T2] CA 1.0
C14 A S03 [T1, T2] CB 1.0
C15 A S01 [T1, T2] CA 1.0
C16 B S01 [T3, T4] CB 1.0
C17 B S01 [T3, T4] CA 1.0
C18 B S02 [T3, T4] CB 1.0
C19 B S03 [T3, T4] CA 1.0
C20 B S01 [T3, T4] CB 1.0

The source column contains the ground truth for each observation, while the type_p column indicates the type of perturbation.

In this dataset, group A is expected to exhibit high positive enrichment scores for transcription factors T1 and T2, whereas group B is expected to do so for T3 and T4.

Evaluation#

To evaluate method performance, enrichment scores are inferred for each method and compared to the ground truth to compute the three distinct metrics.

In this case, a random baseline method is included to illustrate how a purely random approach compares to the other methods. This is implemented using the waggr method, which supports custom functions for computing enrichment scores. This functionality also facilitates the testing of new enrichment methods without requiring full implementation from scratch.

It is also possible to specify that a consensus score should be computed based on the results across methods.

def random(x, w):
    """Random baseline method"""
    return np.random.rand(1)[0]


df = dc.bm.benchmark(
    adata=adata,
    net=net,
    kws_decouple={"cons": True, "tmin": 3, "args": {"waggr": {"fun": random}}},
)
df
method metric score
0 ora auroc 0.665370
1 ora auprc 0.601464
2 ora precision 0.400000
3 ora recall 1.000000
4 ora 1-qrank 0.711667
... ... ... ...
67 consensus auprc 0.931623
68 consensus precision 0.000000
69 consensus recall 0.000000
70 consensus 1-qrank 0.863333
71 consensus -log10(pval) 19.312727

72 rows × 3 columns

Results across the different metrics can then be visualized. Here is the plot for the auc metric.

dc.bm.pl.auc(df, hue="method")
../../_images/23677d34b8cfad14502557ecb1d3a74614659fad4afa134a41c8a246a07c2a8c.png

The auc metric includes both the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC). Points positioned in the top right of the plot correspond to better performance scores, while those near the dashed lines indicate lower performance. The dashed lines represent the expected performance of a random classifier. As expected, the random method waggr performs the worst.

Now, here is the plot for the fscore metric.

dc.bm.pl.fscore(df, hue="method")
../../_images/12e6dbeeaaf8728f0ab49cce5d76ce1a985843174fe2bd14e3c717fdb1d5a5c0.png

The fscore metric includes precision and recall, calculated after filtering for significant enrichment scores. As with the auc metric, points located in the top right of the plot indicate better overall performance. In this case, certain methods, such as aucell, exhibit high precision but low recall, while others show the opposite trend, such as ora. Methods such as ulm and mlm achieve a balanced performance with reasonable values for both metrics. As expected, the waggr method, representing a random baseline, performs worse than all other methods.

Finally, here is the plot for the qrank metric.

dc.bm.pl.qrank(df, hue="method")
../../_images/b587636e6d16f96037e2c019c33c121366e360fc23fc7bf7902202b54d47599e.png

The qrank metric comprises two components, the inverse of the quantile-normalized rank of the ground truth sources (1-qrank) and the log-transformed one-sided p-value obtained from testing for differences between the ranks of sources belonging to the ground truth and those that do not (-log10(pval)). Higher values of 1-qrank and of −log10(pval) indicate better performance. Dashed lines represent the expected values for a random method, 0.5 for 1-qrank and -log10(0.05) for -log10(pval). As in previous cases, the waggr baseline behaves as a random classifier.

These pair of metrics can then be summarized using the harmonic mean between them, generating 3 aggragte metrics.

hdf = dc.bm.metric.hmean(df)
hdf
method auprc auroc H(auroc, auprc) precision recall F-score -log10(pval) 1-qrank H(1-qrank, -log10(pval)) score
0 aucell 0.637162 0.675000 0.644387 0.947368 3.000000e-01 0.661765 3.876148 0.705000 2.040490 0.462240
1 consensus 0.931623 0.939259 0.933140 0.000000 7.733222e-17 0.000000 19.312727 0.863333 3.661881 0.777099
2 gsea 0.713912 0.830741 0.734573 0.644444 4.833333e-01 0.604167 10.908073 0.793333 3.072753 0.730810
3 gsva 0.745393 0.742407 0.744794 0.866667 4.333333e-01 0.722222 8.068603 0.763333 2.768867 0.686519
4 mdt 0.640447 0.682778 0.648488 0.500000 2.500000e-01 0.416667 4.313612 0.713333 2.146691 0.428243
5 mlm 0.935272 0.941667 0.936544 0.928571 6.500000e-01 0.855263 19.775370 0.866667 3.686995 1.000000
6 ora 0.601464 0.665370 0.613244 0.400000 1.000000e+00 0.454545 4.575803 0.711667 2.193641 0.440749
7 udt 0.713647 0.827870 0.733899 0.500000 2.500000e-01 0.416667 9.910984 0.783333 2.975857 0.658912
8 ulm 0.935204 0.939630 0.936085 0.933333 4.666667e-01 0.777778 19.775370 0.866667 3.686995 0.980342
9 viper 0.927177 0.930556 0.927851 0.925000 6.166667e-01 0.840909 18.404501 0.856667 3.611012 0.975023
10 waggr 0.567476 0.559815 0.565927 0.000000 0.000000e+00 0.000000 1.077624 0.640000 0.947981 0.000000
11 zscore 0.856637 0.833333 0.851873 0.933333 4.666667e-01 0.777778 14.204125 0.823333 3.341837 0.872048

And these can also be visualized.

dc.bm.pl.bar(df=hdf, x="H(auroc, auprc)", y="method", hue="method")
dc.bm.pl.bar(df=hdf, x="F-score", y="method", hue="method")
dc.bm.pl.bar(df=hdf, x="H(1-qrank, -log10(pval))", y="method", hue="method")
../../_images/e6accb4fea25aa166e0b5d97a997485c046c05a6ce90f103234d9ac814c344db.png ../../_images/05709eee7e62251f8b34387dde2e7e5457dbb532cda794ae50fd1a8e572d56db.png ../../_images/b97b8868299477abced34f8ebdb6debff0a9da2f036be4c4c348edd9bae9f2e1.png

Finally, the individual summary metrics can be aggregated into a single overall score, calculated as the quantile-normalized rank of their mean.

dc.bm.pl.summary(df=hdf, y="method", figsize=(6, 3))
../../_images/7c0a60e2bdac44c05222fdebd28514faa05473d8a16bb105fa16931570de7844.png

In summary, enrichment methods based on simple linear models, mlm and ulm, outperform other approaches in distinguishing perturbed from unperturbed TFs.

These findings are consistent with those reported in the original publication of decoupler, which used real TF perturbation data during evaluation [BiMVSB+22].

Evaluation of Networks#

Loading the Dataset#

adata = dc.ds.knocktf()
adata
AnnData object with n_obs × n_vars = 388 × 21985
    obs: 'source', 'Species', 'Knock.Method', 'Biosample.Name', 'Profile.ID', 'Platform', 'TF.Class', 'TF.Superclass', 'Tissue.Type', 'Biosample.Type', 'Data.Source', 'Pubmed.ID', 'logFC', 'type_p'

The obtained anndata.AnnData consist of log2 fold changes for ~22k genes across ~1k experiments.

We can inspect the cell metadata stored in anndata.AnnData.obs.

adata.obs
source Species Knock.Method Biosample.Name Profile.ID Platform TF.Class TF.Superclass Tissue.Type Biosample.Type Data.Source Pubmed.ID logFC type_p
DataSet_01_006 PROX1 Homo sapiens siRNA Primary lymphatic endothelial cells GSE12846 GPL570 Homeo domain factors Helix-turn-helix domains Foreskin Primary cell GEO 18815287 -1.139770 -1
DataSet_01_007 NR2F2 Homo sapiens siRNA Primary lymphatic endothelial cells GSE12846 GPL570 Nuclear receptors with C4 zinc fingers Zinc-coordinating DNA-binding domains Foreskin Primary cell GEO 18815287 -1.346960 -1
DataSet_01_008 WT1 Homo sapiens siRNA PLC/PRF/5 GSE12886 GPL570 C2H2 zinc finger factors Zinc-coordinating DNA-binding domains Liver Cell line GEO 19190340 -1.114740 -1
DataSet_01_010 TP53 Homo sapiens shRNA iPS cells GSE13334 GPL4133 p53 domain factors Immunoglobulin fold Dermal_fibroblasts Induced pluripotent stem cell GEO 19668191 -3.137450 -1
DataSet_01_011 TP53 Homo sapiens siRNA U251 GSE13991 GPL1426 p53 domain factors Immunoglobulin fold Central_nervous_system Cell line GEO 19139068 -2.438950 -1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
DataSet_04_032 FUBP1 Homo sapiens shRNA HepG2 ENCSR736TAW - Others ENCODE_TF Liver Cell line ENCODE 22955616 -1.270010 -1
DataSet_04_036 SMAD2 Homo sapiens CRISPR HepG2 ENCSR036ANR - SMAD/NF-1 DNA-binding domain factors beta-Hairpin exposed by an alpha/beta-scaffold Liver Cell line ENCODE 22955616 -1.559244 -1
DataSet_04_037 SMAD4 Homo sapiens CRISPR HepG2 ENCSR700FWD - SMAD/NF-1 DNA-binding domain factors beta-Hairpin exposed by an alpha/beta-scaffold Liver Cell line ENCODE 22955616 -1.404555 -1
DataSet_04_039 SRPK2 Homo sapiens CRISPR HepG2 ENCSR929EIP - Others ENCODE_TF Liver Cell line ENCODE 22955616 -1.392551 -1
DataSet_04_041 YBX1 Homo sapiens CRISPR HepG2 ENCSR548OTL - Cold-shock domain factors beta-Barrel DNA-binding domains Liver Cell line ENCODE 22955616 -2.025170 -1

388 rows × 14 columns

Loading the Network#

For this example, we will load two prior knowledge gene regulatory networks, CollecTRI [MullerDTV+23] and DoRoThEA [GAHI+19] .

ct = dc.op.collectri()
do = dc.op.dorothea()
ct.head()
source target weight resources references sign_decision
0 MYC TERT 1.0 DoRothEA-A;ExTRI;HTRI;NTNU.Curated;Pavlidis202... 10022128;10491298;10606235;10637317;10723141;1... PMID
1 SPI1 BGLAP 1.0 ExTRI 10022617 default activation
2 SMAD3 JUN 1.0 ExTRI;NTNU.Curated;TFactS;TRRUST 10022869;12374795 PMID
3 SMAD4 JUN 1.0 ExTRI;NTNU.Curated;TFactS;TRRUST 10022869;12374795 PMID
4 STAT5A IL2 1.0 ExTRI 10022878;11435608;17182565;17911616;22854263;2... default activation
do.head()
source target weight confidence
0 MYC TERT 1.0 A
1 FOS NTS 1.0 A
2 FOS NTF3 1.0 A
3 FOS NFKB1 -1.0 A
4 FOS NEFL 1.0 A

Evaluation#

In this case, two analogous random networks will be generated to serve as random baselines.

rct = dc.pp.shuffle_net(ct)
rdo = dc.pp.shuffle_net(do)

And also each original network but ignoring their mode of regulation (weight) information.

unw_ct = ct.drop(columns="weight")
unw_do = do.drop(columns="weight")

To supply multiple networks to the benchmarking function, they must first be organized into a dictionary of networks. Additionally, only the ulm method is used in this case to reduce computational time.

df = dc.bm.benchmark(
    adata=adata,
    net={
        "collectri": ct,
        "dorothea": do,
        "unw_collectri": unw_ct,
        "unw_dorothea": unw_do,
        "r_collectri": rct,
        "r_dorothea": rdo,
    },
    kws_decouple={
        "methods": "ulm",
    },
)
df.head(20)
net metric score
0 collectri auroc 0.719133
1 collectri auprc 0.760249
2 collectri precision 0.007339
3 collectri recall 0.315412
4 collectri 1-qrank 0.728525
5 collectri -log10(pval) 39.321597
0 dorothea auroc 0.687706
1 dorothea auprc 0.721169
2 dorothea precision 0.010575
3 dorothea recall 0.359813
4 dorothea 1-qrank 0.692722
5 dorothea -log10(pval) 21.809839
0 unw_collectri auroc 0.658961
1 unw_collectri auprc 0.705306
2 unw_collectri precision 0.005095
3 unw_collectri recall 0.286738
4 unw_collectri 1-qrank 0.674280
5 unw_collectri -log10(pval) 23.350754
0 unw_dorothea auroc 0.626609
1 unw_dorothea auprc 0.678251

Results across the different metrics can then be visualized. Here is the plot for the auc metric.

dc.bm.pl.auc(df, hue="net")
../../_images/f9e038a7c76e1cc2aecd08f48bfc0886e3189e4db96599842e236047e99ab2bd.png

Similarly, here is the plot for the fscore metric.

dc.bm.pl.fscore(df, hue="net")
../../_images/56f155924edcba269d422736f42a2c82e5ef1635a382441de6fc2c8fc85c31c4.png

And for the qrank metric.

dc.bm.pl.qrank(df, hue="net")
../../_images/00e7f7c40d0717933875deb2b70f03b89da30141d366f72f91b2f15c1b40f2b8.png

As expected, the random networks are unable to distinguish between perturbed and unperturbed TFs based on the obtained enrichment scores.

Next, the summary statistics can be generated.

hdf = dc.bm.metric.hmean(df)
hdf
net auprc auroc H(auroc, auprc) precision recall F-score -log10(pval) 1-qrank H(1-qrank, -log10(pval)) score
0 collectri 0.760249 0.719133 0.751654 0.007339 0.315412 0.009120 39.321597 0.728525 3.391298 1.000000
1 dorothea 0.721169 0.687706 0.714218 0.010575 0.359813 0.013123 21.809839 0.692722 3.073169 0.897834
2 r_collectri 0.553499 0.536420 0.549997 0.002139 0.075269 0.002655 2.311545 0.545284 1.402780 0.361641
3 r_dorothea 0.523412 0.495025 0.517477 0.005282 0.103286 0.006519 0.161909 0.491971 0.187001 0.000000
4 unw_collectri 0.705306 0.658961 0.695523 0.005095 0.286738 0.006340 23.350754 0.674280 3.022308 0.875649
5 unw_dorothea 0.678251 0.626609 0.667253 0.008210 0.317757 0.010196 12.184879 0.641430 2.649298 0.760155

And visualized.

dc.bm.pl.bar(df=hdf, x="H(auroc, auprc)", y="net", hue="net")
dc.bm.pl.bar(df=hdf, x="F-score", y="net", hue="net")
dc.bm.pl.bar(df=hdf, x="H(1-qrank, -log10(pval))", y="net", hue="net")
../../_images/4ff9e4b620c770829ab8bc49418b6192ebc15324c2b0d4aa38ef6b5c0be06345.png ../../_images/06b70a6eabedc366f00c5a3b3ad75abf160978667b28096c0500d074c803e185.png ../../_images/32354291efe78a2ed12e2bc313402adec02c999c2b23bcd4dbe470f3c61e25eb.png

Finally, the individual summary metrics can be aggregated into a single overall score, calculated as the quantile-normalized rank of their mean.

dc.bm.pl.summary(df=hdf, y="net", figsize=(7, 3))
../../_images/376ac7e0a111ae6c710932ec09cb3c9c5ea959614c71bb1e94029f836e67fb31.png

As expected, either removing the mode of regulation (weight) or shuffling target genes leads to a decrease in evaluation performance.