Benchmarking pipeline#

Enrichment analysis involves combining a statistical enrichment method with a prior knowledge network that defines sets of features. A key question is whether the choice of enrichment method or the prior network influences the performance of the resulting scores. When perturbation experiments are available, enrichment methods and prior networks can be benchmarked by evaluating their ability to recover known perturbed regulators.

This notebook demonstrates the use of decoupler for benchmarking enrichment methods and prior knowledge networks using transcription factor (TF) perturbation experiments.

In the first part of this vignette, a toy dataset is employed to benchmark enrichment methods using a fixed prior knowledge network. In the second part, single-gene perturbation data from the KnockTF2 database, a literature-curated resource of perturbation experiments [FSL+19], is used to evaluate various prior knowledge networks.

Evaluation metrics#

To evaluate the enrichment scores obtained from a given network-method combination, their ability to distinguish perturbed from non-perturbed TFs is assessed based on the given ground truth. For instance, TFs identified in the ground truth as positively perturbed (e.g., overexpressed, phosphorylated) are expected to exhibit positive enrichment scores, whereas negatively perturbed TFs (e.g., knockdown, knockout) should display negative enrichment scores. This classification performance can be quantified using three complementary metrics.

Area Under the Curve#

The Area Under the Curve (auc) metric includes both the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC). Each curve is generated by applying varying significance thresholds to the enrichment scores, and the corresponding area under each curve is computed. Since AUPRC values are sensitive to class imbalance, which may vary across networks due to differences in TF coverage, a modified version of AUPRC is used, in which the performance of a random classifier is set to 0.5 [SFHG+20]. Similarly, the expected AUROC for a random classifier is also 0.5.

A summary auc score can be calculated as the harmonic mean of the AUROC and AUPRC values.

Precision and Recall (F-Score)#

This score first classifies TFs into the following categories:

True Positive (TP): TF is perturbed, and TF has significant enrichment score in the appropiate direction (+ or -)
False Positive (FP): TF is not perturbed, but TF has significant enrichment score in any direction
False Negative (FN): TF is perturbed, but TF does not have significant enrichment score in the appropiate direction (+ or -)

Precision and recall are then calculated as follows:

$Precision = TP / (TP + FP)$

$Recall = TP / (TP + FN)$

A summary score is computed as the harmonic mean between the precision and recall values, which is traditionally referred to as the F-score (fscore).

Quantile Normalized Ranks#

This score is computed by first ranking the enrichment scores, applying quantile normalization, and then reversing the values by subtracting them from 1. As a result, the transformed scores range from 0 to 1, with higher-magnitude enrichment scores assigned values closer to 1, and lower-magnitude scores closer to 0.

The transformed scores are then grouped into two categories: TFs that belong to the ground truth (perturbed) and those that do not (non-perturbed). A one-sided p-value is calculated using the Wilcoxon rank-sum test to assess whether perturbed TFs tend to have higher quantile-normalized ranks than non-perturbed TFs. The output includes both the mean reverse-quantile-normalized rank of the perturbed TFs and the p-value from the statistical test.

A summary qrank score is then calculated as the harmonic mean of the mean reverse-quantile-normalized rank and the −log10(p-value).

Loading Packages#

import numpy as np

import decoupler as dc

Evaluation of methods#

Loading the dataset#

adata, net = dc.ds.toy_bench()
adata

AnnData object with n_obs × n_vars = 30 × 20
    obs: 'group', 'sample', 'source', 'class', 'type_p'

The obtained anndata.AnnData consist of log-normalized gene expression values for 20 genes across 30 observations.

We can inspect the cell metadata stored in anndata.AnnData.obs.

adata.obs.head(20)

	group	sample	source	class	type_p
C01	A	S01	[T1, T2]	CA	1.0
C02	A	S02	[T1, T2]	CB	1.0
C03	A	S01	[T1, T2]	CA	1.0
C04	A	S03	[T1, T2]	CB	1.0
C05	A	S02	[T1, T2]	CA	1.0
C06	A	S01	[T1, T2]	CB	1.0
C07	A	S01	[T1, T2]	CA	1.0
C08	A	S02	[T1, T2]	CB	1.0
C09	A	S02	[T1, T2]	CA	1.0
C10	A	S03	[T1, T2]	CB	1.0
C11	A	S01	[T1, T2]	CA	1.0
C12	A	S03	[T1, T2]	CB	1.0
C13	A	S02	[T1, T2]	CA	1.0
C14	A	S03	[T1, T2]	CB	1.0
C15	A	S01	[T1, T2]	CA	1.0
C16	B	S01	[T3, T4]	CB	1.0
C17	B	S01	[T3, T4]	CA	1.0
C18	B	S02	[T3, T4]	CB	1.0
C19	B	S03	[T3, T4]	CA	1.0
C20	B	S01	[T3, T4]	CB	1.0

The source column contains the ground truth for each observation, while the type_p column indicates the type of perturbation.

In this dataset, group A is expected to exhibit high positive enrichment scores for transcription factors T1 and T2, whereas group B is expected to do so for T3 and T4.

Evaluation#

To evaluate method performance, enrichment scores are inferred for each method and compared to the ground truth to compute the three distinct metrics.

In this case, a random baseline method is included to illustrate how a purely random approach compares to the other methods. This is implemented using the waggr method, which supports custom functions for computing enrichment scores. This functionality also facilitates the testing of new enrichment methods without requiring full implementation from scratch.

It is also possible to specify that a consensus score should be computed based on the results across methods.

def random(x, w):
    """Random baseline method"""
    return np.random.rand(1)[0]


df = dc.bm.benchmark(
    adata=adata,
    net=net,
    kws_decouple={"cons": True, "tmin": 3, "args": {"waggr": {"fun": random}}},
)

df

	method	metric	score
0	ora	auroc	0.665370
1	ora	auprc	0.601464
2	ora	precision	0.400000
3	ora	recall	1.000000
4	ora	1-qrank	0.711667
...	...	...	...
67	consensus	auprc	0.931623
68	consensus	precision	0.000000
69	consensus	recall	0.000000
70	consensus	1-qrank	0.863333
71	consensus	-log10(pval)	19.312727

72 rows × 3 columns

Results across the different metrics can then be visualized. Here is the plot for the auc metric.

dc.bm.pl.auc(df, hue="method")

../../_images/23677d34b8cfad14502557ecb1d3a74614659fad4afa134a41c8a246a07c2a8c.png

The auc metric includes both the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC). Points positioned in the top right of the plot correspond to better performance scores, while those near the dashed lines indicate lower performance. The dashed lines represent the expected performance of a random classifier. As expected, the random method waggr performs the worst.

Now, here is the plot for the fscore metric.

dc.bm.pl.fscore(df, hue="method")

../../_images/12e6dbeeaaf8728f0ab49cce5d76ce1a985843174fe2bd14e3c717fdb1d5a5c0.png

The fscore metric includes precision and recall, calculated after filtering for significant enrichment scores. As with the auc metric, points located in the top right of the plot indicate better overall performance. In this case, certain methods, such as aucell, exhibit high precision but low recall, while others show the opposite trend, such as ora. Methods such as ulm and mlm achieve a balanced performance with reasonable values for both metrics. As expected, the waggr method, representing a random baseline, performs worse than all other methods.

Finally, here is the plot for the qrank metric.

dc.bm.pl.qrank(df, hue="method")

../../_images/b587636e6d16f96037e2c019c33c121366e360fc23fc7bf7902202b54d47599e.png

The qrank metric comprises two components, the inverse of the quantile-normalized rank of the ground truth sources (1-qrank) and the log-transformed one-sided p-value obtained from testing for differences between the ranks of sources belonging to the ground truth and those that do not (-log10(pval)). Higher values of 1-qrank and of −log10(pval) indicate better performance. Dashed lines represent the expected values for a random method, 0.5 for 1-qrank and -log10(0.05) for -log10(pval). As in previous cases, the waggr baseline behaves as a random classifier.

These pair of metrics can then be summarized using the harmonic mean between them, generating 3 aggragte metrics.

hdf = dc.bm.metric.hmean(df)
hdf

	method	auprc	auroc	H(auroc, auprc)	precision	recall	F-score	-log10(pval)	1-qrank	H(1-qrank, -log10(pval))	score
0	aucell	0.637162	0.675000	0.644387	0.947368	3.000000e-01	0.661765	3.876148	0.705000	2.040490	0.462240
1	consensus	0.931623	0.939259	0.933140	0.000000	7.733222e-17	0.000000	19.312727	0.863333	3.661881	0.777099
2	gsea	0.713912	0.830741	0.734573	0.644444	4.833333e-01	0.604167	10.908073	0.793333	3.072753	0.730810
3	gsva	0.745393	0.742407	0.744794	0.866667	4.333333e-01	0.722222	8.068603	0.763333	2.768867	0.686519
4	mdt	0.640447	0.682778	0.648488	0.500000	2.500000e-01	0.416667	4.313612	0.713333	2.146691	0.428243
5	mlm	0.935272	0.941667	0.936544	0.928571	6.500000e-01	0.855263	19.775370	0.866667	3.686995	1.000000
6	ora	0.601464	0.665370	0.613244	0.400000	1.000000e+00	0.454545	4.575803	0.711667	2.193641	0.440749
7	udt	0.713647	0.827870	0.733899	0.500000	2.500000e-01	0.416667	9.910984	0.783333	2.975857	0.658912
8	ulm	0.935204	0.939630	0.936085	0.933333	4.666667e-01	0.777778	19.775370	0.866667	3.686995	0.980342
9	viper	0.927177	0.930556	0.927851	0.925000	6.166667e-01	0.840909	18.404501	0.856667	3.611012	0.975023
10	waggr	0.567476	0.559815	0.565927	0.000000	0.000000e+00	0.000000	1.077624	0.640000	0.947981	0.000000
11	zscore	0.856637	0.833333	0.851873	0.933333	4.666667e-01	0.777778	14.204125	0.823333	3.341837	0.872048

And these can also be visualized.

dc.bm.pl.bar(df=hdf, x="H(auroc, auprc)", y="method", hue="method")
dc.bm.pl.bar(df=hdf, x="F-score", y="method", hue="method")
dc.bm.pl.bar(df=hdf, x="H(1-qrank, -log10(pval))", y="method", hue="method")

../../_images/e6accb4fea25aa166e0b5d97a997485c046c05a6ce90f103234d9ac814c344db.png

../../_images/05709eee7e62251f8b34387dde2e7e5457dbb532cda794ae50fd1a8e572d56db.png

../../_images/b97b8868299477abced34f8ebdb6debff0a9da2f036be4c4c348edd9bae9f2e1.png

Finally, the individual summary metrics can be aggregated into a single overall score, calculated as the quantile-normalized rank of their mean.

dc.bm.pl.summary(df=hdf, y="method", figsize=(6, 3))

../../_images/7c0a60e2bdac44c05222fdebd28514faa05473d8a16bb105fa16931570de7844.png

In summary, enrichment methods based on simple linear models, mlm and ulm, outperform other approaches in distinguishing perturbed from unperturbed TFs.

These findings are consistent with those reported in the original publication of decoupler, which used real TF perturbation data during evaluation [BiMVSB+22].

Evaluation of Networks#

Loading the Dataset#

adata = dc.ds.knocktf()
adata

AnnData object with n_obs × n_vars = 388 × 21985
    obs: 'source', 'Species', 'Knock.Method', 'Biosample.Name', 'Profile.ID', 'Platform', 'TF.Class', 'TF.Superclass', 'Tissue.Type', 'Biosample.Type', 'Data.Source', 'Pubmed.ID', 'logFC', 'type_p'

The obtained anndata.AnnData consist of log2 fold changes for ~22k genes across ~1k experiments.

We can inspect the cell metadata stored in anndata.AnnData.obs.

adata.obs

	source	Species	Knock.Method	Biosample.Name	Profile.ID	Platform	TF.Class	TF.Superclass	Tissue.Type	Biosample.Type	Data.Source	Pubmed.ID	logFC	type_p
DataSet_01_006	PROX1	Homo sapiens	siRNA	Primary lymphatic endothelial cells	GSE12846	GPL570	Homeo domain factors	Helix-turn-helix domains	Foreskin	Primary cell	GEO	18815287	-1.139770	-1
DataSet_01_007	NR2F2	Homo sapiens	siRNA	Primary lymphatic endothelial cells	GSE12846	GPL570	Nuclear receptors with C4 zinc fingers	Zinc-coordinating DNA-binding domains	Foreskin	Primary cell	GEO	18815287	-1.346960	-1
DataSet_01_008	WT1	Homo sapiens	siRNA	PLC/PRF/5	GSE12886	GPL570	C2H2 zinc finger factors	Zinc-coordinating DNA-binding domains	Liver	Cell line	GEO	19190340	-1.114740	-1
DataSet_01_010	TP53	Homo sapiens	shRNA	iPS cells	GSE13334	GPL4133	p53 domain factors	Immunoglobulin fold	Dermal_fibroblasts	Induced pluripotent stem cell	GEO	19668191	-3.137450	-1
DataSet_01_011	TP53	Homo sapiens	siRNA	U251	GSE13991	GPL1426	p53 domain factors	Immunoglobulin fold	Central_nervous_system	Cell line	GEO	19139068	-2.438950	-1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
DataSet_04_032	FUBP1	Homo sapiens	shRNA	HepG2	ENCSR736TAW	-	Others	ENCODE_TF	Liver	Cell line	ENCODE	22955616	-1.270010	-1
DataSet_04_036	SMAD2	Homo sapiens	CRISPR	HepG2	ENCSR036ANR	-	SMAD/NF-1 DNA-binding domain factors	beta-Hairpin exposed by an alpha/beta-scaffold	Liver	Cell line	ENCODE	22955616	-1.559244	-1
DataSet_04_037	SMAD4	Homo sapiens	CRISPR	HepG2	ENCSR700FWD	-	SMAD/NF-1 DNA-binding domain factors	beta-Hairpin exposed by an alpha/beta-scaffold	Liver	Cell line	ENCODE	22955616	-1.404555	-1
DataSet_04_039	SRPK2	Homo sapiens	CRISPR	HepG2	ENCSR929EIP	-	Others	ENCODE_TF	Liver	Cell line	ENCODE	22955616	-1.392551	-1
DataSet_04_041	YBX1	Homo sapiens	CRISPR	HepG2	ENCSR548OTL	-	Cold-shock domain factors	beta-Barrel DNA-binding domains	Liver	Cell line	ENCODE	22955616	-2.025170	-1

388 rows × 14 columns

Loading the Network#

For this example, we will load two prior knowledge gene regulatory networks, CollecTRI [MullerDTV+23] and DoRoThEA [GAHI+19] .

ct = dc.op.collectri()
do = dc.op.dorothea()

ct.head()

	source	target	weight	resources	references	sign_decision
0	MYC	TERT	1.0	DoRothEA-A;ExTRI;HTRI;NTNU.Curated;Pavlidis202...	10022128;10491298;10606235;10637317;10723141;1...	PMID
1	SPI1	BGLAP	1.0	ExTRI	10022617	default activation
2	SMAD3	JUN	1.0	ExTRI;NTNU.Curated;TFactS;TRRUST	10022869;12374795	PMID
3	SMAD4	JUN	1.0	ExTRI;NTNU.Curated;TFactS;TRRUST	10022869;12374795	PMID
4	STAT5A	IL2	1.0	ExTRI	10022878;11435608;17182565;17911616;22854263;2...	default activation

do.head()

	source	target	weight	confidence
0	MYC	TERT	1.0	A
1	FOS	NTS	1.0	A
2	FOS	NTF3	1.0	A
3	FOS	NFKB1	-1.0	A
4	FOS	NEFL	1.0	A

Evaluation#

In this case, two analogous random networks will be generated to serve as random baselines.

rct = dc.pp.shuffle_net(ct)
rdo = dc.pp.shuffle_net(do)

And also each original network but ignoring their mode of regulation (weight) information.

unw_ct = ct.drop(columns="weight")
unw_do = do.drop(columns="weight")

To supply multiple networks to the benchmarking function, they must first be organized into a dictionary of networks. Additionally, only the ulm method is used in this case to reduce computational time.

df = dc.bm.benchmark(
    adata=adata,
    net={
        "collectri": ct,
        "dorothea": do,
        "unw_collectri": unw_ct,
        "unw_dorothea": unw_do,
        "r_collectri": rct,
        "r_dorothea": rdo,
    },
    kws_decouple={
        "methods": "ulm",
    },
)

df.head(20)

	net	metric	score
0	collectri	auroc	0.719133
1	collectri	auprc	0.760249
2	collectri	precision	0.007339
3	collectri	recall	0.315412
4	collectri	1-qrank	0.728525
5	collectri	-log10(pval)	39.321597
0	dorothea	auroc	0.687706
1	dorothea	auprc	0.721169
2	dorothea	precision	0.010575
3	dorothea	recall	0.359813
4	dorothea	1-qrank	0.692722
5	dorothea	-log10(pval)	21.809839
0	unw_collectri	auroc	0.658961
1	unw_collectri	auprc	0.705306
2	unw_collectri	precision	0.005095
3	unw_collectri	recall	0.286738
4	unw_collectri	1-qrank	0.674280
5	unw_collectri	-log10(pval)	23.350754
0	unw_dorothea	auroc	0.626609
1	unw_dorothea	auprc	0.678251

Results across the different metrics can then be visualized. Here is the plot for the auc metric.

dc.bm.pl.auc(df, hue="net")

../../_images/f9e038a7c76e1cc2aecd08f48bfc0886e3189e4db96599842e236047e99ab2bd.png

Similarly, here is the plot for the fscore metric.

dc.bm.pl.fscore(df, hue="net")

../../_images/56f155924edcba269d422736f42a2c82e5ef1635a382441de6fc2c8fc85c31c4.png

And for the qrank metric.

dc.bm.pl.qrank(df, hue="net")

../../_images/00e7f7c40d0717933875deb2b70f03b89da30141d366f72f91b2f15c1b40f2b8.png

As expected, the random networks are unable to distinguish between perturbed and unperturbed TFs based on the obtained enrichment scores.

Next, the summary statistics can be generated.

hdf = dc.bm.metric.hmean(df)
hdf

	net	auprc	auroc	H(auroc, auprc)	precision	recall	F-score	-log10(pval)	1-qrank	H(1-qrank, -log10(pval))	score
0	collectri	0.760249	0.719133	0.751654	0.007339	0.315412	0.009120	39.321597	0.728525	3.391298	1.000000
1	dorothea	0.721169	0.687706	0.714218	0.010575	0.359813	0.013123	21.809839	0.692722	3.073169	0.897834
2	r_collectri	0.553499	0.536420	0.549997	0.002139	0.075269	0.002655	2.311545	0.545284	1.402780	0.361641
3	r_dorothea	0.523412	0.495025	0.517477	0.005282	0.103286	0.006519	0.161909	0.491971	0.187001	0.000000
4	unw_collectri	0.705306	0.658961	0.695523	0.005095	0.286738	0.006340	23.350754	0.674280	3.022308	0.875649
5	unw_dorothea	0.678251	0.626609	0.667253	0.008210	0.317757	0.010196	12.184879	0.641430	2.649298	0.760155

And visualized.

dc.bm.pl.bar(df=hdf, x="H(auroc, auprc)", y="net", hue="net")
dc.bm.pl.bar(df=hdf, x="F-score", y="net", hue="net")
dc.bm.pl.bar(df=hdf, x="H(1-qrank, -log10(pval))", y="net", hue="net")

../../_images/4ff9e4b620c770829ab8bc49418b6192ebc15324c2b0d4aa38ef6b5c0be06345.png

../../_images/06b70a6eabedc366f00c5a3b3ad75abf160978667b28096c0500d074c803e185.png

../../_images/32354291efe78a2ed12e2bc313402adec02c999c2b23bcd4dbe470f3c61e25eb.png

Finally, the individual summary metrics can be aggregated into a single overall score, calculated as the quantile-normalized rank of their mean.

dc.bm.pl.summary(df=hdf, y="net", figsize=(7, 3))

../../_images/376ac7e0a111ae6c710932ec09cb3c9c5ea959614c71bb1e94029f836e67fb31.png

As expected, either removing the mode of regulation (weight) or shuffling target genes leads to a decrease in evaluation performance.

Benchmarking pipeline

Contents

Benchmarking pipeline#

Evaluation metrics#

Area Under the Curve#

Precision and Recall (F-Score)#

Quantile Normalized Ranks#

Loading Packages#

Evaluation of methods#

Loading the dataset#

Evaluation#

Evaluation of Networks#

Loading the Dataset#

Loading the Network#

Evaluation#