decoupler.mt.gsva

decoupler.mt.gsva = <decoupler._Method.Method object>

Gene Set Variation Analysis (GSVA) [HanzelmannCG13].

Each feature is first transformed and smoothed using a kernel density estimation method:

  • Gaussian

  • Poisson

  • Empirical cumulative distribution function
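As a rough illustration of the smoothing step, the sketch below estimates a per-feature cumulative distribution in two ways: with a Gaussian kernel centred on every observation, and with the plain empirical CDF. The function names and the `bandwidth` argument are hypothetical and chosen for this example only; this is not decoupler's implementation, and the bandwidth rule the library applies is not stated here.

```python
import math

import numpy as np

# Vectorised error function, used to build the standard normal CDF.
_erf = np.vectorize(math.erf)


def gaussian_kernel_cdf(x, bandwidth):
    """Kernel CDF estimate of one feature, evaluated at each observation.

    `bandwidth` is a hypothetical smoothing parameter for this sketch.
    """
    x = np.asarray(x, dtype=float)
    # Pairwise standardised differences between observations.
    z = (x[:, None] - x[None, :]) / bandwidth
    # Average the normal CDF centred on every observation.
    return (0.5 * (1.0 + _erf(z / math.sqrt(2.0)))).mean(axis=1)


def empirical_cdf(x):
    """Fraction of observations less than or equal to each value."""
    x = np.asarray(x, dtype=float)
    return (x[None, :] <= x[:, None]).mean(axis=1)


values = [1.0, 2.0, 3.0, 4.0]
print(gaussian_kernel_cdf(values, bandwidth=0.5))
print(empirical_cdf(values))
```

Both estimates map each observation of a feature to a value between 0 and 1, which puts features with different dynamic ranges on a common scale before ranking.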

Features are then ranked based on a continuous metric (e.g., expression value, score, or correlation).

Then, a score for each feature set is computed by walking down the ranked list, increasing a running-sum statistic when a feature belongs to the set and decreasing it otherwise.

$$\delta(F, i) = \begin{cases} \frac{|r_i|}{\sum\limits_{j \in F} |r_j|} & \text{if feature } i \in F \\ -\frac{1}{l} & \text{if feature } i \notin F \end{cases}$$

Where:

  • $F$ is a feature set

  • $r$ is the ranking of the feature statistics in descending order

  • $r_i$ is the value for feature $i$

  • $r_j$ is the value for feature $j$ in $F$

  • $k$ is the number of features in $F$

  • $N$ is the total number of features in $r$

  • $l = N - k$ is the number of features not in $F$ but present in $r$

For each feature, the function $\delta(F, i)$ is applied and stored as a sequence $L$.

$$L = \delta(F, i) \text{ for } i = 1, 2, \ldots, N$$

The enrichment score $ES$ is computed as the sum of the maximum positive and maximum negative deviations of the running-sum statistic from zero.

$$ES = \max_{1 \leq i \leq N} L_i + \min_{1 \leq i \leq N} L_i$$

This method does not perform statistical testing on $ES$ and therefore does not return $p_{value}$.
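The random walk described above can be sketched with NumPy. This is a minimal illustration, not decoupler's implementation: it assumes the features are already sorted by rank and that the running-sum statistic is the cumulative sum of the per-feature steps. The function names `delta_steps` and `enrichment_score` are hypothetical.

```python
import numpy as np


def delta_steps(r, in_set):
    """Per-feature step: weighted increment for set members, -1/l otherwise.

    `r` holds the feature values in ranked order; `in_set` flags set members.
    """
    r = np.asarray(r, dtype=float)
    in_set = np.asarray(in_set, dtype=bool)
    k = in_set.sum()          # number of features in the set
    l = r.size - k            # number of features outside the set
    hit = np.abs(r) / np.abs(r[in_set]).sum()
    return np.where(in_set, hit, -1.0 / l)


def enrichment_score(r, in_set):
    """Sum of the largest positive and largest negative deviations.

    Assumption: the running sum is the cumulative sum of the steps.
    """
    walk = np.cumsum(delta_steps(r, in_set))
    return walk.max() + walk.min()


ranked_values = [4.0, 3.0, 2.0, 1.0]
print(enrichment_score(ranked_values, [True, True, False, False]))
print(enrichment_score(ranked_values, [False, False, True, True]))
```

In this sketch, a set whose members sit at the top of the ranking gets a positive score and a set whose members sit at the bottom gets a negative one.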

Parameters:
  • data

    anndata.AnnData instance, pandas.DataFrame, or a tuple of (matrix, samples, features). All methods assume that input values follow a normal distribution unless otherwise specified. Therefore, when working with observational count data, some form of normalization is required (e.g., scanpy’s library-size normalization followed by log1p). Using raw integer counts is not recommended, as they follow a Poisson distribution.

    Feature scaling on normalized counts is also acceptable, but note that it changes the results by assuming equal importance across features, and outcomes will vary depending on which observations are included.

    No normalization or transformation is required when using contrast-level feature statistics such as log fold changes or Wald test statistics.

  • net – Dataframe in long format. Must include source and target columns, and optionally a weight column.

  • tmin (default: 5) – Minimum number of targets per source. Sources with fewer targets will be removed.

  • layer – Layer key name of an anndata.AnnData instance.

  • raw (default: False) – Whether to use the .raw attribute of anndata.AnnData.

  • empty (default: True) – Whether to remove empty observations (rows) or features (columns).

  • bsize (default: 250000) – For large datasets in sparse format, this parameter controls how many observations are processed at once. Increasing this value speeds up computation but uses more memory.

  • verbose (default: False) – Whether to display progress messages and additional execution details.

  • kcdf – Which kernel to use during the non-parametric estimation of the cumulative distribution function. Options are gaussian, poisson or None. The default is gaussian.

  • mx_diff – Changes how the enrichment statistic (ES) is calculated. If True (default), ES is calculated as the difference between the maximum positive and negative random walk deviations. If False, ES is calculated as the maximum deviation of the random walk from 0.

  • abs_rnk (bool) – Used when mx_diff=True. If False (default), the enrichment statistic (ES) is calculated taking the magnitude difference between the largest positive and negative random walk deviations. If True, feature sets with features enriched on either extreme (high or low) will be regarded as ‘highly’ activated.
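To make the net and tmin parameters concrete, the sketch below builds a small long-format network with pandas and drops sources that have fewer than tmin targets. The helper `filter_by_tmin` and the toy source and target names are hypothetical. Counting only targets that are present in the data is an assumption of this example rather than documented behaviour.

```python
import pandas as pd

# Long-format network: one row per source-target pair, with an optional weight.
net = pd.DataFrame(
    {
        "source": ["S1", "S1", "S1", "S2", "S2"],
        "target": ["G1", "G2", "G3", "G2", "G4"],
        "weight": [1.0, 1.0, 1.0, 1.0, 1.0],
    }
)


def filter_by_tmin(net, features, tmin=5):
    """Keep sources with at least `tmin` targets among `features`.

    Assumption: only targets present in the data count towards `tmin`.
    """
    kept = net[net["target"].isin(features)]
    # Number of distinct targets per source, aligned to the rows of `kept`.
    n_targets = kept.groupby("source")["target"].transform("nunique")
    return kept[n_targets >= tmin]


filtered = filter_by_tmin(net, features=["G1", "G2", "G3", "G4"], tmin=3)
print(filtered)
```

With tmin=3, only S1 survives in this toy network because S2 has two targets.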

Returns:

Enrichment scores $ES$ and, if applicable, adjusted $p_{value}$ by Benjamini-Hochberg.

Example

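import decoupler as dc

# Load the toy dataset and its matching network
adata, net = dc.ds.toy()
# Run GSVA, keeping only sources with at least 3 targets
dc.mt.gsva(adata, net, tmin=3)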