decoupler.mt.mlm

decoupler.mt.mlm = <decoupler._Method.Method object>

Multivariate Linear Model (MLM) [BiMVSB+22].

This approach uses the molecular features from one observation as the population of samples and fits a linear model with multiple covariates, which are the weights of all feature sets $F$.

$$y^i = \beta_0 + \beta_1 x_{1}^{i} + \beta_2 x_{2}^{i} + \cdots + \beta_p x_{p}^{i} + \varepsilon$$

Where:

  • $y^i$ is the observed feature statistic (e.g. gene expression, $log_{2}FC$, etc.) for feature $i$

  • $x_{p}^{i}$ is the weight of feature $i$ in feature set $F_p$. For unweighted sets, membership in the set is indicated by 1, and non-membership by 0.

  • $\beta_0$ is the intercept

  • $\beta_p$ is the slope coefficient for feature set $F_p$

  • $\varepsilon$ is the error term for feature $i$
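The model above can be sketched for a single observation with ordinary least squares. This is a minimal NumPy illustration, not decoupler's internal implementation; the arrays `y` and `X` hold made-up values and every variable name is hypothetical.

```python
import numpy as np

# Hypothetical toy data: 6 features and 2 feature sets (all values made up).
y = np.array([2.0, 1.5, 1.0, -1.0, -1.5, -2.0])  # feature statistics y^i
X = np.array([                                    # weights x_p^i, one column per set
    [1.0, 0.0],
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
    [0.0, -1.0],
    [0.0, 0.0],
])

# Prepend a column of ones so that beta[0] plays the role of the intercept.
A = np.column_stack([np.ones(len(y)), X])

# Ordinary least squares: minimises ||y - A @ beta||^2.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Residuals are the estimated error term for each feature.
residuals = y - A @ beta

print("intercept:", beta[0])
print("slopes:", beta[1:])
```

Each row of `X` is a feature and each column a feature set, so one fit estimates the slopes of all feature sets jointly.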

Multivariate Linear Model (MLM) scheme. In this example, the observed gene expression of $Sample_1$ is predicted using the interaction weights of two pathways, $P_1$ and $P_2$. For $P_2$, since its target genes that have negative weights are lowly expressed, and its positive target genes are highly expressed, the relationship between the two variables is positive so the obtained $ES$ score is positive. Scores can be interpreted as active when positive, repressive when negative, and inconclusive when close to 0.

The enrichment score $ES$ for each $F$ is then calculated as the t-value of the slope coefficients.

$$ES = t_{\beta_1} = \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)}$$

Where:

  • $t_{\beta_1}$ is the t-value of the slope

  • $\mathrm{SE}(\hat{\beta}_1)$ is the standard error of the slope
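The t-value of each slope can be sketched with the textbook OLS formulas: the standard errors are the square roots of the diagonal of $\hat{\sigma}^2 (A^\top A)^{-1}$, where $A$ is the design matrix with an intercept column. The helper `ols_t_values` and the simulated data are hypothetical, and the residual degrees of freedom (features minus feature sets minus one) are an assumption of this sketch rather than a statement about decoupler's internals.

```python
import numpy as np


def ols_t_values(X, y):
    """Fit y ~ intercept + X by OLS and return (beta, t_values, df)."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # coefficient estimates
    resid = y - A @ beta
    df = n - p - 1                                # residual degrees of freedom (assumed)
    sigma2 = resid @ resid / df                   # unbiased error variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return beta, beta / se, df


# Simulated data: only the first feature set truly drives y.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=50)

beta, t, df = ols_t_values(X, y)
es = t[1:]  # drop the intercept: one score per feature set
print(es, df)
```

The intercept's t-value is discarded because only the slopes correspond to feature sets.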

Next, $p_{value}$ are obtained by evaluating the two-sided survival function ($sf$) of the Student’s t-distribution.

$$p_{value} = 2 \times \mathrm{sf}(|ES|, \text{df})$$
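The two-sided p-value formula can be evaluated with SciPy's Student's t survival function. The scores and degrees of freedom below are made-up values chosen only for illustration.

```python
import numpy as np
from scipy import stats

es = np.array([-3.2, 0.1, 2.5])  # hypothetical enrichment scores (t-values)
df = 20                          # hypothetical degrees of freedom

# Two-sided test: double the upper-tail probability of |ES|.
pvals = 2 * stats.t.sf(np.abs(es), df)
print(pvals)
```

Taking the absolute value first makes the result symmetric, so a score of -3.2 and a score of 3.2 receive the same p-value.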
Parameters:
  • data

    anndata.AnnData instance, pandas.DataFrame, or a tuple of (matrix, samples, features). All methods assume that input values follow a normal distribution unless otherwise specified. Therefore, when working with observational count data, some form of normalization is required (e.g., scanpy’s library-size normalization followed by log1p). Using raw integer counts is not recommended, as they follow a Poisson distribution.

    Feature scaling on normalized counts is also acceptable, but note that it changes the results by assuming equal importance across features, and outcomes will vary depending on which observations are included.

    No normalization or transformation is required when using contrast-level feature statistics such as log fold changes or Wald test statistics.

  • net – Dataframe in long format. Must include source and target columns, and optionally a weight column.

  • tmin (default: 5) – Minimum number of targets per source. Sources with fewer targets will be removed.

  • layer – Layer key name of an anndata.AnnData instance.

  • raw (default: False) – Whether to use the .raw attribute of anndata.AnnData.

  • empty (default: True) – Whether to remove empty observations (rows) or features (columns).

  • bsize (default: 250000) – For large datasets in sparse format, this parameter controls how many observations are processed at once. Increasing this value speeds up computation but uses more memory.

  • verbose (default: False) – Whether to display progress messages and additional execution details.

  • tval – Whether to return the t-value (tval=True) or the coefficient of the fitted model (tval=False).
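To make the net and tmin parameters concrete, the following pandas sketch builds a small long-format network, drops sources with fewer than tmin targets present in the data, and pivots the remainder into a feature-by-source weight matrix. It illustrates the idea only and is not decoupler's own preprocessing code; the network, the feature names, and the variable names are all hypothetical.

```python
import pandas as pd

# Hypothetical long-format network with source, target and weight columns.
net = pd.DataFrame({
    "source": ["P1", "P1", "P1", "P2", "P2", "P3"],
    "target": ["G1", "G2", "G3", "G3", "G4", "G1"],
    "weight": [1.0, 1.0, 0.5, -1.0, 1.0, 1.0],
})
features = ["G1", "G2", "G3", "G4", "G5"]  # features measured in the data
tmin = 2                                   # minimum targets per source

# Keep only targets that exist in the data, then count targets per source.
net = net[net["target"].isin(features)]
n_targets = net.groupby("source")["target"].transform("count")
net = net[n_targets >= tmin]               # P3 has one target and is removed

# Pivot to a features x sources matrix; absent interactions get weight 0.
X = (
    net.pivot(index="target", columns="source", values="weight")
    .reindex(index=features)
    .fillna(0.0)
)
print(X)
```

A weight of 0 marks a feature that does not belong to a source, which matches the 0/1 membership convention for unweighted sets.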

Returns:

Enrichment scores $ES$ and, if applicable, $p_{value}$ adjusted by the Benjamini-Hochberg procedure.
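The Benjamini-Hochberg adjustment can be sketched in a few lines of NumPy. The function name `benjamini_hochberg` is hypothetical, and which p-values are adjusted together (for example, all sources within one observation) is an assumption not stated here.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)                            # ascending p-values
    scaled = p[order] * n / np.arange(1, n + 1)      # p * n / rank
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)      # restore original order
    return adjusted


padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(padj)
```

Adjusted values are never smaller than the raw p-values and preserve their ordering.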

Example

import decoupler as dc

# Load the toy dataset and its matching network
adata, net = dc.ds.toy()
# Run MLM, keeping sources with at least 3 targets
dc.mt.mlm(adata, net, tmin=3)