Recent technologies allow us to measure thousands of transcripts, proteins, or metabolites per sample. There are typically far fewer samples than measured features, which severely limits the statistical power of analyses such as linking molecular features to phenotypes or to the efficacy of a therapy.
A strategy to address this issue is to summarise these data sets via a smaller number of features. We develop approaches to extract such features in a way that they can be directly mapped to biochemically distinct elements (see Fig.). This is a powerful strategy to 1) reduce the number of variables and 2) yield interpretable biological features that can be experimentally validated.
Genomic data such as somatic mutations in cancer can be condensed into information about which pathways are mutated in a statistically significant way. We developed the tool SLAPEnrich (Iorio et al. 2018) to perform such analyses.
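As a rough illustration of condensing mutation data into pathway-level information, the sketch below runs a plain hypergeometric over-representation test. This is a stand-in chosen for brevity and is not claimed to be the statistical model SLAPEnrich uses; the function names, gene identifiers, and pathway definitions are hypothetical.

```python
from math import comb


def hypergeom_pvalue(n_background, n_pathway, n_mutated, n_overlap):
    """P(X >= n_overlap) when n_mutated genes are drawn at random from
    n_background genes, of which n_pathway belong to the pathway."""
    total = comb(n_background, n_mutated)
    upper = min(n_pathway, n_mutated)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_mutated - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def pathway_enrichment(mutated_genes, pathways, background):
    """Return {pathway name: p-value} for over-representation of mutated genes."""
    background = set(background)
    mutated = set(mutated_genes) & background
    results = {}
    for name, members in pathways.items():
        members = set(members) & background
        overlap = mutated & members
        results[name] = hypergeom_pvalue(
            len(background), len(members), len(mutated), len(overlap)
        )
    return results


# Toy input (all identifiers are made up for illustration).
background = [f"g{i}" for i in range(20)]
pathways = {
    "A": ["g0", "g1", "g2", "g3", "g4"],
    "B": ["g10", "g11", "g12", "g13", "g14"],
}
mutated = ["g0", "g1", "g2", "g3"]
enrichment = pathway_enrichment(mutated, pathways, background)
```

In this toy input, pathway A contains all four mutated genes and receives a small p-value, while pathway B contains none. A real analysis would also need to account for gene length, background mutation rates, and multiple testing.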
In the case of gene expression, a number of functional features can be extracted. A common approach is to map gene expression onto the corresponding genes of given pathways. However, the correlation between gene expression and the function of the corresponding protein is often poor. An alternative strategy is to consider genes known to be affected by perturbation of pathways (so-called pathway footprints; Dugourd and Saez-Rodriguez 2019). For this purpose we developed the tool PROGENy, comprising footprints for 14 signaling pathways (Schubert et al. 2018; Holland, Szalai, and Saez-Rodriguez 2019).
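The footprint idea can be sketched as a weighted sum: each pathway is represented by weights on the genes that respond to its perturbation, and a sample's pathway score is the sum of its expression values multiplied by those weights. The snippet below is a minimal sketch under that assumption; it does not use the PROGENy package, and the gene names and weights are invented for illustration.

```python
def footprint_scores(expression, footprints):
    """Score each pathway as the weighted sum of footprint gene expression.

    expression: dict mapping gene -> expression value for one sample
    footprints: dict mapping pathway -> {gene: weight}
    Footprint genes missing from the sample are skipped.
    """
    scores = {}
    for pathway, weights in footprints.items():
        shared = [gene for gene in weights if gene in expression]
        scores[pathway] = sum(weights[gene] * expression[gene] for gene in shared)
    return scores


# Hypothetical sample and footprint (not real PROGENy weights).
sample = {"GENE_A": 2.0, "GENE_B": -1.0, "GENE_C": 0.5}
footprints = {"PathwayX": {"GENE_A": 1.0, "GENE_B": -2.0, "GENE_D": 3.0}}
scores = footprint_scores(sample, footprints)
```

Here GENE_B has a negative weight and is down-regulated, so it contributes positively to the PathwayX score. In practice, scores are usually scaled across samples before being compared.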
Additionally, it is possible to use gene expression data to estimate the activity status of transcription factor (TF) proteins by looking at the mRNA levels of their direct targets (TF footprint). DoRothEA is a resource containing TF-target interactions integrated from different types of evidence. Using those interactions with statistical enrichment analysis allows the estimation of TF activities from gene expression data (Garcia-Alonso et al. 2019). We have shown that the footprints of signaling pathways and TFs on gene expression are evolutionarily conserved between humans and the widely used model organism Mus musculus (Holland, Szalai, and Saez-Rodriguez 2019). This opens up the possibility of functionally characterizing mouse data, in addition to human data, using PROGENy and DoRothEA.
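To make the TF footprint idea concrete, the sketch below scores a TF by the signed mean of gene-level statistics over its targets and compares that score against random gene sets of the same size. This simple permutation test is a stand-in for the enrichment methods typically used with DoRothEA, not a reimplementation of them; the regulon, gene names, and statistics are hypothetical.

```python
import random
from statistics import mean


def tf_activity(gene_stats, regulon, n_perm=1000, seed=0):
    """Estimate TF activity from the statistics of its direct targets.

    gene_stats: dict mapping gene -> differential expression statistic
    regulon: dict mapping target gene -> +1 (activation) or -1 (repression)
    Returns (signed mean over targets, empirical two-sided p-value).
    """
    targets = [gene for gene in regulon if gene in gene_stats]
    if not targets:
        raise ValueError("no regulon targets found in gene_stats")
    signs = [regulon[gene] for gene in targets]
    observed = mean(s * gene_stats[g] for s, g in zip(signs, targets))

    # Null distribution: same signs applied to randomly drawn genes.
    rng = random.Random(seed)
    values = list(gene_stats.values())
    hits = 0
    for _ in range(n_perm):
        sampled = rng.sample(values, len(targets))
        null = mean(s * v for s, v in zip(signs, sampled))
        if abs(null) >= abs(observed):
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical statistics: three strongly changed targets plus background genes.
gene_stats = {f"bg{i}": 0.1 * ((i % 3) - 1) for i in range(20)}
gene_stats.update({"T1": 3.0, "T2": 2.5, "T3": -3.0})
regulon = {"T1": 1, "T2": 1, "T3": -1, "NOT_MEASURED": 1}
activity, p_value = tf_activity(gene_stats, regulon)
```

The repressed target T3 is down-regulated, so it supports a positive activity estimate just as the up-regulated activated targets T1 and T2 do.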
With the emergence of single-cell RNA-seq technologies, many corresponding datasets are being generated, from which functional insights such as pathway and TF activity could be gained at the single-cell level. However, it was not clear how well existing tools can handle the unique challenges of this data type (e.g. drop-out events, low library size). We have shown that our bulk- and footprint-based tools PROGENy and DoRothEA can be applied to scRNA-seq data, partially outperforming dedicated single-cell tools (Holland et al. 2020).
Once their activities have been estimated, transcription factor and signaling pathway signatures can be linked to upstream perturbations (drugs, ligands, mutations, etc.) using network-based approaches such as so-called causal reasoning methods. Here, we use the signed, directed interactions in Omnipath as a prior knowledge network and an integer linear programming (ILP) formulation to infer the topology of the regulatory signaling network from gene expression data. The pipeline also integrates TF and pathway scores from DoRothEA and PROGENy for network contextualization. We compiled the whole framework as a Bioconductor package called CARNIVAL (Liu et al. 2019).
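The core of causal reasoning on a signed, directed network can be illustrated with a much simpler procedure than an ILP: propagate the sign of a perturbation along edges and keep the routes whose predicted sign matches the estimated TF activity. The breadth-first search below is a toy stand-in under that assumption. It is not the CARNIVAL formulation or its API, and the network, node names, and measurements are invented.

```python
from collections import deque


def consistent_paths(edges, source, source_sign, measured):
    """Find, for each measured TF, a shortest sign-consistent walk from source.

    edges: list of (src, tgt, sign) with sign +1 (activation) or -1 (inhibition)
    source_sign: +1 if the perturbation activates its target node, -1 if it inhibits
    measured: dict mapping TF -> observed activity sign (+1 or -1)
    Returns {TF: list of (node, predicted sign)} or None if no explanation exists.
    """
    graph = {}
    for src, tgt, sign in edges:
        graph.setdefault(src, []).append((tgt, sign))

    # Breadth-first search over (node, sign) states.
    start = (source, source_sign)
    parent = {start: None}
    queue = deque([start])
    while queue:
        node, sign = queue.popleft()
        for tgt, edge_sign in graph.get(node, []):
            state = (tgt, sign * edge_sign)
            if state not in parent:
                parent[state] = (node, sign)
                queue.append(state)

    explanations = {}
    for tf, tf_sign in measured.items():
        state = (tf, tf_sign)
        if state not in parent:
            explanations[tf] = None
            continue
        path = []
        while state is not None:
            path.append(state)
            state = parent[state]
        explanations[tf] = path[::-1]
    return explanations


# Hypothetical prior knowledge network and TF activity signs.
edges = [
    ("DRUG", "K1", -1),
    ("K1", "TF_A", 1),
    ("K1", "K2", 1),
    ("K2", "TF_B", -1),
]
measured = {"TF_A": -1, "TF_B": 1, "TF_C": 1}
explanations = consistent_paths(edges, "DRUG", 1, measured)
```

Unlike this search, which explains each TF independently, an optimisation-based formulation selects a single subnetwork that jointly explains all measurements, which is the reason for using ILP in the actual pipeline.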
In a similar manner to transcription factors, the activity of kinases can be inferred from the phosphorylation levels of the proteins they target (Wirbel, Cutillas, and Saez-Rodriguez 2018; Hernandez-Armenta et al. 2017; https://github.com/saezlab/kinact). By taking advantage of the conceptual similarities between footprint approaches, we hope to expand the use of CARNIVAL to more diverse types of omic data. We are currently working on integrating transcriptomic, phosphoproteomic, and metabolomic data with this approach.
Figure. Selected approaches to extract functional features from ‘omics’ data: pathways enriched in mutations (red); transcription factors differentially activated (brown); and pathway activities as determined by their footprints on gene expression (pink).
You can watch a summary of our activities in this area, in the context of single-cell and spatial data, here.