Recent technologies allow us to measure thousands of transcripts, proteins or metabolites per sample. Typically, there are far fewer samples than measured features. This severely limits the statistical power of analyses such as linking molecular features to phenotypes or to the efficacy of a therapy.
A strategy to address this issue is to summarise these data sets via a smaller number of features. We develop approaches to extract such features in a way that they can be directly mapped to biochemically distinct elements (see Fig.). This is a powerful strategy to 1) reduce the number of variables and 2) yield interpretable biological features that can be experimentally validated.
Genomic data such as somatic mutations in cancer can be condensed into information about which pathways are mutated in a statistically significant way. We developed the tool SLAPEnrich (Iorio et al. 2018) to perform such analyses.
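To illustrate the general idea of condensing a list of mutated genes into pathway-level information, the following minimal sketch runs a generic hypergeometric over-representation test. It is not SLAPEnrich's own statistical model; the function names and the toy gene identifiers are hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def pathway_enrichment(mutated_genes, pathway_genes, background_genes):
    """Return (overlap, p-value) for mutated genes falling into a pathway.

    Assumption: every background gene is equally likely to be mutated,
    which ignores gene length and mutation rate.
    """
    background = set(background_genes)
    mutated = set(mutated_genes) & background
    pathway = set(pathway_genes) & background
    overlap = len(mutated & pathway)
    p_value = hypergeom_sf(overlap, len(background), len(pathway), len(mutated))
    return overlap, p_value


# Toy example with hypothetical gene identifiers.
background = [f"g{i}" for i in range(20)]
pathway = ["g0", "g1", "g2", "g3", "g4"]
mutated = ["g0", "g1", "g2", "g10"]
print(pathway_enrichment(mutated, pathway, background))
```

In practice the p-values obtained for many pathways would also need to be corrected for multiple testing.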
In the case of gene expression, a number of functional features can be extracted. A common approach is to map gene expression onto the corresponding genes of given pathways. However, the correlation between gene expression and the function of the corresponding protein is often poor. An alternative strategy is to consider genes known to be affected by perturbation of pathways, so-called pathway footprints (Dugourd and Saez-Rodriguez 2019). For this purpose we developed the tool PROGENy, comprising footprints for 14 signaling pathways (Schubert et al. 2018; Holland, Szalai, and Saez-Rodriguez 2019).
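The footprint idea can be sketched as a weighted sum: each pathway has a set of responsive genes with signed weights, and a sample's pathway score is the weighted sum of the expression of those genes. The sketch below assumes this simple linear form; the function name, pathway names, genes and weights are made up for illustration and are not the actual PROGENy model or its interface.

```python
def footprint_scores(expression, weights):
    """Score pathways from the expression of their footprint genes.

    expression: {gene: expression value} for one sample
    weights:    {pathway: {gene: signed weight}} (hypothetical footprints)
    Genes missing from the sample are skipped.
    """
    scores = {}
    for pathway, gene_weights in weights.items():
        scores[pathway] = sum(
            weight * expression[gene]
            for gene, weight in gene_weights.items()
            if gene in expression
        )
    return scores


# Toy example: gene "Z" is not measured in the sample and is ignored.
sample = {"A": 2.0, "B": -1.0, "C": 0.5}
toy_weights = {
    "PathwayX": {"A": 1.0, "B": -2.0, "Z": 5.0},
    "PathwayY": {"C": 4.0},
}
print(footprint_scores(sample, toy_weights))
```

Scores computed this way are only comparable across samples after suitable normalization, for example by scaling each pathway score across the sample set.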
Additionally, it is possible to use gene expression data to estimate the activity status of transcription factor (TF) proteins by looking at the mRNA levels of their direct targets (TF footprint). DoRothEA is a resource containing TF-target interactions integrated from different types of evidence. Using those interactions with statistical enrichment analysis allows the estimation of TF activities from gene expression data (Garcia-Alonso et al. 2018).
We have shown that the footprints of signaling pathways and TFs on gene expression are evolutionarily conserved between humans and the widely used model organism Mus musculus (Holland, Szalai, and Saez-Rodriguez 2019). This opens up the possibility of functionally characterizing mouse data, in addition to human data, using PROGENy and DoRothEA.
With the emergence of single-cell RNA-seq (scRNA-seq) technologies, many corresponding datasets are being generated, from which functional insights such as pathway and TF activity could be gained at the single-cell level. However, it is not clear how well existing tools can handle the unique challenges of this data type (e.g. drop-out events, low library size). We have shown that our footprint-based tools PROGENy and DoRothEA, originally developed for bulk data, can be applied to scRNA-seq data and outperform dedicated single-cell tools (Holland et al. 2019).
Once their activities have been estimated, transcription factor and signaling pathway signatures can be linked to upstream perturbations (drugs, ligands, mutations, etc.) using network-based approaches such as so-called causal reasoning methods. Here, we use the signed, directed interactions in Omnipath as a prior knowledge network, and we use an integer linear programming (ILP) formulation to infer the topology of the regulatory signaling network from gene expression data (Melas et al. 2015). The pipeline also integrates TF and pathway scores from DoRothEA and PROGENy for network contextualization, and we packaged the whole framework as an R package called CARNIVAL (Liu, Trairatphisan, Gjerga et al. 2019).
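The flavour of causal reasoning on a signed, directed network can be sketched on a toy example. The code below is not the CARNIVAL ILP formulation: it replaces the ILP solver with a brute-force search over node states, assumes a small acyclic network, and uses hypothetical node names and a hypothetical cost function (mismatch with the measured activities plus a small penalty on network size).

```python
from itertools import product


def explain_measurements(edges, perturbation, measurements, size_penalty=0.1):
    """Search node states in {-1, 0, +1} that best explain the measurements.

    edges:        list of (source, target, sign), sign in {+1, -1}
    perturbation: {node: state} with fixed, known input states
    measurements: {node: sign} observed activity signs (e.g. TF activities)
    Assumption: the network is acyclic, so no loop can support itself.
    """
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    free = [n for n in nodes if n not in perturbation]
    best_state, best_cost = None, None
    for values in product((-1, 0, 1), repeat=len(free)):
        state = dict(perturbation)
        state.update(zip(free, values))
        # An active node must be explained by at least one incoming edge.
        consistent = all(
            state[n] == 0
            or any(t == n and state[s] * sign == state[n] for s, t, sign in edges)
            for n in free
        )
        if not consistent:
            continue
        mismatch = sum(abs(state.get(n, 0) - m) for n, m in measurements.items())
        cost = mismatch + size_penalty * sum(abs(v) for v in values)
        if best_cost is None or cost < best_cost:
            best_state, best_cost = state, cost
    return best_state, best_cost


# Toy network: a drug inhibits kinase K1 and activates kinase K2.
toy_edges = [
    ("drug", "K1", -1),
    ("K1", "TF1", 1),
    ("K1", "TF2", -1),
    ("drug", "K2", 1),
    ("K2", "TF1", 1),
]
state, cost = explain_measurements(toy_edges, {"drug": 1}, {"TF1": -1, "TF2": 1})
print(state, cost)
```

In this toy case the search keeps K1 inhibited and leaves K2 inactive, since the K1 branch alone explains both measured TF activities. Exhaustive enumeration grows exponentially with the number of nodes, which is why an ILP solver is used for realistic networks.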
As with transcription factors, the activity of kinases can be inferred from the phosphorylation levels of the proteins they target (Wirbel, Cutillas, and Saez-Rodriguez 2018; Hernandez-Armenta et al. 2017, https://github.com/saezlab/kinact). By taking advantage of the conceptual similarities between footprint approaches, we hope to extend the use of CARNIVAL to more diverse types of omics data. We are currently working on integrating transcriptomic, phosphoproteomic and metabolomic data with this approach.
For subsequent downstream analyses, molecular entities can also be annotated with ontology terms to provide additional information. However, the coverage of these annotations is still incomplete. We therefore developed a computational pipeline that combines protein network association, text mining, and putative regulator identification from molecular footprints to mitigate this issue (Souza, Trairatphisan, Pinero et al. 2018).
Figure 1. Selected approaches to extract functional features from ‘omics’ data: pathways enriched in mutations (red); transcription factors differentially activated (brown); and pathway activities as determined by their footprints on gene expression (pink).