
Graphical Summary



Summary: Key results from an IPA Core Analysis presented as a network by the Graphical Summary algorithm

The aim of the Graphical Summary is to provide a quick overview of the major biological themes in your IPA Core Analysis and to illustrate how those concepts relate to one another. This feature selects and connects a subset of the most significant entities predicted in the analysis, creating a coherent and comprehensible synopsis of the analysis. The Graphical Summary can include entities such as canonical pathways, upstream regulators, diseases, and biological functions. The algorithm that constructs the summary uses machine learning techniques to prioritize and connect entities that are in some cases not yet connected by findings in the QIAGEN Knowledge Graph. Such inferred relationships help you visualize related biological activities.

Below is the Graphical Summary for an analysis of a set of aggressive “claudin-low” breast cancer cell lines contrasted to less aggressive luminal A type breast cancer cell lines (based on PMID 20813035). Key aspects of these claudin-low cells are that they have activated the epithelial-mesenchymal transition (EMT) through activation of specific transcription factors such as ZEB1, SNAI1, and SNAI2 and have tissue-invasive tendencies, which are well represented in the generated summary:

Inferred edges in the Graphical Summary, such as the one between CST5 and ZEB1 at the lower right, are shown with dotted lines (rather than solid or dashed) and are currently not used elsewhere in IPA. The section at the end of this article goes into detail about how the network is designed. As shown above, the analysis predicts that ZEB1 is activated (i.e., it has a positive z-score and is therefore colored orange) and that CST5 is inhibited (i.e., it has a negative z-score and is therefore blue). The inferred edge between the nodes, together with their activation states, suggests that ZEB1 activation and CST5 inhibition have a similar effect on downstream molecules.

How to change the Graphical Summary (if desired)
Currently, the nodes in the Graphical Summary can be changed in two ways: 1) by deleting nodes and/or edges manually, or 2) by re-running the algorithm with a different overall size setting.

To re-run the algorithm and change the size, click the button in the toolbar and use the size slider to increase or decrease the number of nodes appearing in the Graphical Summary. The default position is the center of the slider.


Changes made by revising the summary with the slider are saved with your analysis automatically, whereas manually deleting or repositioning nodes or edges requires you to click the Save button on the toolbar if you want to retain those changes after closing your analysis. Note that revising the summary algorithmically with the slider may bring back nodes that you deleted manually. If you wish to use Build or Overlay tools on the Graphical Summary, you will need to select all the nodes and copy/paste them into a new My Pathway. Pasted networks will not contain any of the inferred edges.

You can change the font size of the node labels in the summary by clicking this button in the toolbar and using the pull-down menu next to “Node Name Font Size”. You can change the overall layout of the nodes by clicking the Change Layout button and choosing an option. For example, the same summary is shown below with the Subcellular layout.


Overall strategy of the Graphical Summary
The algorithm is a heuristic that uses a number of factors to select and connect entities from an analysis. In brief, it is optimized to create a manageable network that brings together the most significantly activated or inhibited upstream regulators, diseases, functions, and pathways from your analysis and to present them in a way that reduces the redundancy of the predictions and is not overly connected. For Canonical Pathways, the most significant ones by p-value are allowed to participate, even if they do not have a strong z-score. For upstream regulators, only genes, mRNAs, and proteins are used (i.e., no compounds, groups, or complexes), and the algorithm takes into account the magnitude of the differential expression or phosphorylation of the corresponding gene or protein in the dataset when deciding which regulators to include. For example, if IPA has predicted MEF2C to be an activated upstream regulator, and MEF2C is highly up-regulated in your dataset, that regulator will be prioritized for inclusion in the Graphical Summary over an upstream regulator that has an equally high z-score but is only weakly differentially expressed in your dataset, or not differentially expressed at all.
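This prioritization can be sketched in Python. The specific weighting of z-score against expression magnitude below is a hypothetical illustration, not IPA's internal formula:

```python
# Illustrative sketch of regulator prioritization; the combination of
# z-score and fold change is an assumption, not IPA's actual weighting.
def prioritize_regulators(regulators):
    """Rank activated/inhibited regulators, favoring those that are
    themselves strongly differentially expressed in the dataset."""
    def priority(reg):
        z = abs(reg["z_score"])
        # |log2 fold change| defaults to 0 if the regulator is absent
        fc = abs(reg.get("log2_fc", 0.0))
        return z + fc  # hypothetical equal weighting
    return sorted(regulators, key=priority, reverse=True)

regs = [
    {"name": "MEF2C", "z_score": 3.1, "log2_fc": 2.5},
    {"name": "TF_X",  "z_score": 3.1, "log2_fc": 0.1},
]
ranked = prioritize_regulators(regs)
# MEF2C outranks TF_X despite their equal z-scores, because MEF2C is
# itself highly up-regulated in the dataset.
```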

Graphical Summary creation:

The Graphical Summary uses a precomputed table containing inferred relationships between molecules, functions, diseases, and pathways. These relationships were obtained and scored by a machine learning algorithm operating entirely on prior knowledge. Networks are constructed from the analysis results using a heuristic graph algorithm. Both the graph heuristic and the content-based machine learning approach are described in more detail below.


Graph algorithm

The graph algorithm consists of the following steps:
1. Gather top entities from the Core Analysis. Canonical pathways (aka “pathways”), upstream regulators (aka “regulators”), and functions/diseases (aka “processes”) are included and must meet the following criteria:
a. All entities have p-value <= 0.05
b. Regulators and processes have |z-score| >= 2
c. Regulators are limited to genes only (e.g., no chemicals)

2. Build a graph G in which nodes are the top entities from Step 1 and edges are inferred relationships from the machine learning model that have an effect (i.e., activation or inhibition) consistent with the pattern of activity predicted by the analysis (i.e., the z-scores of the entities in the analysis). 

3. Build an induced subgraph G’ from G consisting of:

a. Only entities involved in the top N inferred edges (ranked by score)
b. All inferred edges connecting those entities

The score of an inferred edge combines the normalized, weighted values of the following:     
1) The edge’s Machine Learning (ML)-precomputed score
2) The predicted z-scores of the edge’s two endpoints
3) The log of the differential expression of the regulator in the dataset (if it is present in the dataset).

4. Trim edges from G’, keeping only those edges that are part of the maximum spanning tree, as determined using Kruskal’s algorithm.

5. Where possible, support inferred edges in G’ with curated findings from the literature (i.e., edges from the QIAGEN Knowledge Graph) and connect additional node pairs with such findings until reaching a specified density (i.e., the percentage of all possible connected node pairs).

6. If not already present, add top pathways and regulators if they can be directly connected to entities in the network.
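The first four steps above can be sketched as follows. The data structures, thresholds, and the weights in `edge_score` are assumptions for illustration, not IPA's implementation:

```python
# Hypothetical sketch of the graph heuristic (steps 1-4); the entity and
# edge dictionaries, and the edge_score weights, are invented.

def top_entities(entities):
    """Step 1: keep entities meeting the significance criteria."""
    keep = []
    for e in entities:
        if e["p_value"] > 0.05:
            continue
        if e["kind"] in ("regulator", "process") and abs(e.get("z_score", 0)) < 2:
            continue
        if e["kind"] == "regulator" and not e.get("is_gene", True):
            continue  # regulators are limited to genes
        keep.append(e)
    return keep

def edge_score(edge, weights=(0.5, 0.3, 0.2)):
    """Step 3: combine (1) the ML-precomputed score, (2) the endpoint
    z-scores, and (3) |log fold change| of the regulator, assuming the
    inputs are already normalized; the weights are hypothetical."""
    w1, w2, w3 = weights
    return (w1 * edge["ml_score"]
            + w2 * (abs(edge["src_z"]) + abs(edge["dst_z"])) / 2
            + w3 * abs(edge.get("log_fc", 0.0)))

def induced_subgraph(edges, n):
    """Steps 2-3: keep only entities involved in the top-N scored inferred
    edges, plus all inferred edges connecting those entities."""
    top = sorted(edges, key=lambda e: e["score"], reverse=True)[:n]
    nodes = {u for e in top for u in (e["src"], e["dst"])}
    return nodes, [e for e in edges if e["src"] in nodes and e["dst"] in nodes]

def max_spanning_tree(nodes, edges):
    """Step 4: Kruskal's algorithm, taking edges in descending score
    order, yields a maximum spanning tree (union-find for cycle checks)."""
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    tree = []
    for e in sorted(edges, key=lambda e: e["score"], reverse=True):
        ru, rv = find(e["src"]), find(e["dst"])
        if ru != rv:
            parent[ru] = rv
            tree.append(e)
    return tree
```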

Content-based machine learning

The algorithm is based on the idea that the pattern of changes in gene or protein expression or phosphorylation caused by a given upstream molecule encodes that molecule's effect(s) on biological function. For example, if two genes A and B have a similar effect on the expression of a set of downstream genes, and we know from the literature that A impacts a particular disease in a certain direction, then we can infer that gene B might have a similar impact on that disease.

In the machine-learning model, molecules, functions, diseases, and pathways are encoded as “embeddings” in an N-dimensional vector space (here: N=100). These high-dimensional embeddings reflect complex relationships between entities and are obtained from a model that is “trained” on observed cause-effect relationships from the literature.

Gene (and other molecule) embeddings are computed from literature-derived gene/protein expression or phosphorylation relationships using a low-rank matrix approximation as indicated in the diagram below (items a, b). These embeddings are in turn used as feature vectors in a linear regression model trained on signed molecular-function effects from the literature (items c, d in the diagram). Normalized N-dimensional parameter vectors from this linear model then serve as embedding vectors for functions. The same approach is also applied to compute embeddings for pathways.
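A toy version of this construction can be written with NumPy. The matrix, embedding dimension, and effect values below are invented for illustration (the real model uses N=100 and literature-scale relationship data):

```python
import numpy as np

# (a, b) Signed molecule-by-target relationship matrix -> low-rank
# factors via truncated SVD; the rows and values are invented.
M = np.array([
    [ 1.0,  1.0, -1.0, 0.0],   # molecule A's effects on 4 downstream genes
    [ 1.0,  1.0, -1.0, 0.0],   # molecule B behaves like A
    [-1.0,  0.0,  1.0, 1.0],   # molecule C behaves differently
])
k = 2  # toy embedding dimension (N=100 in the real model)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
mol_emb = U[:, :k] * s[:k]          # one k-dimensional embedding per molecule

# (c, d) Regress known signed molecule->function effects on the molecule
# embeddings; the normalized parameter vector of the fitted linear model
# then serves as the function's embedding.
effects = np.array([1.0, 1.0, -1.0])   # A and B activate the function, C inhibits it
w, *_ = np.linalg.lstsq(mol_emb, effects, rcond=None)
func_emb = w / np.linalg.norm(w)
```

Molecules with similar downstream effect patterns (A and B above) end up with similar embeddings, so a function learned mostly from A will also score highly against B.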

Diagram  Description automatically generated

The vectors associated with each entity, whether molecule, process, or pathway, can be used to compute similarity. In this way, for example, an inferred process-to-process edge can be created between two similar processes, or a process-to-pathway edge can be created between a process and a pathway with similar vectors. Similarity (or anti-similarity) scores are computed as the scalar product of two embedding vectors (also called cosine similarity), and inferred edges are created by applying a suitable (absolute) cutoff. This also applies to inferred molecule-function relationships, but in that case the predicted relationships can also be interpreted as causal.
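A minimal sketch of this similarity-and-cutoff step, with invented three-dimensional embeddings and an illustrative cutoff of 0.8:

```python
import math

# Illustrative inferred-edge creation from embeddings; the entity names,
# vectors, and cutoff are invented, and real scores come from the model.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def infer_edges(embeddings, cutoff=0.8):
    """Create an inferred edge between every pair of entities whose
    embeddings are strongly similar (consistent activity) or strongly
    anti-similar (opposing activity)."""
    names = list(embeddings)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            s = cosine(embeddings[a], embeddings[b])
            if abs(s) >= cutoff:
                edges.append((a, b, "similar" if s > 0 else "anti-similar", s))
    return edges

emb = {
    "EMT":       [ 0.9,  0.1,  0.4],
    "invasion":  [ 0.8,  0.2,  0.5],   # points the same way as EMT
    "apoptosis": [-0.9, -0.1, -0.4],   # points opposite to EMT
}
edges = infer_edges(emb)
```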