I. Introduction

Identifying causal genes for complex human diseases is a major challenge in the field of human genetics. As sequencing technologies become cheaper, it is becoming easier to collect genome scale data for large numbers of individuals. However, due to the natural variation across individuals, millions of genetic variants are identified through screening. Many of these variants are of little to no use for predicting the genetic causes of complex diseases, so finding causal genes using only gene mutations becomes very challenging.

Other studies have looked at the gene expression levels, which are indicators of how many of each gene’s mRNA transcripts are present in the cells [1]. The biggest problem with this data is that it is very noisy and depends on a variety of environmental factors, so it is not always suitable for predicting causal genes.

Recently, Mezlini et al. proposed a method that integrates gene expression and exome data to infer causal genes for complex human diseases [2]. This approach creates a hierarchical graphical model that jointly models the phenotype (disease/no disease) and gene states for all genes for each individual. A gene state is inferred from the combination of genotype (mutations) and gene expression data for each gene. This graphical model provides a complementary alternative to other approaches that use only one of the data types, i.e. exome or expression data. Combining these two results in a robust model which not only detects genes associated with the disease, but also implicates proteins that are affected in the population. However, performing inference on such a graphical model becomes a computationally intensive and time consuming task when scaled to a large number of patients.

In this paper, we propose an efficient implementation of the method scalable to thousands of individuals. Ultimately, this will help in the identification of causal genes in complex human diseases.

II. Implementation

Implemented in R, the original probabilistic graphical model developed by Mezlini et al. is a biologically motivated hierarchical factor graph [3] that uses exome variants and mRNA expression levels as predictor variables. The graph contains many high-order factors that encode probabilistic relationships between different variables. It also contains a variety of regularization factors and factors encoding prior knowledge to build a complex and powerful model. For inference, it uses a loopy belief propagation algorithm to jointly infer the marginal distributions of unobserved variables (e.g., identity of causal genes).

Time complexity for passing messages in a loopy belief propagation framework increases exponentially with the number of variables connected to the factor. Many of the high-order factors in the graphical model have several variables connected to a single variable that is only dependent on their sum. Such factors are called cardinality potentials. The original implementation uses Poisson-Binomial estimation to approximate this sum and then creates a lower order factor to pass on the messages. The time complexity for this calculation is \(O(n^2 \log(n))\).

In our more efficient implementation, we used OpenGM, a C++ template library that provides an interface to create a bipartite factor graph that connects all factors to the corresponding variables [4]. An implementation in a compiled language like C++ removes the overhead associated with an interpreted language like R. We also implemented an improved version of the loopy belief propagation algorithm, which takes the factor graph generated by OpenGM as an input and performs inference by passing messages in a specific order, allowing the algorithm to converge faster.

In addition to the implementation in C++, another optimization we made was in the evaluation of cardinality potentials. Using a divide-and-conquer approach as described by Tarlow et. al [5], we transformed the high order cardinality potential to a low order tree-structure by adding auxiliary variables. Then we framed the inference as a standard belief propagation algorithm that runs in \(O(n \log^2(n))\) time. This general class of models is called the Recursive Cardinality Model [5].

III. Analysis

To predict causal genes for a disease, the method needs to be trained on a dataset that contains individuals with and without the disease. Because the exome and expression data is not readily available for the same set of individuals, we used simulation data to compare the runtime of two implementations. Simulation data consisted of variants from 6900 genes for 400 individuals (200 with disease and 200 without the disease). We recorded a decrease of 40% in the overall runtime of the algorithm on average. Most of this is attributed to the optimization of the message passing in the cardinality potentials.

IV. Conclusion

In this paper, we proposed an efficient implementation of the graphical model that combines two complementary data types to build a powerful model for the identification of causal genes. We achieved a significant reduction in the runtime, which makes the method more scalable to larger numbers of individuals. With these improvements, this method can be used for gene identification as applied to real datasets that contain the exome and expression information for patients and healthy individuals. Consequently, we expect to be able to identify novel sets of genes associated with complex human diseases. This would allow for a better understanding of underlying disease mechanisms, which could later be used to design better drugs for treatment.

References

[1] Anders, Simon and Huber, Wolfgang (2010). Differential expression analysis for sequence count data. , 11(10), R106.

[2] Mezlini, A. M. and Fuligni, F. and Shlien, A. and Goldenberg, A.(2015). Combining exome and gene expression datasets in one graphical model of disease to empower the discovery of disease mechanisms.

[3] Frey, Brendan J and Kschischang, Frank R and Loeliger, Hans-Andrea and Wiberg, Niclas (1997, September). Factor graphs and algorithms.. In Proceedings of the Annual Allerton Conference on Communication Control and Computing Vol. 35, pp. 666-680). UNIVERSITY OF ILLINOIS.

[4] Andres, B. and Beier T. and Kappes, J. H.(2012). : A C++ Library for Discrete Graphical Models.

[5] Tarlow, Daniel and Swersky, Kevin and Zemel, Richard S and Adams, Ryan P and Frey, Brendan J (2012). Fast Exact Inference for Recursive Cardinality Models .