Victor Chen and Nathan Nguyen

Machine Learning and the Future of Epigenetic Research

The technological advancements of the last several decades have far outpaced those of the centuries prior, and none more so than the introduction of the computer, which has revolutionized fields the world over, including scientific research. From this innovation have sprouted a wide variety of software programs, models of natural phenomena, automation of repetitive tasks, and ever more capabilities. One of the more recent additions to the researcher’s toolkit comes from the growing field of machine learning. It is now possible to perform concrete analysis on datasets so large that they previously would have required painstaking amounts of manual labor. Unlike algorithms that simply categorize data according to strict and often inadequate criteria, machine learning models can identify patterns and form novel groupings, something that in the past only humans could do. In many cases, they can surpass human limitations and detect patterns that professionals might miss. Within machine learning, though, there are nearly as many models as there are things to be studied. In the biological sciences, some models are far more useful than others, and this holds even more true as we narrow down to epigenetics, the study of how traits are expressed through factors other than the genetic code itself, and further into neuroepigenetics, which focuses specifically on the brain. The models discussed here are those most pertinent to epigenetics in general, but examples of methodologies used specifically in neuroepigenetic studies will also be evaluated.

In epigenetics, massive datasets are a common occurrence. Underlying genetics is a vast body of information that includes genetic variations, which are responsible for differences in traits ranging from height and skin color to predisposition or immunity to certain diseases. Within a population, further confounding variables such as environment, age, and life events also affect trait expression. By recording even a small fraction of epigenetic profiles and finding the differences and similarities between individuals, researchers can see which kinds of profiles make individuals more likely to express certain traits. With so much data, and a limited supply of underfunded and overworked researchers, manually analyzing these troves becomes unreasonable, resulting in either a costly loss of time or the exclusion of possibly valuable information. To resolve this issue, researchers can use machine learning models and let computers churn through all the variables so that the researcher can interpret the results, much as an accountant organizes incomes and expenses so that a manager can see the big picture without sorting through each individual transaction.

The two types of machine learning most relevant here are supervised and unsupervised learning. The difference between the two is whether the model trains on labeled or unlabeled data. Labeled data has been manually categorized; for example, a picture of an animal can be labeled by a person as a cat, allowing the model to refer to the label when learning. Supervised learning trains a model on this type of labeled data, allowing the model to constantly check its own predictions against the labels while training. Unsupervised learning, on the other hand, uses unlabeled data, forcing the model to determine patterns and features from the dataset on its own. Both attempt to learn from patterns and relations within the training dataset, but supervised learning models are given ground truths that can help them learn better. For example, consider a dataset containing the methylation profiles of various individuals in a population along with their associated diseases. These methylation profiles can be defined by CpG sites, which are the positions in DNA that can be methylated to cause changes in gene expression. In this case, every individual CpG site would be a feature, which could be methylated or unmethylated. Every individual would therefore have hundreds of thousands of input features that could be analyzed and correlated with the occurrence of disease. Disease would function as the label, allowing the model to learn relationships between methylation and disease. In comparison, in situations where relationships are uncertain, unsupervised learning can derive patterns even without labels. A common example is principal component analysis (PCA), which condenses many correlated features into a few components that capture most of the variation in the data. Given methylation data in conjunction with protein expression, PCA can help determine correlations between the two, which can in turn inform predictions of protein expression for a specific methylation profile. In this case, the model does not require the data to be labeled, allowing it to determine relationships between methylation and protein expression on its own.
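
As a rough illustration of how PCA condenses many correlated CpG features into a single axis of variation, the sketch below finds the first principal component by power iteration on the covariance matrix. The methylation fractions are invented for illustration, and `first_principal_component` is a hypothetical helper rather than a method taken from any study cited here; in practice a library routine would normally be used.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, per-sample scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    # Center every feature (column) on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the features (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: multiply by the covariance matrix and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered sample onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical data: rows are individuals, columns are methylation
# fractions at four CpG sites. No labels are supplied.
profiles = [
    [0.10, 0.15, 0.50, 0.52],
    [0.20, 0.22, 0.51, 0.50],
    [0.15, 0.12, 0.49, 0.51],
    [0.80, 0.85, 0.50, 0.49],
    [0.90, 0.88, 0.52, 0.50],
    [0.85, 0.90, 0.50, 0.51],
]

direction, scores = first_principal_component(profiles)
print("direction:", [round(x, 3) for x in direction])
print("scores:", [round(s, 3) for s in scores])
```

In this toy data the first two sites vary together while the last two barely change, so the component should load mainly on the first two sites and the scores should separate the first three individuals from the last three.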

Examples of supervised learning models popular in biological research include support-vector machines, linear regression, and decision trees. Support-vector machines are used in classification problems where the number of features may be high. One example is a paper by Xu et al. that attempted to identify human enhancers by analyzing low methylated regions (regions with characteristically low methylation of CpG sites) [1]. Since not all low methylated regions are enhancers, the researchers used a support-vector machine model to classify each region into one of three categories: reliable positive, likely positive, and likely negative. This allowed them to sort through hundreds of thousands of regions and find over a hundred thousand possible enhancer regions.
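
To make the idea of a support-vector machine more concrete, here is a minimal sketch of a plain two-class linear SVM trained by subgradient descent on the hinge loss. It is not the weighted, three-category model of Xu et al.; the two features per region, the labels, and the hyperparameters (`lam`, `lr`, `epochs`) are all assumptions chosen for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b by subgradient descent on the regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: step on hinge loss plus regularizer.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with enough margin: only the regularizer shrinks w.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on which side of the separating line x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical regions described by two made-up methylation features.
# Label +1 = enhancer-like (low methylation), -1 = not enhancer-like.
X = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.30],
     [0.80, 0.90], [0.90, 0.70], [0.70, 0.80]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print("training predictions:", [predict(w, b, x) for x in X])
print("new low-methylation region:", predict(w, b, [0.10, 0.10]))
print("new high-methylation region:", predict(w, b, [0.90, 0.90]))
```

A real analysis would involve far more features and would normally rely on an established SVM library, but the same principle of finding a separating boundary with the widest margin applies.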

Linear regression is a very simple example of supervised learning, which also makes it practical and common for a variety of purposes. As the name implies, a linear regression model attempts to fit a linear relationship between independent input variables and a dependent output variable. For example, the square footage of a house would be an independent input, and the model would return the dependent variable, which could be the price of the house. Training such a model involves tuning the coefficients of the linear equation to best map the relationship between the variables. One common way of doing so is an algorithm known as gradient descent, which iteratively adjusts the coefficients to move toward a minimum of the error function. In effect, the line is moved and fit to the data points until the error, typically the sum of squared vertical distances from the data points to the line, is at a minimum. This simple model can be used to quickly predict a variety of linear relationships, such as the relationship between the level of a transcription factor, which is involved in gene expression, and the amount of mRNA or protein ultimately synthesized, which can in turn affect the expression of certain traits. In such a scenario, the amount of protein synthesized would be a continuous dependent variable plotted on the y-axis, and the level of the transcription factor would be the independent variable on the x-axis. A related idea is logistic regression, which is more commonly used for classification than for regression problems. A logistic regression model returns only values between 0 and 1, representing the probability that something belongs to one category rather than another. An example can be found in a recent paper by Lian et al., who used a one-class logistic regression model to identify and score stemness (stem cell molecular processes) in medulloblastoma. This stemness was quantified by the model using DNA methylation and mRNA expression signatures [2].
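
The gradient descent procedure described above for linear regression can be sketched in a few lines. This is only an illustration under assumed values: the transcription factor levels and protein amounts are invented so that they lie exactly on the line y = 2x + 1, and the learning rate and step count are arbitrary choices that happen to work for this data.

```python
def fit_line(xs, ys, lr=0.02, steps=5000):
    """Fit y = slope * x + intercept by gradient descent on the mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residual of the current line at every data point.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2 / n) * sum(errors)
        # Move both coefficients a small step downhill.
        slope -= lr * grad_slope
        intercept -= lr * grad_intercept
    return slope, intercept

# Hypothetical measurements: transcription factor level (x) and
# amount of protein synthesized (y), in arbitrary units.
tf_level = [1.0, 2.0, 3.0, 4.0, 5.0]
protein = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(tf_level, protein)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Because the squared error of a straight-line fit has a single minimum, gradient descent with a suitably small learning rate settles on the same line a closed-form least-squares solution would give.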

Finally, decision tree learning uses a tree of branching decisions, each testing a single feature, to determine the class or value of an input. The fact that each decision follows a discrete branch sets it apart from most other machine learning models because it makes the model interpretable. Being able to visualize the path that leads to an output makes the output easy to digest and understand, and makes it easier to diagnose any issues the model may have. A paper by Celli et al. demonstrates the practicality of decision tree models by using one to classify DNA methylation data [3]. The relationship between DNA methylation and cancer formation is ambiguous, because methylation could inhibit either tumor-suppressor genes (genes that could cause cancer if inhibited) or oncogenes (genes that could cause cancer if activated). Given this ambiguity, and the massive amounts of DNA data available with the rise of technologies like next-generation sequencing, the researchers were interested in developing a new way to classify massive DNA methylation datasets for use in identifying cancer markers. Using decision trees, they were able to develop a classification method that handles hundreds of thousands of CpG features.
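
The interpretability of decision trees comes from the fact that each branch is a simple threshold test on one feature. The sketch below shows how a single such test might be chosen, by searching every feature and threshold for the split with the lowest Gini impurity. It is a one-level illustration only, not the method of Celli et al., and the methylation values and tumor labels are invented.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # fraction of samples labeled 1
    return 2 * p * (1 - p)

def best_split(X, y):
    """Return (weighted impurity, feature index, threshold) of the best single split."""
    best = None
    n = len(y)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between neighboring observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[j] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best

# Hypothetical samples: methylation fractions at three CpG sites,
# with label 1 = tumor and 0 = normal.
X = [[0.20, 0.90, 0.50], [0.30, 0.80, 0.40], [0.25, 0.85, 0.60],
     [0.30, 0.10, 0.50], [0.20, 0.20, 0.45], [0.28, 0.15, 0.55]]
y = [1, 1, 1, 0, 0, 0]

impurity, feature, threshold = best_split(X, y)
print(f"split on CpG site {feature} at methylation <= {threshold}")
```

A full decision tree repeats this search recursively on each resulting subset, which is why the final model can be read as a sequence of plain yes-or-no questions.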

While there are many more types of supervised learning, it is also important to discuss unsupervised learning models. As previously mentioned, supervised learning models require data to be labeled when training, which means that the model is able to use predefined structures and categories to decide how to classify new examples. Unsupervised learning models do not have these labels, requiring the model to determine patterns and relationships on its own in order to perform analysis on any test data. The most popular example of this type of machine learning is clustering, which, as the name implies, groups data points into clusters of similar items. Some of these groups can be more or less similar to one another, but the end result is a grouping of data that has no inherent labeling. Principal component analysis, mentioned earlier, is not itself a clustering method, but it is often used alongside clustering to reduce the number of features so that relationships and groups of data points are easier to find. A simple and popular clustering method is k-means clustering. The k in k-means refers to the number of clusters into which the data is to be grouped. First, k distinct data points are selected as the initial center of each cluster, and every other data point is assigned to whichever of these initial points it is closest to. Once all points are assigned, the mean of each cluster is calculated, and data points are reassigned if they end up closer to the mean of a different cluster; this repeats until the assignments stop changing, producing the final clusters. The whole process is then repeated on the same dataset with different initial points starting each cluster, which can produce different end results. The best clustering is the one with the least total variation within its clusters. This method was used by Corces et al. to identify tumor-suppressor genes and oncogenes by analyzing ATAC-seq data from primary human cancers, where ATAC-seq is an assay that measures chromatin accessibility throughout the genome [4]. Chromatin accessibility is controlled by epigenetics and regulates the ability of DNA to be expressed. These models demonstrate how data analysis has progressed by leaps and bounds, moving beyond the limitations imposed by the restrictive algorithms of the past and becoming able to sort through the immense troves of data we are constantly generating.
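
The k-means procedure described above, including the restarts from different initial points, can be sketched as follows. The data points, the number of restarts, and the random seed are assumptions made for illustration, and the helper names are hypothetical; this is not the pipeline used by Corces et al.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Group every point with its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run from k randomly chosen distinct starting points."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved, so assignments are final
            break
        centers = new_centers
    clusters = assign(points, centers)
    # Total variation: summed squared distance of each point to its cluster center.
    variation = sum(sq_dist(p, centers[i]) for i, c in enumerate(clusters) for p in c)
    return centers, clusters, variation

def kmeans(points, k, restarts=10, seed=0):
    """Repeat k-means from different starts and keep the lowest-variation result."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[2])

# Hypothetical, unlabeled accessibility measurements for six samples.
points = [(0.10, 0.20), (0.15, 0.10), (0.20, 0.25),
          (0.80, 0.90), (0.85, 0.80), (0.90, 0.95)]

centers, clusters, variation = kmeans(points, k=2)
for center, members in zip(centers, clusters):
    print("center", tuple(round(c, 3) for c in center), "->", members)
```

Keeping the run with the smallest total within-cluster variation is what guards against an unlucky choice of starting points.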

As technology in fields like DNA sequencing, genome-wide association studies, and other sources of massive datasets continues to advance, efficient and accurate methods of analyzing the millions of data points we generate become increasingly critical. Machine learning models, along with the hardware and infrastructure that support them, such as cloud computing, will become ubiquitous if they have not already. While machine learning will not be replacing the jobs of researchers anytime soon, it may soon become an indispensable tool for any sort of data analysis or research in general, able to be applied adaptably to nearly any situation and to provide novel paths for future research as well. How these developments will change the field remains to be seen, but massive ground has already been gained in just the last few years, with more to come.


  1. Xu J, Hu H, Dai Y. LMethyR-SVM: Predict Human Enhancers Using Low Methylated Regions based on Weighted Support Vector Machines. PLOS ONE. 2016;11(9):e0163491. doi:10.1371/journal.pone.0163491
  2. Lian H, Han YP, Zhang YC, et al. Integrative analysis of gene expression and DNA methylation through one-class logistic regression machine learning identifies stemness features in medulloblastoma. Molecular Oncology. 2019;13(10):2227-2245. doi:10.1002/1878-0261.12557
  3. Celli F, Cumbo F, Weitschek E. Classification of Large DNA Methylation Datasets for Identifying Cancer Drivers. Big Data Research. 2018;13:21-28. doi:10.1016/j.bdr.2018.02.005
  4. Corces MR, Granja JM, Shams S, et al. The chromatin accessibility landscape of primary human cancers. Science. 2018;362(6413):eaav1898. doi:10.1126/science.aav1898