A pair of computational biologists from UCLA and MIT developed a method that can fill in major gaps in large-scale epigenomics datasets, and used it to refine and expand the most comprehensive map of the human epigenome. The research was published online in Nature Biotechnology, by Jason Ernst of UCLA, and Manolis Kellis of MIT, as part of their longstanding collaboration in the field of epigenomics.

All the instructions necessary to make a complete human are written in “the book of life”: the human genome, which is the totality of our DNA. Every cell has a copy of the same genome, but every cell switches different genes on, or off, leading to the diversity of functions of brain, heart, liver, skin, or blood cells in our bodies.

This choreography is accomplished in part by chemical modifications known as epigenetic marks, which are placed on the DNA or its packaging, and serve as Post-It notes on the book of life. In each cell type, different genomic locations are modified, marking up the chapters of the book that the cell will need to accomplish its functions. Different epigenetic mark combinations serve as Post-It notes of different colors, used to mark up each type of function, such as genes, control switches, or repressed regions.

The National Institutes of Health-funded Roadmap Epigenomics project published the most comprehensive view of the human epigenome on February 18 in Nature based on thousands of experiments across more than 100 tissues and cell types. Ernst, a joint first author of the consortium paper, and Kellis, the senior author, were part of the integrative analysis team, and developed several of the algorithms used for interpreting the datasets.

Ernst and Kellis realized that missing data is a common challenge for data generation projects seeking to integrate large-scale datasets.

“When considering many epigenomic marks across many cell types, some experiments are undoubtedly missing, due to sample availability or budgetary limitations”, said Ernst. “To cope, researchers are often faced with the difficult choice of utilizing a subset of the marks, or a subset of the cell types.”

To remedy this problem, the pair developed a computational method that can predict missing experiments by exploiting the correlations present in existing datasets. This allowed them to complete the epigenomic maps generated by the consortium, producing a matrix of 34 epigenomic marks in 127 cell or tissue types by predicting over 4,000 datasets, nearly three quarters of which had never been experimentally observed.
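The idea of predicting a missing experiment from correlated ones can be illustrated with a toy example. The sketch below is not the authors' method: it uses simulated signal and plain least-squares regression as a stand-in, and every name in it (`simulate_cell_type`, `fit_imputer`, `impute`) is hypothetical. A target mark is learned from two other marks in a cell type where all three were profiled, then predicted in a second cell type where the target mark is treated as missing.

```python
import numpy as np

rng = np.random.default_rng(0)
N_BINS = 2000  # number of genomic bins in the toy genome (assumption)


def simulate_cell_type(rng, n_bins):
    """Simulate two profiled marks and a third mark correlated with them."""
    marks = rng.gamma(shape=2.0, scale=1.0, size=(n_bins, 2))
    target = 1.5 * marks[:, 0] + 0.5 * marks[:, 1] + rng.normal(0.0, 0.3, n_bins)
    return marks, target


def fit_imputer(marks, target):
    """Least-squares fit of the target mark on the other marks plus an intercept."""
    design = np.column_stack([marks, np.ones(len(marks))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def impute(coef, marks):
    """Predict the target mark in a sample where it was not profiled."""
    design = np.column_stack([marks, np.ones(len(marks))])
    return design @ coef


# Cell type A: all three marks were profiled, so it can be used for training.
marks_a, target_a = simulate_cell_type(rng, N_BINS)
# Cell type B: the target mark is held out and treated as a missing experiment.
marks_b, hidden_target_b = simulate_cell_type(rng, N_BINS)

coef = fit_imputer(marks_a, target_a)
imputed_b = impute(coef, marks_b)

# Concordance between the imputed track and the held-out signal.
correlation = float(np.corrcoef(imputed_b, hidden_target_b)[0, 1])
print(f"correlation between imputed and held-out signal: {correlation:.3f}")
```

In this toy setting the prediction works only because the simulated marks are correlated by construction; with real data, the quality of any imputation depends on how strong those correlations actually are.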

To compare observed against predicted data, the researchers used more than 1,000 predicted maps that had also been experimentally observed. To their surprise, they found evidence that the imputed data (a term used in statistics for values predicted for missing data based on other available information) was often of higher overall quality.

“Imputed data showed higher concordance with gene annotations, conserved regions, and disease-associated genetic variants, and better recovery of tissue relationships and tissue-restricted elements”, said Kellis. “We believe that the increased quality comes by leveraging dozens of experiments in predicting any one of them, leading to greater resilience to experimental noise.”

The researchers estimated that with large numbers of replicates, observed data would eventually match or surpass the performance of imputed data.
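The trade-off between replicates and imputation can be sketched with a small simulation. This is an illustration under stated assumptions, not the researchers' analysis: the noise levels `OBSERVED_NOISE_SD` and `IMPUTED_ERROR_SD` are invented, the imputed track is modeled as the true signal plus a fixed error, and all function names are hypothetical. Averaging more replicates shrinks the experimental noise, so the averaged observed track eventually overtakes the imputed one.

```python
import numpy as np

rng = np.random.default_rng(1)
N_BINS = 5000

# Hypothetical "true" signal across genomic bins.
truth = rng.gamma(shape=2.0, scale=1.0, size=N_BINS)

OBSERVED_NOISE_SD = 2.0  # assumed noise in a single experiment
IMPUTED_ERROR_SD = 1.0   # assumed fixed error of the imputed track


def concordance(track, reference):
    """Pearson correlation between a track and the reference signal."""
    return float(np.corrcoef(track, reference)[0, 1])


def averaged_replicates(reference, n_replicates, rng):
    """Average n noisy replicate measurements of the same signal."""
    noise = rng.normal(0.0, OBSERVED_NOISE_SD, size=(n_replicates, reference.size))
    return (reference + noise).mean(axis=0)


imputed = truth + rng.normal(0.0, IMPUTED_ERROR_SD, size=N_BINS)
imputed_score = concordance(imputed, truth)

replicate_scores = {
    k: concordance(averaged_replicates(truth, k, rng), truth)
    for k in (1, 2, 4, 8, 16)
}

print(f"imputed track: {imputed_score:.3f}")
for k, score in replicate_scores.items():
    print(f"{k:2d} replicate(s): {score:.3f}")
```

Under these assumed noise levels, a single replicate scores below the imputed track, and the crossover comes at around four replicates, where the averaged noise variance equals the imputation error variance.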

“In practice, a large number of replicates is often cost-prohibitive, or simply not feasible due to limited biological material”, said Ernst. “In those cases, imputation achieves increased robustness by leveraging correlations between related marks and related samples.”

“Large-scale epigenome imputation can be a game-changer in planning and carrying out large projects, by prioritizing the most essential experiments, and relying on imputation for missing experiments, or when some experiments inevitably fail,” said Kellis. “We would certainly prefer having high-quality experimental data for all marks, in all cell types, in many replicates, but when this is not possible, imputed data is a clear choice according to our metrics.”

So should scientists stop generating data and simply rely on imputation? “Of course not!” Ernst replied. “It is only by leveraging the vast experimentally-profiled datasets of the consortium that we could build these imputed maps. Without large-scale experimental mapping, the ability to impute new datasets will be greatly reduced.”

The authors generated genome-wide maps that enable biologists to explore any region of the human genome, even for experiments that have not yet been carried out.

“It sounds like a scene from ‘Minority Report’, but it’s simply the sign of a maturing field”, Kellis said. “In genetics, imputation of missing data is routine, with increasingly accurate genetic reference panels. Our results suggest that we have perhaps reached that inflection point in epigenomics, where the body of experimental datasets is sufficiently large to enable highly-accurate imputation.”

“Our initial results suggest that imputed datasets will have applications in studies of gene regulation and of human disease, and open up many directions for future work,” said Ernst. “As the number of experimentally-profiled datasets grows, and as scientists become more comfortable with the concept, we expect epigenome imputation to play an increasingly central role in future projects.”

The paper was published online Feb. 18 in Nature Biotechnology, doi:10.1038/nbt.3157.

Ernst is an assistant professor in the Biological Chemistry Department in the David Geffen School of Medicine and in the Computer Science Department in the Henry Samueli School of Engineering and Applied Science. He is a member of the Interdepartmental Bioinformatics Program; the Institute for Quantitative and Computational Biosciences; the Eli & Edythe Broad Center of Regenerative Medicine & Stem Cell Research; the Jonsson Comprehensive Cancer Center; and the Molecular Biology Institute at UCLA.

Kellis is a professor of computer science at MIT and is a member of its Computer Science and Artificial Intelligence Laboratory, and an institute member of the Broad Institute.

The research was funded by the National Science Foundation through a CAREER Award, and by an Alfred P. Sloan Foundation Fellowship, both to Ernst; and by two National Human Genome Research Institute grants from the National Institutes of Health to Kellis.

Related epigenomics work by Ernst and Kellis

Discovery and characterization of chromatin states for systematic annotation of the human genome, Ernst and Kellis, Nature Biotechnology 28:817-825, July 25, 2010. doi:10.1038/nbt.1662.

Mapping and analysis of chromatin state dynamics in nine human cell types, Ernst et al., Nature 473:43-49, May 5, 2011. doi:10.1038/nature09906. In collaboration with Dr. Bradley Bernstein and colleagues.

ChromHMM: automating chromatin-state discovery and characterization, Ernst and Kellis, Nature Methods 9:215-216, February 28, 2012. doi:10.1038/nmeth.1906.

Interplay between chromatin state, regulator binding, and regulatory motifs in six human cell types, Ernst and Kellis, Genome Research 23:1142-1154, April 17, 2013. doi:10.1101/gr.144840.112.