ENCODE Turns Human Genome From Sequence to Machine
Adapted from a series that originally appeared on the Alzheimer Research Forum.
This is Part 1 of a two-part story. See also Part 2.
24 September 2012. When the first human genome sequence was finished in 2003, it quickly became clear that its seemingly unending stream of letters was not enough to explain what makes people tick. All the moving parts that bring the DNA code to life needed to be understood as well. To address this, researchers launched the Encyclopedia of DNA Elements (ENCODE) that same year. More than 440 scientists collaborated to identify functional parts of the genome, namely those regions that regulate how, when, and where different genes are turned on or off. This month, the consortium released its latest data in a fusillade of papers, including six in the September 6 Nature. Together, the scientists found that about 80 percent of the genome, much of which was previously considered "junk" DNA, has at least one biochemical function in some cells. Though the scientists have not yet examined every human cell type, this latest effort widens their window into the intricacies of the genetic code. The data could help researchers understand how genetic variants found outside of genes can alter risk for disease, including neurodegenerative disorders such as Alzheimer's (see Part 2). "This short-circuits a lot of work that you'd have to do to figure out signals from a genetic study," said Gerard Schellenberg, University of Pennsylvania School of Medicine, Philadelphia.
ENCODE is funded by the National Human Genome Research Institute (NHGRI). Its public consortium of scientists, who hail from more than 30 institutions around the world, has so far examined 147 cell types in about 1,640 separate datasets to identify functional elements in human DNA. A Nature special feature helps navigate the findings, allowing users to search by topic thread or by individual paper. In more than 30 published papers, scientists map the genome by regions of transcription, transcription factor binding, histone modification, and methylation. A five-minute Nature YouTube video explains the rationale behind ENCODE.
"At the start of this project, most people thought that if genes make up 2 percent of the genome, maybe another 3 or 4 percent made up the instruction that controls them," said John Stamatoyannopoulos, University of Washington School of Medicine, Seattle, an ENCODE project leader. "It has been surprising just how much of the genome actually is involved."
One interesting map for those studying Alzheimer's disease and other neurodegenerative disorders pinpoints the genome's regulatory sites. These are the enhancers, promoters, insulators, and silencers (see Thurman et al., 2012). It turns out that almost 2.9 million regulatory regions populate the genome, 10 times more than expected at ENCODE's outset. Only 200,000 are active in any given cell at one time.
The researchers discovered the regulatory elements by identifying DNA sequences that had unwound from their histone spools in preparation to affect transcription elsewhere in the genome. By determining which stretches of DNA were active at the same time as certain promoters, researchers matched 20 percent of regulatory elements to the specific genes they regulate. Interestingly, non-coding single-nucleotide polymorphisms that turn up in genome-wide association studies (GWAS) are often found in these very regions (see Part 2 of this series). In the Alzheimer's field, researchers have associated non-coding polymorphisms with disease, but are struggling to interpret how they might alter risk.
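The matching idea described above, pairing a regulatory element with a promoter that is active in the same cell types, can be illustrated with a small sketch. This is not the consortium's pipeline; the element and gene names, the toy accessibility profiles, and the 0.7 correlation cutoff are all assumptions chosen for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length activity profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-activity signal
    return cov / sqrt(var_x * var_y)

def link_elements_to_promoters(elements, promoters, min_r=0.7):
    """For each element, keep the best-correlated promoter at or above min_r.

    min_r is a hypothetical cutoff for this sketch.
    """
    links = {}
    for element_name, element_profile in elements.items():
        best = None
        for promoter_name, promoter_profile in promoters.items():
            r = pearson(element_profile, promoter_profile)
            if r >= min_r and (best is None or r > best[1]):
                best = (promoter_name, r)
        if best is not None:
            links[element_name] = best
    return links

# Toy data (made up): 1 = open chromatin, 0 = closed, across five cell types.
elements = {
    "distal_A": [1, 1, 0, 0, 1],
    "distal_B": [0, 1, 0, 1, 0],
}
promoters = {
    "gene_X": [1, 1, 0, 0, 1],
    "gene_Y": [0, 0, 1, 1, 0],
}

links = link_elements_to_promoters(elements, promoters)
for element_name, (promoter_name, r) in links.items():
    print(f"{element_name} -> {promoter_name} (r = {r:.2f})")
```

In this toy example only distal_A is linked, to gene_X, because the two are open in exactly the same cell types. distal_B correlates only weakly with either promoter and is left unassigned, which mirrors the article's point that only a fraction of elements could be matched to genes.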
"Understanding the genome sequence got us a long way in being able to do genetic studies of disease, but having this map of genome function is important for understanding the actual biology that leads to disease," said Stamatoyannopoulos.
Mapping regulatory elements will also allow researchers to find potentially important genomic regions for disease treatment, including for Alzheimer's disease (AD), said Schellenberg. He noted, for example, that tinkering with the regulation of β-secretase (BACE1), one of the enzymes that produce Aβ, is one possible way to treat AD. "You can go into ENCODE and at least begin to see the potential regulatory units and networks to target," he said.
Companion papers offer more insight into the control of gene transcription. For instance, researchers led by Job Dekker and Amartya Sanyal, both of the University of Massachusetts Medical School in Worcester, have begun to map long-range interactions between enhancers and gene promoters that are quite distant from each other in the genome but may come into contact when DNA is folded and coiled around histones (see Sanyal et al., 2012). They have mapped only 1 percent of the genome so far, but in doing so they found, among other things, that both enhancers and promoters can interact with multiple long-range partners.
In another study, scientists including Mark Gerstein of Yale University, New Haven, Connecticut, and Michael Snyder of Stanford University, California, have worked out how and where different combinations of 119 transcription factors bind to DNA to regulate gene activity (see Gerstein et al., 2012). Hundreds of other transcription factors remain to be analyzed; even so, this is one step toward understanding how networks of these regulatory proteins function and contribute to human disease, wrote the authors.
One shortcoming of ENCODE is that it did not sample brain tissue, Schellenberg noted. "There are many different kinds of cell types in the brain, many different kinds of neurons, and each probably has its own gene regulatory network." Stamatoyannopoulos said that researchers hope in the future to look at more cell types associated with human disease, such as those found in the brain.
ENCODE researchers still have a long road ahead before they reach a thorough understanding of how the genome behaves at different stages of development, in different tissues, and within individual cells of a heterogeneous tissue. Phase 3 of the project is dedicated to filling in these blanks. However, firsthand observation may not be necessary or even possible for every conceivable situation, wrote Eran Segal in an accompanying News and Views article. He suggested that when scientists learn enough about the genome, they may be able to predict how it will behave in different circumstances. "We must work towards deriving quantitative models that integrate the relevant protein, RNA, and chromatin components; describe how these components interact with each other; how they bind the genome; and how these binding events regulate transcription," he said. "If successful, such models will be able to predict the genome’s function at times and in settings that have not been directly measured."—Gwyneth Dickey Zakaib.
ENCODE Project Consortium, Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, Snyder M. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012 Sep 6;489(7414):57-74.
Ecker JR, Bickmore WA, Barroso I, Pritchard JK, Gilad Y, Segal E. Genomics: ENCODE explained. Nature. 2012 Sep 6;489(7414):52-5.
Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC, Stergachis AB, Wang H, Vernot B, Garg K, John S, Sandstrom R, Bates D, Boatman L, Canfield TK, Diegel M, Dunn D, Ebersol AK, Frum T, Giste E, Johnson AK, Johnson EM, Kutyavin T, Lajoie B, Lee BK, Lee K, London D, Lotakis D, Neph S, Neri F, Nguyen ED, Qu H, Reynolds AP, Roach V, Safi A, Sanchez ME, Sanyal A, Shafer A, Simon JM, Song L, Vong S, Weaver M, Yan Y, Zhang Z, Zhang Z, Lenhard B, Tewari M, Dorschner MO, Hansen RS, Navas PA, Stamatoyannopoulos G, Iyer VR, Lieb JD, Sunyaev SR, Akey JM, Sabo PJ, Kaul R, Furey TS, Dekker J, Crawford GE, Stamatoyannopoulos JA. The accessible chromatin landscape of the human genome. Nature. 2012 Sep 6;489(7414):75-82.
Sanyal A, Lajoie BR, Jain G, Dekker J. The long-range interaction landscape of gene promoters. Nature. 2012 Sep 6;489(7414):109-13.
Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ, Khurana E, Rozowsky J, Alexander R, Min R, Alves P, Abyzov A, Addleman N, Bhardwaj N, Boyle AP, Cayting P, Charos A, Chen DZ, Cheng Y, Clarke D, Eastman C, Euskirchen G, Frietze S, Fu Y, Gertz J, Grubert F, Harmanci A, Jain P, Kasowski M, Lacroute P, Leng J, Lian J, Monahan H, O'Geen H, Ouyang Z, Partridge EC, Patacsil D, Pauli F, Raha D, Ramirez L, Reddy TE, Reed B, Shi M, Slifer T, Wang J, Wu L, Yang X, Yip KY, Zilberman-Schapira G, Batzoglou S, Sidow A, Farnham PJ, Myers RM, Weissman SM, Snyder M. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012 Sep 6;489(7414):91-100.