Publications

2013
Gokcumen O, Tischler V, Tica J, Zhu Q, Iskow RC, Lee E, Fritz MH-Y, Langdon A, Stütz AM, Pavlidis P, Benes V, Mills RE, Park PJ, Lee C, Korbel JO. Primate genome architecture influences structural variation mechanisms and functional consequences. Proc Natl Acad Sci U S A 2013;110(39):15764-9.
Although nucleotide resolution maps of genomic structural variants (SVs) have provided insights into the origin and impact of phenotypic diversity in humans, comparable maps in nonhuman primates have thus far been lacking. Using massively parallel DNA sequencing, we constructed fine-resolution genomic structural variation maps in five chimpanzees, five orang-utans, and five rhesus macaques. The SV maps, which are comprised of thousands of deletions, duplications, and mobile element insertions, revealed a high activity of retrotransposition in macaques compared with great apes. By comparison, nonallelic homologous recombination is specifically active in the great apes, which is correlated with architectural differences between the genomes of great apes and macaque. Transcriptome analyses across nonhuman primates and humans revealed effects of species-specific whole-gene duplication on gene expression. We identified 13 gene duplications coinciding with the species-specific gain of tissue-specific gene expression in keeping with a role of gene duplication in the promotion of diversification and the acquisition of unique functions. Differences in the present day activity of SV formation mechanisms that our study revealed may contribute to ongoing diversification and adaptation of great ape and Old World monkey lineages.
Brennan CW, Verhaak RGW, McKenna A, Campos B, Noushmehr H, Salama SR, Zheng S, Chakravarty D, Sanborn ZJ, Berman SH, Beroukhim R, Bernard B, Wu C-J, Genovese G, Shmulevich I, Barnholtz-Sloan J, Zou L, Vegesna R, Shukla SA, Ciriello G, Yung WK, Zhang W, Sougnez C, Mikkelsen T, Aldape K, Bigner DD, Van Meir EG, Prados M, Sloan A, Black KL, Eschbacher J, Finocchiaro G, Friedman W, Andrews DW, Guha A, Iacocca M, O'Neill BP, Foltz G, Myers J, Weisenberger DJ, Penny R, Kucherlapati R, Perou CM, Hayes ND, Gibbs R, Marra M, Mills GB, Lander E, Spellman P, Wilson R, Sander C, Weinstein J, Meyerson M, Gabriel S, Laird PW, Haussler D, Getz G, Chin L, TCGA Research Network (incl. Lee E). The somatic genomic landscape of glioblastoma. Cell 2013;155(2):462-77.
We describe the landscape of somatic genomic alterations based on multidimensional and comprehensive characterization of more than 500 glioblastoma tumors (GBMs). We identify several novel mutated genes as well as complex rearrangements of signature receptors, including EGFR and PDGFRA. TERT promoter mutations are shown to correlate with elevated mRNA expression, supporting a role in telomerase reactivation. Correlative analyses confirm that the survival advantage of the proneural subtype is conferred by the G-CIMP phenotype, and MGMT DNA methylation may be a predictive biomarker for treatment response only in classical subtype GBM. Integrative analysis of genomic and proteomic profiles challenges the notion of therapeutic inhibition of a pathway as an alternative to inhibition of the target itself. These data will facilitate the discovery of therapeutic and diagnostic target candidates, the validation of research and clinical observations and the generation of unanticipated hypotheses that can advance our molecular understanding of this lethal cancer.
2012
Cancer Genome Atlas Research Network (incl. Lee E). Comprehensive genomic characterization of squamous cell lung cancers. Nature 2012;489(7417):519-25.
Lung squamous cell carcinoma is a common type of lung cancer, causing approximately 400,000 deaths per year worldwide. Genomic alterations in squamous cell lung cancers have not been comprehensively characterized, and no molecularly targeted agents have been specifically developed for its treatment. As part of The Cancer Genome Atlas, here we profile 178 lung squamous cell carcinomas to provide a comprehensive landscape of genomic and epigenomic alterations. We show that the tumour type is characterized by complex genomic alterations, with a mean of 360 exonic mutations, 165 genomic rearrangements, and 323 segments of copy number alteration per tumour. We find statistically recurrent mutations in 11 genes, including mutation of TP53 in nearly all specimens. Previously unreported loss-of-function mutations are seen in the HLA-A class I major histocompatibility gene. Significantly altered pathways included NFE2L2 and KEAP1 in 34%, squamous differentiation genes in 44%, phosphatidylinositol-3-OH kinase pathway genes in 47%, and CDKN2A and RB1 in 72% of tumours. We identified a potential therapeutic target in most tumours, offering new avenues of investigation for the treatment of squamous cell lung cancers.
Cancer Genome Atlas Network (incl. Lee E). Comprehensive molecular characterization of human colon and rectal cancer. Nature 2012;487(7407):330-7.
To characterize somatic alterations in colorectal carcinoma, we conducted a genome-scale analysis of 276 samples, analysing exome sequence, DNA copy number, promoter methylation and messenger RNA and microRNA expression. A subset of these samples (97) underwent low-depth-of-coverage whole-genome sequencing. In total, 16% of colorectal carcinomas were found to be hypermutated: three-quarters of these had the expected high microsatellite instability, usually with hypermethylation and MLH1 silencing, and one-quarter had somatic mismatch-repair gene and polymerase ε (POLE) mutations. Excluding the hypermutated cancers, colon and rectum cancers were found to have considerably similar patterns of genomic alteration. Twenty-four genes were significantly mutated, and in addition to the expected APC, TP53, SMAD4, PIK3CA and KRAS mutations, we found frequent mutations in ARID1A, SOX9 and FAM123B. Recurrent copy-number alterations include potentially drug-targetable amplifications of ERBB2 and newly discovered amplification of IGF2. Recurrent chromosomal translocations include the fusion of NAV2 and WNT pathway member TCF7L1. Integrative analyses suggest new markers for aggressive colorectal carcinoma and an important role for MYC-directed transcriptional activation and repression.
Lee E, Iskow R, Yang L, Gokcumen O, Haseley P, Luquette LJ, Lohr JG, Harris CC, Ding L, Wilson RK, Wheeler DA, Gibbs RA, Kucherlapati R, Lee C, Kharchenko PV**, Park PJ**. Landscape of somatic retrotransposition in human cancers. Science 2012;337(6097):967-71.
Transposable elements (TEs) are abundant in the human genome, and some are capable of generating new insertions through RNA intermediates. In cancer, the disruption of cellular mechanisms that normally suppress TE activity may facilitate mutagenic retrotranspositions. We performed single-nucleotide resolution analysis of TE insertions in 43 high-coverage whole-genome sequencing data sets from five cancer types. We identified 194 high-confidence somatic TE insertions, as well as thousands of polymorphic TE insertions in matched normal genomes. Somatic insertions were present in epithelial tumors but not in blood or brain cancers. Somatic L1 insertions tend to occur in genes that are commonly mutated in cancer, disrupt the expression of the target genes, and are biased toward regions of cancer-specific DNA hypomethylation, highlighting their potential impact in tumorigenesis.
Evrony GD*, Cai X*, Lee E, Hills BL, Elhosary PC, Lehmann HS, Parker JJ, Atabay KD, Gilmore EC, Poduri A, Park PJ, Walsh CA. Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell 2012;151(3):483-96.
A major unanswered question in neuroscience is whether there exists genomic variability between individual neurons of the brain, contributing to functional diversity or to an unexplained burden of neurological disease. To address this question, we developed a method to amplify genomes of single neurons from human brains. Because recent reports suggest frequent LINE-1 (L1) retrotransposition in human brains, we performed genome-wide L1 insertion profiling of 300 single neurons from cerebral cortex and caudate nucleus of three normal individuals, recovering >80% of germline insertions from single neurons. While we find somatic L1 insertions, we estimate <0.6 unique somatic insertions per neuron, and most neurons lack detectable somatic insertions, suggesting that L1 is not a major generator of neuronal diversity in cortex and caudate. We then genotyped single cortical cells to characterize the mosaicism of a somatic AKT3 mutation identified in a child with hemimegalencephaly. Single-neuron sequencing allows systematic assessment of genomic diversity in the human brain.
2011
Xi R, Hadjipanayis AG, Luquette LJ, Kim T-M, Lee E, Zhang J, Johnson MD, Muzny DM, Wheeler DA, Gibbs RA, Kucherlapati R, Park PJ. Copy number variation detection in whole-genome sequencing data using the Bayesian information criterion. Proc Natl Acad Sci U S A 2011;108(46):E1128-36.
DNA copy number variations (CNVs) play an important role in the pathogenesis and progression of cancer and confer susceptibility to a variety of human disorders. Array comparative genomic hybridization has been used widely to identify CNVs genome wide, but the next-generation sequencing technology provides an opportunity to characterize CNVs genome wide with unprecedented resolution. In this study, we developed an algorithm to detect CNVs from whole-genome sequencing data and applied it to a newly sequenced glioblastoma genome with a matched control. This read-depth algorithm, called BIC-seq, can accurately and efficiently identify CNVs via minimizing the Bayesian information criterion. Using BIC-seq, we identified hundreds of CNVs as small as 40 bp in the cancer genome sequenced at 10× coverage, whereas we could only detect large CNVs (> 15 kb) in the array comparative genomic hybridization profiles for the same genome. Eighty percent (14/16) of the small variants tested (110 bp to 14 kb) were experimentally validated by quantitative PCR, demonstrating high sensitivity and true positive rate of the algorithm. We also extended the algorithm to detect recurrent CNVs in multiple samples as well as deriving error bars for breakpoints using a Gibbs sampling approach. We propose this statistical approach as a principled yet practical and efficient method to estimate CNVs in whole-genome sequencing data.
Cancer Genome Atlas Research Network (incl. Lee E). Integrated genomic analyses of ovarian carcinoma. Nature 2011;474(7353):609-15.
A catalogue of molecular aberrations that cause ovarian cancer is critical for developing and deploying therapies that will improve patients' lives. The Cancer Genome Atlas project has analysed messenger RNA expression, microRNA expression, promoter methylation and DNA copy number in 489 high-grade serous ovarian adenocarcinomas and the DNA sequences of exons from coding genes in 316 of these tumours. Here we report that high-grade serous ovarian cancer is characterized by TP53 mutations in almost all tumours (96%); low prevalence but statistically recurrent somatic mutations in nine further genes including NF1, BRCA1, BRCA2, RB1 and CDK12; 113 significant focal DNA copy number aberrations; and promoter methylation events involving 168 genes. Analyses delineated four ovarian cancer transcriptional subtypes, three microRNA subtypes, four promoter methylation subtypes and a transcriptional signature associated with survival duration, and shed new light on the impact that tumours with BRCA1/2 (BRCA1 or BRCA2) and CCNE1 aberrations have on survival. Pathway analyses suggested that homologous recombination is defective in about half of the tumours analysed, and that NOTCH and FOXM1 signalling are involved in serous ovarian cancer pathophysiology.
Lee S, Lee E, Lee KH, Lee D. Predicting disease phenotypes based on the molecular networks with condition-responsive correlation. Int J Data Min Bioinform 2011;5(2):131-42.
Network-based methods using molecular interaction networks integrated with gene expression profiles have been proposed to solve problems that arise from the small number of samples compared with the large number of predictors. However, previous network-based methods have focused only on the expression levels of proteins, the nodes in the network, rather than on the identification of condition-responsive interactions. We propose a novel network-based classification, which focuses on both nodes with discriminative expression levels and edges with Condition-Responsive Correlations (CRCs) across two phenotypes. We found that modules with condition-responsive interactions provide candidate molecular models for diseases and show improved performance compared with conventional gene-centric classification methods.
Paik H, Lee E, Park I, Kim J, Lee D. Prediction of cancer prognosis with the genetic basis of transcriptional variations. Genomics 2011;97(6):350-7.
Phenotypes of diseases, including prognosis, are likely to have complex etiologies and be derived from interactive mechanisms, including genetic and protein interactions. Many computational methods have been used to predict survival outcomes without explicitly identifying interactive effects, such as the genetic basis for transcriptional variations. We have therefore proposed a classification method based on the interaction between genotype and transcriptional expression features (CORE-F). This method considers the overall "genetic architecture," referring to genetically based transcriptional alterations that influence prognosis. In comparing the performance of CORE-F with the ensemble tree, the best-performing method predicting patient survival, we found that CORE-F outperformed the ensemble tree (mean AUC, 0.85 vs. 0.72). Moreover, the trained associations in the CORE-F successfully identified the genetic mechanisms underlying survival outcomes at the interaction-network level.
2010
Jung J, Ryu T, Hwang Y, Lee E, Lee D. Prediction of extracellular matrix proteins based on distinctive sequence and domain characteristics. J Comput Biol 2010;17(1):97-105.
Extracellular matrix (ECM) proteins are secreted to the exterior of the cell, and function as mediators between resident cells and the external environment. These proteins not only support cellular structure but also participate in diverse processes, including growth, hormonal response, homeostasis, and disease progression. Despite their importance, current knowledge of the number and functions of ECM proteins is limited. Here, we propose a computational method to predict ECM proteins. Specific features, such as ECM domain score and repetitive residues, were utilized for prediction. Based on previously employed and newly generated features, discriminatory characteristics for ECM protein categorization were determined, which significantly improved the performance of Random Forest and support vector machine (SVM) classification. We additionally predicted novel ECM proteins from non-annotated human proteins, validated with gene ontology and earlier literature. Our novel prediction method is available at biosoft.kaist.ac.kr/ecm.
Paik H, Lee E, Lee D. Relationships between genetic polymorphisms and transcriptional profiles for outcome prediction in anticancer agent treatment. BMB Rep 2010;43(12):836-41.
In the era of personal genomics, predicting the individual response to drug-treatment is a challenge of biomedical research. The aim of this study was to validate whether interaction information between genetic and transcriptional signatures are promising features to predict a drug response. Because drug resistance/susceptibilities result from the complex associations of genetic and transcriptional activities, we predicted the inter-relationships between genetic and transcriptional signatures. With this concept, captured genetic polymorphisms and transcriptional profiles were prepared in cancer samples. By splitting ninety-nine samples into a trial set (n = 30) and a test set (n = 69), the outperformance of relationship-focused model (0.84 of area under the curve in trial set, P = 2.90 x 10⁻⁴) was presented in the trial set and validated in the test set, respectively. The prediction results of modeling show that considering the relationships between genetic and transcriptional features is an effective approach to determine outcome predictions of drug-treatment.
2009
Lee E, Jung H, Radivojac P, Kim J-W, Lee D. Analysis of AML genes in dysregulated molecular networks. BMC Bioinformatics 2009;10 Suppl 9:S2.
BACKGROUND: Identifying disease causing genes and understanding their molecular mechanisms are essential to developing effective therapeutics. Thus, several computational methods have been proposed to prioritize candidate disease genes by integrating different data types, including sequence information, biomedical literature, and pathway information. Recently, molecular interaction networks have been incorporated to predict disease genes, but most of those methods do not utilize invaluable disease-specific information available in mRNA expression profiles of patient samples. RESULTS: Through the integration of protein-protein interaction networks and gene expression profiles of acute myeloid leukemia (AML) patients, we identified subnetworks of interacting proteins dysregulated in AML and characterized known mutation genes causally implicated to AML embedded in the subnetworks. The analysis shows that the set of extracted subnetworks is a reservoir rich in AML genes reflecting key leukemogenic processes such as myeloid differentiation. CONCLUSION: We showed that the integrative approach both utilizing gene expression profiles and molecular networks could identify AML causing genes most of which were not detectable with gene expression analysis alone due to the minor changes in mRNA level.
Won H-H, Park I, Lee E, Kim J-W, Lee D. Comparative analysis of the JAK/STAT signaling through erythropoietin receptor and thrombopoietin receptor using a systems approach. BMC Bioinformatics 2009;10 Suppl 1:S53.
BACKGROUND: The Janus kinase-signal transducer and activator of transcription (JAK/STAT) pathway is one of the most important targets for myeloproliferative disorder (MPD). Although several efforts toward modeling the pathway using systems biology have been successful, the pathway was not fully investigated in regard to understanding pathological context and to model receptor kinetics and mutation effects. RESULTS: We have performed modeling and simulation studies of the JAK/STAT pathway, including the kinetics of two associated receptors (the erythropoietin receptor and thrombopoietin receptor) with the wild type and a recently reported mutation (JAK2V617F) of the JAK2 protein. CONCLUSION: We found that the different kinetics of those two receptors might be important factors that affect the sensitivity of JAK/STAT signaling to the mutation effect. In addition, our simulation results support clinically observed pathological differences between the two subtypes of MPD with respect to the JAK2V617F mutation.
Jung H, Lee E, Kim J-W, Lee D. Pathway level analysis by augmenting activities of transcription factor target genes. IET Syst Biol 2009;3(6):534-42.
Many approaches to discovering significant pathways in gene expression profiles have been developed to facilitate biological interpretation and hypothesis generation. In this work, the authors propose a pathway identification scheme integrating the activity of pathway member genes with that of target genes of transcription factors (TFs) in the same pathway by the weighted Z-method. The authors evaluated the integrative scoring scheme in gene expression profiles of essential thrombocythemia patients with JAK2V617F mutation status, primary breast tumour samples with the status of metastasis occurrence, and two independent lung cancer expression profiles with their prognosis, and found that their approach identified cancer-type-specific pathways better than gene set enrichment analysis (GSEA) and Tian's method using the original pathways (pathways that have TFs from database) and the extended pathways (including target genes of TFs of the original pathways). The success of the scheme implies that adding information on transcriptional regulation is a better way of utilising mRNA measurements for estimating differential activities of pathways from gene expression profiles more exactly.
2008
Lee E*, Chuang H-Y*, Kim J-W, Ideker T**, Lee D**. Inferring pathway activity toward precise disease classification [Internet]. PLoS Comput Biol 2008;4(11):e1000217.
The advent of microarray technology has made it possible to classify disease states based on gene expression profiles of patients. Typically, marker genes are selected by measuring the power of their expression profiles to discriminate among patients of different disease states. However, expression-based classification can be challenging in complex diseases due to factors such as cellular heterogeneity within a tissue sample and genetic heterogeneity across patients. A promising technique for coping with these challenges is to incorporate pathway information into the disease classification procedure in order to classify disease based on the activity of entire signaling pathways or protein complexes rather than on the expression levels of individual genes or proteins. We propose a new classification method based on pathway activities inferred for each patient. For each pathway, an activity level is summarized from the gene expression levels of its condition-responsive genes (CORGs), defined as the subset of genes in the pathway whose combined expression delivers optimal discriminative power for the disease phenotype. We show that classifiers using pathway activity achieve better performance than classifiers based on individual gene expression, for both simple and complex case-control studies including differentiation of perturbed from non-perturbed cells and subtyping of several different kinds of cancer. Moreover, the new method outperforms several previous approaches that use a static (i.e., non-conditional) definition of pathways. Within a pathway, the identified CORGs may facilitate the development of better diagnostic markers and the discovery of core alterations in human disease.
2007
Chuang H-Y*, Lee E*, Liu Y-T, Lee D, Ideker T. Network-based classification of breast cancer metastasis [Internet]. Mol Syst Biol 2007;3:140.
Mapping the pathways that give rise to metastasis is one of the key challenges of breast cancer research. Recently, several large-scale studies have shed light on this problem through analysis of gene expression profiles to identify markers correlated with metastasis. Here, we apply a protein-network-based approach that identifies markers not as individual genes but as subnetworks extracted from protein interaction databases. The resulting subnetworks provide novel hypotheses for pathways involved in tumor progression. Although genes with known breast cancer mutations are typically not detected through analysis of differential expression, they play a central role in the protein network by interconnecting many differentially expressed genes. We find that the subnetwork markers are more reproducible than individual marker genes selected without network information, and that they achieve higher accuracy in the classification of metastatic versus non-metastatic tumors.
