ScienceNet (科学网)


Tag: Diva

Related posts

Forum Author Replies/Views Last post

No related content

Related blog entries

Love's Notation
njnuzhangwei 2013-2-28 16:58
http://web.ics.purdue.edu/~nowack/geos557/lecture8-dir/lecture8.htm

Purdue University, EAS 557 Introduction to Seismology, Robert L. Nowack

Lecture 8: Constitutive Relations

We have so far not specified the relationship between displacement and forces in the continuum. The relationship between strain ($e_{ij}$) and stress ($\tau_{ij}$) is termed a constitutive relation. Constitutive relations can be {linear/nonlinear}, {time dependent/time independent}, {reversible/nonreversible}.

In general terms, the behavior of a solid under stress can be roughly characterized by a stress-strain relation having some, or all, of the following features:

1) Linear elastic (i.e., Hookean)
- $\tau_{ij}$ is a linear function of $e_{ij}$
- reversible process (recoverable strain energy)
- the slope of the stress-strain curve gives the elastic constants

2) Nonlinear elastic
- $\tau_{ij}$ is a nonlinear function of $e_{ij}$
- reversible: strain energy is recoverable
- characteristic of soils

3) Yield region: beyond the yield point or elastic limit
- ductile region, plastic flow
- energy is dissipated
- unloading leaves a permanent offset (grain sliding and rotation, microfracturing)
- strain hardening
This could be accompanied by dilatancy, which is an increase in specific volume resulting in a decrease of seismic velocities. In the 1970s, this was thought to have great potential for predicting earthquakes.

4) Failure: the region beyond the ultimate strength of the material
- proceeds catastrophically to failure
- the condition can be very unstable, ranging over a cascade of distance scales

Viscoelasticity: this refers to time-dependent constitutive relations. The Earth primarily acts at short time scales as an elastic body, but at long times it can flow or creep without obvious faulting. Strain rate is then proportional to applied stress. There are a number of simple models for viscoelastic materials.
Viscoelastic Rheological Models

1) Elastic material
$F = k\,u$, where $F$ is a force, $k$ is the elastic constant (or spring constant), and $u$ is the displacement. For a continuum, this would be $\tau = E\,e$, with $E$ an elastic modulus.

2) Viscous material
For example, this could represent the pulling of a plate through a fluid in a dashpot. Then $F = \nu\,\dot{u}$, where $F$ is the applied force, $\nu$ is the viscosity, and $\dot{u}$ is the particle velocity. For a continuum, this would be $\tau = \nu\,\dot{e}$. For this case, a constant stress results in a constant strain rate. Units of viscosity are in poise, where 1 poise = 1 dyne-s/cm$^2$. In SI units, viscosity is in Pascal-seconds, where 1 poise = 0.1 Pa-s.

3) Maxwell solid
This is a spring and dashpot in series. Then $\dot{e} = \dot{\tau}/E + \tau/\nu$. The material is elastic at short times and viscous at long times.

4) Voigt (Kelvin) solid
In this model, the spring and dashpot are in parallel. Then $\tau = E\,e + \nu\,\dot{e}$. In the frequency domain, this can be written as $\hat{\tau}(\omega) = E^{*}(\omega)\,\hat{e}(\omega)$, where $E^{*}(\omega)$ is a complex elastic modulus, and $\hat{\tau}(\omega)$ and $\hat{e}(\omega)$ are the Fourier transforms of $\tau(t)$ and $e(t)$.

In elastic wave problems, slight dissipation can be modeled using complex elastic constants, and the same equation as for the elastic case can be used. This is called the correspondence principle.

In the Earth, observations indicating a dissipation loss mechanism include:
- attenuation of seismic waves as measured by a "Q" value
- damping of the Earth's Chandler wobble
- the nonhydrostatic figure of the Earth
- uplift and subsidence of land masses (Fennoscandia, Hawaii)

For seismic wave propagation, linear elasticity works to an excellent degree away from the source region. Seismic attenuation (not including scattering) can also be included using a Q value related to the imaginary part of the elastic constant. Below, we will focus on linear elasticity with real elastic constants.

We consider, in the seismic wave propagation context, linear elasticity and infinitesimal strain. Assume a Hooke's law relation

$$\tau_{ij} = c_{ijkl}\, e_{kl}$$

where $c_{ijkl}$ are the elastic constants, which have $3^4$ or 81 components (each index going from 1 to 3).
We now apply various constraints on $c_{ijkl}$ to reduce the number of elastic constants:

1) From the symmetry of the stress $\tau_{ij}$ and the strain $e_{kl}$,
$$c_{ijkl} = c_{jikl}, \qquad c_{ijkl} = c_{ijlk}.$$
This reduces the number of independent elastic constants from 81 to 36.

2) From the existence of a strain energy function $W$, where $W$ is the internal energy per unit volume, energy considerations for an adiabatic reversible process give
$$\tau_{ij} = \frac{\partial W}{\partial e_{ij}}.$$
From linear elasticity, $\tau_{ij} = c_{ijkl} e_{kl}$, and $W$ can be written as a quadratic and homogeneous function of $e_{ij}$,
$$W = \tfrac{1}{2}\, c_{ijkl}\, e_{ij}\, e_{kl}.$$
By symmetry of the second derivatives,
$$\frac{\partial^2 W}{\partial e_{ij}\,\partial e_{kl}} = \frac{\partial^2 W}{\partial e_{kl}\,\partial e_{ij}}.$$
Thus $c_{ijkl} = c_{klij}$. The number of independent elastic constants then reduces from 36 to 21.

In Love's notation,
$$\tau_L = C_{LM}\, e_M$$
where $L = 1,\dots,6$ and $M = 1,\dots,6$, and the shear entries of $e_M$ are $2e_{23}$, $2e_{13}$, $2e_{12}$. The relation between $C_{LM}$ and $c_{ijkl}$ is given by

L    (ij)
1    11
2    22
3    33
4    23 or 32
5    13 or 31
6    12 or 21

For example, $C_{66} = c_{1212} = c_{2112} = c_{1221}$.

For various crystal symmetries, the 21 independent elastic constants can be progressively reduced in number, ultimately reaching 2 constants for a perfectly isotropic material.

Ex) A triclinic crystalline substance has 21 elastic constants.
Ex) A monoclinic crystal has symmetry with respect to one plane and has 13 independent elastic constants.
Ex) Orthorhombic symmetry has symmetry with respect to three planes and has 9 independent elastic constants.
Ex) Hexagonal symmetry has 5 independent elastic constants.

For example, "transverse isotropy" has symmetry in a plane perpendicular to the z axis. This is common in seismology, where it is related to stacks of thin layers with a vertical axis of symmetry. In general, hexagonal symmetry can have an arbitrary axis of symmetry and can result from crystal symmetry, a stack of thin layers in a sedimentary rock, or a crack network in a rock. The Love matrix for hexagonal symmetry (symmetry axis along $x_3$) is

$$
C_{LM} = \begin{pmatrix}
C_{11} & C_{12} & C_{13} & 0 & 0 & 0\\
C_{12} & C_{11} & C_{13} & 0 & 0 & 0\\
C_{13} & C_{13} & C_{33} & 0 & 0 & 0\\
0 & 0 & 0 & C_{44} & 0 & 0\\
0 & 0 & 0 & 0 & C_{44} & 0\\
0 & 0 & 0 & 0 & 0 & C_{66}
\end{pmatrix},
\qquad C_{66} = \tfrac{1}{2}\,(C_{11} - C_{12}).
$$

An excellent survey of wave propagation in general anisotropic materials is given by Auld (1990). Applications of seismic anisotropy in the Earth are given by Babuska and Cara (1991).
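The index bookkeeping above can be checked by brute force. The following is a minimal Python sketch, not part of the original lecture: the function names `love_index` and `count_independent` are hypothetical, and the code simply enumerates all index combinations of $c_{ijkl}$ and groups those that the symmetries identify with each other.

```python
from itertools import product

# Love index L for each index pair (i, j), 1-based, following the table above.
_PAIR_TO_L = {
    (1, 1): 1, (2, 2): 2, (3, 3): 3,
    (2, 3): 4, (3, 2): 4,
    (1, 3): 5, (3, 1): 5,
    (1, 2): 6, (2, 1): 6,
}


def love_index(i, j):
    """Map a tensor index pair (i, j) to the single Love index L."""
    return _PAIR_TO_L[(i, j)]


def count_independent(minor, major):
    """Count distinct components of c_ijkl under the chosen symmetries.

    minor: identify c_ijkl = c_jikl = c_ijlk (symmetric stress and strain).
    major: identify c_ijkl = c_klij (existence of a strain energy function).
    """
    classes = set()
    for i, j, k, l in product((1, 2, 3), repeat=4):
        first, second = (i, j), (k, l)
        if minor:
            # Collapse each symmetric pair to its Love index.
            first, second = (love_index(i, j),), (love_index(k, l),)
        if major:
            # Swapping the two pairs gives the same constant.
            first, second = sorted((first, second))
        classes.add(first + second)
    return len(classes)


if __name__ == "__main__":
    print(count_independent(minor=False, major=False))
    print(count_independent(minor=True, major=False))
    print(count_independent(minor=True, major=True))
```

Run as a script, the three counts correspond to the general case, the case with symmetric stress and strain, and the case that also assumes a strain energy function.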
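Proceeding in this way, we come to the case where the constants are invariant to an arbitrary rotation of the coordinate axes. This is called isotropy. Although no crystal has this symmetry, it is the most common one used in elasticity and seismology. It is appropriate for fine-grained materials whose grains have random orientations. The Love matrix for an isotropic material is

$$
C_{LM} = \begin{pmatrix}
\lambda + 2\mu & \lambda & \lambda & 0 & 0 & 0\\
\lambda & \lambda + 2\mu & \lambda & 0 & 0 & 0\\
\lambda & \lambda & \lambda + 2\mu & 0 & 0 & 0\\
0 & 0 & 0 & \mu & 0 & 0\\
0 & 0 & 0 & 0 & \mu & 0\\
0 & 0 & 0 & 0 & 0 & \mu
\end{pmatrix}.
$$

The Lamé constants for an isotropic material are $\mu = C_{44}$, which is called the shear modulus, and $\lambda = C_{12}$. Then $C_{11} = \lambda + 2\mu$.

In full index notation, the isotropic elastic constants can be written as
$$c_{ijkl} = \lambda\,\delta_{ij}\delta_{kl} + \mu\,(\delta_{ik}\delta_{jl} + \delta_{il}\delta_{jk})$$
where $\delta_{ij}$ is the Kronecker delta. Expressing the stress-strain relation for a linear elastic solid as $\tau_{ij} = c_{ijkl} e_{kl}$, then for an isotropic material
$$\tau_{ij} = \lambda\,\delta_{ij}\, e_{kk} + 2\mu\, e_{ij}.$$
For each component, this can be written as
$$\tau_{11} = (\lambda + 2\mu)\,e_{11} + \lambda\,e_{22} + \lambda\,e_{33}$$
$$\tau_{22} = \lambda\,e_{11} + (\lambda + 2\mu)\,e_{22} + \lambda\,e_{33}$$
$$\tau_{33} = \lambda\,e_{11} + \lambda\,e_{22} + (\lambda + 2\mu)\,e_{33}$$
$$\tau_{23} = 2\mu\,e_{23}, \qquad \tau_{13} = 2\mu\,e_{13}, \qquad \tau_{12} = 2\mu\,e_{12}.$$

In addition to $\lambda$ and $\mu$, other isotropic elastic constants are sometimes easier to measure in the lab.

1) Young's modulus
Consider a bar under uniaxial compression (or tension) along $x_1$, so that $\tau_{11} \neq 0$ and $\tau_{22} = \tau_{33} = 0$. For this case,
$$\tau_{11} = E\, e_{11}$$
where $E$ is called Young's modulus and is measured as the ratio of uniaxial stress to strain.

2) Poisson's ratio
Poisson's ratio $\nu$ is the ratio of the expansion in directions perpendicular to the applied stress to the contraction in the direction of the applied stress. Thus ($\nu$ is not viscosity here!)
$$\nu = -\frac{e_{22}}{e_{11}} = -\frac{\Delta L_2 / L_2}{\Delta L_1 / L_1}$$
where $L_2$ is a direction perpendicular to the direction of applied stress. Then $e_{22} = e_{33} = -\nu\, e_{11}$.

The two constants $E$ and $\nu$ can be used to describe the isotropic elastic properties in a similar manner as $\lambda$ and $\mu$, and are easier to measure.

Now, we want to relate $E, \nu$ to $\lambda, \mu$. For uniaxial compression, $\tau_{22} = \tau_{33} = 0$ and $e_{22} = e_{33}$. Then,
a) $\tau_{11} = (\lambda + 2\mu)\,e_{11} + 2\lambda\,e_{22} = E\,e_{11}$
b) $\tau_{22} = 0 = \lambda\,e_{11} + 2(\lambda + \mu)\,e_{22}$
From these, we can find relations between $E, \nu$ and $\lambda, \mu$ as
$$\nu = \frac{\lambda}{2(\lambda + \mu)}, \qquad E = \frac{\mu\,(3\lambda + 2\mu)}{\lambda + \mu}.$$

Now consider a bar under hydrostatic stress, where $\tau_{11} = \tau_{22} = \tau_{33} = -P$ (recall that pressure is positive in compression). Now, find the forces in the different coordinate directions and relate them to the strain in the $x_1$ direction.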
a) forces in $x_1$: $e_{11}^{(1)} = \tau_{11}/E = -P/E$
b) forces in $x_2$: $e_{11}^{(2)} = -\nu\,\tau_{22}/E = \nu P/E$
c) forces in $x_3$: $e_{11}^{(3)} = -\nu\,\tau_{33}/E = \nu P/E$
where $\nu$ is Poisson's ratio and $E$ is Young's modulus. Then the total change in length in the $x_1$ direction from all applied stresses is
$$e_{11} = -\frac{P}{E}\,(1 - 2\nu).$$
The change of volume can then be written as
$$\frac{\Delta V}{V} = e_{11} + e_{22} + e_{33} = -\frac{3P}{E}\,(1 - 2\nu).$$
Thus,
$$P = -K\,\frac{\Delta V}{V}, \qquad K = \frac{E}{3\,(1 - 2\nu)}.$$
$K$ is called the bulk modulus, relating an applied hydrostatic pressure to a change of volume.

Under an applied shear stress, say $\tau_{12}$, with all other stresses being zero, then
$$\tau_{12} = 2\mu\, e_{12}.$$
Thus, the shear modulus is related to the ratio of shear stress to shear strain. Note that, with our definitions, there is also a factor of 2.

We can relate $K$ to $\lambda, \mu$ as $K = \lambda + \tfrac{2}{3}\mu$. Thus, we could also use $K, \mu$ as the independent elastic parameters for an isotropic material.

The coefficients $\lambda$ and $\mu$ are related to the more commonly measured elastic coefficients by
- $\mu$ = $G$ = shear modulus
- $E$ = Young's modulus = $\mu\,(3\lambda + 2\mu)/(\lambda + \mu)$
- $\nu$ = Poisson's ratio = $\lambda / (2(\lambda + \mu))$
- $K$ = bulk modulus = $\lambda + \tfrac{2}{3}\mu$
A complete set of relations for an isotropic medium is given by Stein and Wysession (2003).

Poisson's ratio is a very important diagnostic property of an isotropic elastic material. It can vary from -1 to +0.5. For a perfectly rigid material, $\nu = 0$. For an incompressible fluid, $\nu = 0.5$ and $\mu = 0$.

Under uniaxial compression, a material with a Poisson's ratio of zero won't come out on the sides. An example of this type of material is cork, which is used as a bottle stopper. A material with a negative Poisson's ratio would come in on the sides under uniaxial compression. But it is important to remember that Poisson's ratio is an isotropic and not an anisotropic concept.

The solid part of the Earth (crust to mantle) has a Poisson's ratio that varies between 0.25 and 0.30. For a Poisson solid, we choose $\lambda = \mu$, and for this case $\nu = 0.25$. This relation is assumed in a great number of seismic studies of the solid parts of the Earth. In the Earth's liquid outer core, $\mu = 0$ and $\nu = 0.5$. In the inner core, which is in a solid state, $\nu \approx 0.40$-$0.45$. That is, the inner core can support shear, but is quite different from the mantle material.
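The relations among the isotropic constants can be exercised numerically. The sketch below is illustrative only: the function names `moduli_from_lame` and `lame_from_young_poisson` are hypothetical, the formulas are the standard isotropic conversions between $(\lambda, \mu)$ and $(E, \nu, K)$, and the input values $\lambda = \mu = 1$ are arbitrary units chosen to represent a Poisson solid.

```python
def moduli_from_lame(lam, mu):
    """Return (E, nu, K) for an isotropic solid from the Lame constants.

    E  = mu (3 lam + 2 mu) / (lam + mu)   Young's modulus
    nu = lam / (2 (lam + mu))             Poisson's ratio
    K  = lam + 2 mu / 3                   bulk modulus
    """
    young = mu * (3.0 * lam + 2.0 * mu) / (lam + mu)
    poisson = lam / (2.0 * (lam + mu))
    bulk = lam + 2.0 * mu / 3.0
    return young, poisson, bulk


def lame_from_young_poisson(young, poisson):
    """Inverse conversion: return (lam, mu) from Young's modulus and Poisson's ratio."""
    lam = young * poisson / ((1.0 + poisson) * (1.0 - 2.0 * poisson))
    mu = young / (2.0 * (1.0 + poisson))
    return lam, mu


if __name__ == "__main__":
    # Poisson solid: lam == mu (arbitrary units, illustrative values only).
    young, poisson, bulk = moduli_from_lame(1.0, 1.0)
    print("E =", young, "nu =", poisson, "K =", bulk)
    # Fluid limit: mu = 0 gives nu = 0.5 and E = 0.
    print(moduli_from_lame(2.0, 0.0))
```

The Poisson-solid case reproduces $\nu = 0.25$, and the bulk modulus from $\lambda + \tfrac{2}{3}\mu$ agrees with $E / (3(1 - 2\nu))$, which is a quick consistency check on the two routes to $K$.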
0 comments
[Repost] Data Showing Non-invasive Fetal Karyotype
genesquared 2013-1-30 15:55
http://www.ariosadx.com/about-the-science/Ariosa-Paper-Selective-Analysis-PRENATAL-DIAGNOSIS-2012-1.pdf
http://www.sciencedaily.com/releases/2012/06/120605155950.htm
===================
Verinata Health Publishes Proof-of-Concept Data Showing Non-invasive Fetal Karyotype Equivalent to Invasive Procedures
http://www.verinata.com/verinata-healths-non-invasive-prenatal-testing-position-statement/
http://www.verinata.com/news/verinata-health-publishes-proof-of-concept-data-showing-non-invasive-fetal-karyotype-equivalent-to-invasive-procedures/
2013 Article: Noninvasive Detection of Fetal Subchromosome Abnormalities via Deep Sequencing of Maternal Plasma
Anupama Srinivasan (1), Diana W. Bianchi (2), Hui Huang (1), Amy J. Sehnert (1), Richard P. Rava (1)
(1) Verinata Health, Inc., Redwood City, CA 94063, USA
(2) Mother Infant Research Institute at Tufts Medical Center and Tufts University School of Medicine, Boston, MA 02111, USA
============
Illumina Acquires Verinata for $450 Million
http://www.burrillreport.com/article-illumina_acquires_verinata_for_450_million.html
Category: prenatal | 0 comments
PMP Study Series, Part 2
windlight 2012-12-12 17:40
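Zhao Fengguang. Reflections on PMP Study, Part 1. A digression for today: as a discipline, PMP has its own code of ethics as well as its own commercially driven promotion. When the two meet, the result is often self-contradiction, like attacking your own shield with your own spear. Right after the exam on the 8th, Weibo and the chat groups filled with posts suspecting that a certain training provider had leaked exam questions, and basically everyone wanted to
1 view | 0 comments
[Repost] miR146A SNP Rs2910164 & Cancer Risk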
genesquared 2012-11-27 09:35
http://snpedia.com/index.php/Rs2910164

rs2910164 is a SNP mentioned by: dbSNP, PheGenI, nextbio, hapmap, 1000 genomes, hgdp, ensembl, gopubmed, geneview, scholar, google, pharmgkb, gwascentral, openSNP, 23andMe, SNP Nexus, SNPshot, SNPdbe, MSV3d.

Gene: MIR146A
Chromosome: 5
Orientation: plus
GMAF: 0.3814
Position: 159912418
Reference: GRCh37 37.1/131
Max Magnitude: 2.5

Geno     Mag    Summary
(C;C)    2.5    higher risk cancer
(C;G)           higher/earlier cancer likelihood??
(G;G)    0      normal

(C;C) predisposes to papillary thyroid carcinoma.

Among 42 patients with familial breast cancer and 82 patients with ovarian cancer, those with at least one rs2910164(C) SNP tended to have been diagnosed at an earlier age than those with only (G) alleles. For the breast cancer patients, the difference in median age was 45 vs 56 (p = 0.029), and for ovarian cancer patients, 45 vs 50 (p = 0.014).

(G;G) males were two-fold more susceptible to hepatocellular carcinoma (OR = 2.016, 95% CI = 1.056-3.848, P = 0.034); perhaps an ambiguous flip?

SNP near microRNA: ACC MI0000477, ID hsa-mir-146a, offset -10

A Functional Genetic Variant in microRNA-196a2 Is Associated with Increased Susceptibility of Lung Cancer in Chinese.
Functional variant in microRNA-196a2 contributes to the susceptibility of congenital heart disease in a Chinese population OMIM 610566 Desc MICRO RNA 146A; MIRN146A Variant Related also Evaluation of SNPs in miR-146a, miR196a2 and miR-499 as low-penetrance alleles in German and Italian familial breast cancer cases A functional polymorphism in Pre-miR-146a gene is associated with prostate cancer risk and mature miR-146a expression in vivo Analyses of polymorphisms in the inflammasome-associated NLRP3 and miRNA-146A genes in the susceptibility to and tubal pathology of Chlamydia trachomatis infection Common genetic polymorphisms in pre-microRNAs were associated with increased risk of dilated cardiomyopathy The Role of microRNA-146a (miR-146a) and its Target IL-1R-Associated Kinase (IRAK1) in Psoriatic Arthritis Susceptibility Genetic variants in selected pre-microRNA genes and the risk of squamous cell carcinoma of the head and neck The association between two polymorphisms in pre-miRNAs and breast cancer risk: a meta-analysis A functional varient in microRNA-146a is associated with risk of esophageal squamous cell carcinoma in Chinese Han Genetic variation in microRNA genes and prostate cancer risk in North Indian population Association Study of Common Genetic Variants in Pre-microRNAs in Patients with Ulcerative Colitis A polymorphism in the 3'-UTR of interleukin-1 receptor-associated kinase (IRAK1), a target gene of miR-146a, is associated with rheumatoid arthritis susceptibility Association Between Common Genetic Variants in Pre-microRNAs and Gastric Cancer Risk in Japanese Population Genetic study of two single nucleotide polymorphisms within corresponding microRNAs and susceptibility to tuberculosis in a Chinese Tibetan and Han population The rs2910164:GC SNP in the MIR146A gene is not associated with breast cancer risk in BRCA1 and BRCA2 mutation carriers Effects of Common Polymorphisms rs11614913 in miR-196a2 and rs2910164 in miR-146a on Cancer 
Susceptibility: A Meta-Analysis A functional polymorphism in the pre-miR-146a gene is associated with risk and prognosis in adult glioma No association of pre-microRNA-146a rs2910164 polymorphism and risk of hepatocellular carcinoma development in Turkish population: A case-control study Association Between Two Genetic Variants in miRNA and Primary Liver Cancer Risk in the Chinese Population Genetic association of miRNA-146a with systemic lupus erythematosus in Europeans through decreased expression of the gene Thyroid cancer susceptibility polymorphisms: confirmation of loci on chromosomes 9q22 and 14q13, validation of a recessive 8q24 locus and failure to replicate a locus on 5q24 Association between the rs2910164 polymorphism in pre-mir-146a and oral carcinoma progression. A Genetic Variant in miR-196a2 Increased Digestive System Cancer Risks: A Meta-Analysis of 15 Case-Control Studies Differential Expression Profile and Genetic Variants of MicroRNAs Sequences in Breast Cancer Patients Increased Risk of Breast Cancer Associated with CC Genotype of Has-miR-146a Rs2910164 Polymorphism in Europeans Lack of Association of miR-146a rs2910164 Polymorphism with Gastrointestinal Cancers: Evidence from 10206 Subjects MiR-146a polymorphism is associated with asthma but not with systemic lupus erythematosus and juvenile rheumatoid arthritis in Mexican patients Genetic variants of miRNA sequences and non-small cell lung cancer survival. Common genetic variants in pre-microRNAs were associated with increased risk of breast cancer in Chinese women. Single nucleotide polymorphisms of microRNA machinery genes modify the risk of renal cell carcinoma. Genetic variations in microRNA-related genes are novel susceptibility loci for esophageal cancer risk. Polymorphic mature microRNAs from passenger strand of pre-miR-146a contribute to thyroid cancer. Signatures of purifying and local positive selection in human miRNAs. 
MicroRNA polymorphisms: the future of pharmacogenomics, molecular epidemiology and individualized medicine. Comprehensive analysis of the impact of SNPs and CNVs on human microRNAs and their regulatory genes. SNPs in human miRNA genes affect biogenesis and function. Common genetic variants in pre-microRNAs are associated with risk of coal workers' pneumoconiosis. MicroRNA polymorphisms: a giant leap towards personalized medicine. Common genetic variants in pre-microRNAs and risk of gallbladder cancer in North Indian population. Combined effect of miR-146a rs2910164 G/C polymorphism and Toll-like receptor 4 +3725 G/C polymorphism on the risk of severe gastric atrophy in Japanese. Association between hsa-mir-146a genotype and tumor age-of-onset in BRCA1/BRCA2-negative familial breast and ovarian cancer patients. Association of pre-microRNAs genetic variants with susceptibility in systemic lupus erythematosus. Association study of single nucleotide polymorphisms in pre-miRNA and rheumatoid arthritis in a Han Chinese population. Common genetic polymorphisms in pre-microRNAs and risk of cervical squamous cell carcinoma. Investigative role of pre-microRNAs in bladder cancer patients: a case-control study in North India. Polymorphism of the pre-miR-146a is associated with risk of cervical cancer in a Chinese population. Association between single-nucleotide polymorphisms in pre-miRNAs and the risk of asthma in a Chinese population. Association between two single nucleotide polymorphisms at corresponding microRNA and schizophrenia in a Chinese population. Expression and genetic analysis of miRNAs involved in CD4+ cell activation in patients with multiple sclerosis. Genetic variants in miR-146a, miR-149, miR-196a2, miR-499 and their influence on relative expression in lung cancers. Has-miR-146a polymorphism (rs2910164) and cancer risk: a meta-analysis of 19 case-control studies. 
Association of polymorphisms in pre-miRNA with inflammatory biomarkers in rheumatoid arthritis in the Chinese Han population. Meta-analysis confirms that a common G/C variant in the pre-miR-146a gene contributes to cancer susceptibility and that ethnicity, gender and smoking status are risk factors Evaluation of SNPs in miR-196-a2, miR-27a and miR-146a as risk factors of colorectal cancer Association of pre-miRNA-146a rs2910164 and pre‑miRNA-499 rs3746444 polymorphisms and susceptibility to rheumatoid arthritis
Category: miR196A2 | 0 comments
[Repost] 3-Gene Model to Robustly Identify Breast Cancer Subtypes
genesquared 2012-11-12 10:06
http://jnci.oxfordjournals.org/content/104/4/311.long

A Three-Gene Model to Robustly Identify Breast Cancer Molecular Subtypes

Benjamin Haibe-Kains, Christine Desmedt, Sherene Loi, Aedin C. Culhane, Gianluca Bontempi, John Quackenbush and Christos Sotiriou

Affiliations of authors: Department of Biostatistics and Computational Biology (BH-K, ACC, JQ) and Department of Cancer Biology (JQ), Dana-Farber Cancer Institute, Boston, MA; Department of Biostatistics, Harvard School of Public Health, Boston, MA (BH-K, ACC, JQ); Breast Cancer Translational Research Laboratory J.C. Heuson, Medical Oncology Department, Jules Bordet Institute, Université Libre de Bruxelles, Brussels, Belgium (CD, SL, CS); Machine Learning Group, Computer Science Department, Université Libre de Bruxelles, Brussels, Belgium (GB)

Correspondence to: Benjamin Haibe-Kains, PhD, Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA 02115 (e-mail: bhaibeka@jimmy.harvard.edu).

Received June 23, 2011. Revision received December 13, 2011. Accepted December 14, 2011.

Abstract

Background Single sample predictors (SSPs) and subtype classification models (SCMs) are gene expression-based classifiers used to identify the four primary molecular subtypes of breast cancer (basal-like, HER2-enriched, luminal A, and luminal B). SSPs use hierarchical clustering, followed by nearest centroid classification, based on large sets of tumor-intrinsic genes. SCMs use a mixture of Gaussian distributions based on sets of genes with expression specifically correlated with three key breast cancer genes (estrogen receptor [ER], HER2, and aurora kinase A [AURKA]). The aim of this study was to compare the robustness, classification concordance, and prognostic value of these classifiers with those of a simplified three-gene SCM in a large compendium of microarray datasets.
Methods Thirty-six publicly available breast cancer datasets (n = 5715) were subjected to molecular subtyping using five published classifiers (three SSPs and two SCMs) and SCMGENE, the new three-gene (ER, HER2, and AURKA) SCM. We used the prediction strength statistic to estimate robustness of the classification models, defined as the capacity of a classifier to assign the same tumors to the same subtypes independently of the dataset used to fit it. We used Cohen κ and Cramer V coefficients to assess concordance between the subtype classifiers and association with clinical variables, respectively. We used Kaplan–Meier survival curves and cross-validated partial likelihood to compare prognostic value of the resulting classifications. All statistical tests were two-sided. Results SCMs were statistically significantly more robust than SSPs, with SCMGENE being the most robust because of its simplicity. SCMGENE was statistically significantly concordant with published SCMs (κ = 0.65–0.70) and SSPs (κ = 0.34–0.59), statistically significantly associated with ER (V = 0.64), HER2 (V = 0.52) status, and histological grade (V = 0.55), and yielded similar strong prognostic value. Conclusion Our results suggest that adequate classification of the major and clinically relevant molecular subtypes of breast cancer can be robustly achieved with quantitative measurements of three key genes. CONTEXTS AND CAVEATS Prior knowledge Single sample predictors (SSPs) are molecular classification models that use large sets of genes expressed in different tumors to classify different subtypes of breast cancer. Subtype classification models (SCMs) are based on groups of genes specifically correlated with three key breast cancer genes, estrogen receptor (ER), HER2, and aurora kinase A (AURKA). Both types of models use large numbers of genes. However, the robustness and prognostic value of these classifiers have not been compared with simplified models containing fewer genes. 
Study design A simplified SCM (SCMGENE) containing only ER, HER2, and AURKA was compared with three SSPs and two SCMs using data from 36 gene expression datasets in public databases. The models were compared with respect to concordance among themselves as well as association with clinical variables and disease-free survival. Contribution Among the SCMs, SCMGENE with only three genes was statistically more robust than SSPs and as robust and yielded similar prognostic value compared with the published SCMs that use large numbers of genes. Implications Adding more genes to a classification model may not improve the ability to discriminate among breast cancer subtypes. In addition, the complexity of multiple-gene classification models may limit their usefulness and translation into clinic. Limitations The datasets used were retrospectively accrued; therefore, the selection of patients may have resulted in unbalanced distribution of the different molecular subtypes. The gene expression datasets taken from public databases and websites were not renormalized. Software limitations precluded checking or correction for departure from proportional hazards assumptions. From the Editors ============ J Natl Cancer Inst. 2012 Feb 22;104(4):262-3. Epub 2012 Jan 18. Gene signatures revisited. Baker SG. Comment on A three-gene model to robustly identify breast cancer molecular subtypes. PMID: 22262869 PMCID: PMC3283539
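The concordance statistic named in the Methods above, Cohen's κ, compares the observed agreement between two classifiers with the agreement expected by chance from their label frequencies. The following is a minimal Python sketch of that definition, not the authors' code: the function name `cohen_kappa` is hypothetical, and the subtype labels in the example are made-up toy data, not values from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels.

    kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected
    is the chance agreement implied by each rater's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Counter returns 0 for labels that one classifier never used.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical label")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy subtype calls from two hypothetical classifiers (illustrative only).
    calls_1 = ["basal", "HER2", "lumA", "lumB", "lumA", "basal"]
    calls_2 = ["basal", "HER2", "lumA", "lumA", "lumA", "basal"]
    print(cohen_kappa(calls_1, calls_2))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which is the scale on which the reported κ ranges (0.65 to 0.70 against published SCMs, 0.34 to 0.59 against SSPs) should be read.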
Category: breastCancer | 1 view | 0 comments
[Repost] MIT Computational Biology group - Papers
Popularity 1 genesquared 2012-9-30 19:49
MIT Computational Biology group - Papers published

Journal Publications

79. Computational analysis of noncoding RNAs
Washietl, Will, Hendrix, Goff, Rinn, Berger, Kellis.
Noncoding RNAs have emerged as important key players in the cell. Understanding their surprisingly diverse range of functions is challenging for experimental and computational biology. Here, we review computational methods to analyze noncoding RNAs. The topics covered include basic and advanced techniques to predict RNA structures, annotation of noncoding RNAs in genomic data, mining RNA-seq data for novel transcripts and prediction of transcript structures, computational aspects of microRNAs, and database resources.
WIREs RNA 2012. doi: 10.1002/wrna.1134. Wiley Reviews RNA, Sep 18, 2012

78. High depth, whole-genome sequencing of cholera isolates from Haiti and the Dominican Republic
Sealfon, Gire, Ellis, Calderwood, Qadri, Hensley, Kellis, Ryan, Larocque, Harris, Sabeti
Whole-genome sequencing is an important tool for understanding microbial evolution and identifying the emergence of functionally important variants over the course of epidemics. In October 2010, a severe cholera epidemic began in Haiti, with additional cases identified in the neighboring Dominican Republic. We used whole-genome approaches to sequence four Vibrio cholerae isolates from Haiti and the Dominican Republic and three additional V. cholerae isolates to a high depth of coverage (2000x); four of the seven isolates were previously sequenced.
Using these sequence data, we examined the effect of depth of coverage and sequencing platform on genome assembly and identification of sequence variants. We found that 50x coverage is sufficient to construct a whole-genome assembly and to accurately call most variants from 100 base pair paired-end sequencing reads. Phylogenetic analysis between the newly sequenced and thirty-three previously sequenced V. cholerae isolates indicates that the Haitian and Dominican Republic isolates are closest to strains from South Asia. The Haitian and Dominican Republic isolates form a tight cluster, with only four variants unique to individual isolates. These variants are located in the CTX region, the SXT region, and the core genome. Of the 126 mutations identified that separate the Haiti-Dominican Republic cluster from the V. cholerae reference strain (N16961), 73 are non-synonymous changes, and a number of these changes cluster in specific genes and pathways. Sequence variant analyses of V. cholerae isolates, including multiple isolates from the Haitian outbreak, identify coverage-specific and technology-specific effects on variant detection, and provide insight into genomic change and functional evolution during an epidemic. BMC Genomics 13:468, Sep 11, 2012 77. Evidence of Abundant Purifying Selection in Humans for Recently Acquired Regulatory Functions Ward, Kellis Although only 5% of the human genome is conserved across mammals, a substantially larger portion is biochemically active, raising the question of whether the additional elements evolve neutrally or confer a lineage-specific fitness advantage. To address this question, we integrate human variation information from the 1000 Genomes Project and activity data from the ENCODE Project. A broad range of transcribed and regulatory nonconserved elements show decreased human diversity, suggesting lineage-specific purifying selection. 
Conversely, conserved elements lacking activity show increased human diversity, suggesting that some recently became nonfunctional. Regulatory elements under human constraint in nonconserved regions were found near color vision and nerve-growth genes, consistent with purifying selection for recently evolved functions. Our results suggest continued turnover in regulatory regions, with at least an additional 4% of the human genome subject to lineage-specific constraint. Science doi:10.1126/science.1225057, Sep 5 2012 76. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia Landt, Marinov, Kundaje, Kheradpour, Pauli, Batzoglou, Bernstein, Bickel, Brown, Cayting, Chen, Desalvo, Epstein, Fisher-Aylor, Euskirchen, Gerstein, Gertz, Hartemink, Hoffman, Iyer, Jung, Karmakar, Kellis, Kharchenko, Li, Liu, Liu, Ma, Milosavljevic, Myers, Park, Pazin, Perry, Raha, Reddy, Rozowsky, Shoresh, Sidow, Slattery, Stamatoyannopoulos, Tolstorukov, White, Xi, Farnham, Lieb, Wold, Snyder Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. 
All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http://encodeproject.org/ENCODE/) and modENCODE (http://www.modencode.org/) portals Genome Research 22(9):1813-31, Sep 2012 75. GENCODE: The reference human genome annotation for The ENCODE Project Harrow, Frankish, Gonzalez, Tapanari, Diekhans, Kokocinski, Aken, Barrell, Zadissa, Searle, Barnes, Bignell, Boychenko, Hunt, Kay, Mukherjee, Rajan, Despacio-Reyes, Saunders, Steward, Harte, Lin, Howald, Tanzer, Derrien, Chrast, Walters, Balasubramanian, Pei, Tress, Rodriguez, Ezkurdia, van Baren, Brent, Haussler, Kellis, Valencia, Reymond, Gerstein, Guigo, Hubbard The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. 
GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers. Genome Research 22:1760-74, Sep 2012. 74. An integrated encyclopedia of DNA elements in the human genome ENCODE Project Consortium The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research. Nature 489:57-74. Sep 6, 2012 73. Analysis of variation at transcription factor binding sites in Drosophila and humans Spivakov, Akhtar, Kheradpour, Beal, Girardot, Koscielny, Herrero, Kellis, Furlong, Birney Advances in sequencing technology have boosted population genomics and made it possible to map the positions of transcription factor binding sites (TFBSs) with high precision. Here we investigate TFBS variability by combining transcription factor binding maps generated by ENCODE, modENCODE, our previously published data and other sources with genomic variation data for human individuals and Drosophila isogenic lines. 
We introduce a metric of TFBS variability that takes into account changes in motif match associated with mutation and makes it possible to investigate TFBS functional constraints instance-by-instance as well as in sets that share common biological properties. We also take advantage of the emerging per-individual transcription factor binding data to show evidence that TFBS mutations, particularly at evolutionarily conserved sites, can be efficiently buffered to ensure coherent levels of transcription factor binding. Our analyses provide insights into the relationship between individual and interspecies variation and show evidence for the functional buffering of TFBS mutations in both humans and flies. In a broad perspective, these results demonstrate the potential of combining functional genomics and population genetics approaches for understanding gene regulation. Genome Biology 13:R49, Sep 5, 2012 72. TreeFix: Statistically Informed Gene Tree Error Correction using Species Trees Wu, Rasmussen, Bansal, Kellis. Accurate gene tree reconstruction is a fundamental problem in phylogenetics, with many important applications. However, sequence data alone often lack enough information to confidently support one gene tree topology over many competing alternatives. Here, we present a novel framework for combining sequence data and species tree information, and we describe an implementation of this framework in TreeFix, a new phylogenetic program for improving gene tree reconstructions. Given a gene tree (preferably computed using a maximum likelihood phylogenetic program), TreeFix finds a "statistically equivalent" gene tree that minimizes a species tree based cost function. We have applied TreeFix to two clades of 12 Drosophila and 16 fungal genomes, as well as to simulated phylogenies, and show that it dramatically improves reconstructions compared to current state-of-the-art programs. 
Given its accuracy, speed, and simplicity, TreeFix should be applicable to a wide range of analyses and have many important implications for future investigations of gene evolution. The source code and a sample dataset are available at http://compbio.mit.edu/treefix Systematic Biology, Sep 4, 2012 71. Wisdom of crowds for robust gene network inference Marbach, Costello, Kuffner, Vega, Prill, Camacho, Allison, DREAM5 Consortium, Kellis, Collins, Stolovitzky Reconstructing gene regulatory networks from high-throughput data is a long-standing challenge. Through the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project, we performed a comprehensive blind assessment of over 30 network inference methods on Escherichia coli, Staphylococcus aureus, Saccharomyces cerevisiae and in silico microarray data. We characterize the performance, data requirements and inherent biases of different inference approaches, and we provide guidelines for algorithm application and development. We observed that no single inference method performs optimally across all data sets. In contrast, integration of predictions from multiple inference methods shows robust and high performance across diverse data sets. We thereby constructed high-confidence networks for E. coli and S. aureus, each comprising ~1,700 transcriptional interactions at a precision of ~50%. We experimentally tested 53 previously unobserved regulatory interactions in E. coli, of which 23 (43%) were supported. Our results establish community-based methods as a powerful and robust tool for the inference of transcriptional gene regulatory networks. Nature Methods 2012, doi:10.1038/nmeth.2016, 15 July 2012 70. 
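The Marbach et al. abstract above reports that integrating predictions from multiple inference methods is more robust than any single method. The abstract does not say how the integration is done, so the following is only a minimal sketch that assumes a simple average-rank rule; the function names, the toy edge lists (`TF1`, `g1`, and so on) and the handling of unreported edges are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def average_rank_consensus(rankings):
    """Combine ranked edge lists from several methods by mean rank.

    rankings: list of lists; each inner list holds (regulator, target) edges
    ordered from most to least confident. An edge that a method did not
    report gets the rank just past the end of that method's list
    (an assumption of this sketch).
    """
    all_edges = {edge for ranking in rankings for edge in ranking}
    total_rank = defaultdict(float)
    for ranking in rankings:
        position = {edge: i + 1 for i, edge in enumerate(ranking)}
        missing_rank = len(ranking) + 1
        for edge in all_edges:
            total_rank[edge] += position.get(edge, missing_rank)
    n_methods = len(rankings)
    # Lower mean rank means stronger consensus; ties are broken by edge name.
    return sorted(all_edges, key=lambda e: (total_rank[e] / n_methods, e))


def supported_fraction(num_supported, num_tested):
    """Fraction of experimentally tested predictions that were supported."""
    return num_supported / num_tested


# Toy, made-up predictions from three hypothetical methods.
method_a = [("TF1", "g1"), ("TF2", "g2"), ("TF3", "g3")]
method_b = [("TF2", "g2"), ("TF1", "g1"), ("TF4", "g4")]
method_c = [("TF1", "g1"), ("TF3", "g3"), ("TF2", "g2")]
consensus = average_rank_consensus([method_a, method_b, method_c])

# The abstract's validation figure: 23 of 53 tested interactions supported.
validation_rate = supported_fraction(23, 53)
```

In this toy input the edge ranked highly by all three methods comes first and the edge reported by only one method comes last, which is the behaviour a rank-based consensus is meant to show. The last line simply re-derives the 43% figure quoted in the abstract.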
Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss Bansal, Alm, Kellis Gene family evolution is driven by evolutionary events such as speciation, gene duplication, horizontal gene transfer and gene loss, and inferring these events in the evolutionary history of a given gene family is a fundamental problem in comparative and evolutionary genomics with numerous important applications. Solving this problem requires the use of a reconciliation framework, where the input consists of a gene family phylogeny and the corresponding species phylogeny, and the goal is to reconcile the two by postulating speciation, gene duplication, horizontal gene transfer and gene loss events. This reconciliation problem is referred to as duplication-transfer-loss (DTL) reconciliation and has been extensively studied in the literature. Yet, even the fastest existing algorithms for DTL reconciliation are too slow for reconciling large gene families and for use in more sophisticated applications such as gene tree or species tree reconstruction. Here, we present two new algorithms for the DTL reconciliation problem that are dramatically faster than existing algorithms, both asymptotically and in practice. We also extend the standard DTL reconciliation model by considering distance-dependent transfer costs, which allow for more accurate reconciliation and give an efficient algorithm for DTL reconciliation under this extended model. We implemented our new algorithms and demonstrated up to 100 000-fold speed-up over existing methods, using both simulated and biological datasets. This dramatic improvement makes it possible to use DTL reconciliation for performing rigorous evolutionary analyses of large gene families and enables its use in advanced reconciliation-based gene and species tree reconstruction methods. Our programs can be freely downloaded from http://compbio.mit.edu/ranger-dtl/. Bioinformatics 28:i283-i291 and ISMB, Jun 15, 2012 69. 
Common variants at 9p21 and 8q22 are associated with increased susceptibility to optic nerve degeneration in glaucoma Wiggs, Yaspan, Hauser, Kang, Allingham, Olson, Abdrabou, Fan, Wang, Brodeur, Budenz, Caprioli, Crenshaw, Crooks, Delbono, Doheny, Friedman, Gaasterland, Gaasterland, Laurie, Lee, Lichter, Loomis, Liu, Medeiros, McCarty, Mirel, Moroi, Musch, Realini, Rozsa, Schuman, Scott, Singh, Stein, Trager, Vanveldhuisen, Vollrath, Wollstein, Yoneyama, Zhang, Weinreb, Ernst, Kellis, Masuda, Zack, Richards, Pericak-Vance, Pasquale, Haines Optic nerve degeneration caused by glaucoma is a leading cause of blindness worldwide. Patients affected by the normal-pressure form of glaucoma are more likely to harbor risk alleles for glaucoma-related optic nerve disease. We have performed a meta-analysis of two independent genome-wide association studies for primary open angle glaucoma (POAG) followed by a normal-pressure glaucoma (NPG, defined by intraocular pressure (IOP) less than 22 mmHg) subgroup analysis. The single-nucleotide polymorphisms that showed the most significant associations were tested for association with a second form of glaucoma, exfoliation-syndrome glaucoma. The overall meta-analysis of the GLAUGEN and NEIGHBOR dataset results (3,146 cases and 3,487 controls) identified significant associations between two loci and POAG: the CDKN2BAS region on 9p21 (rs2157719, OR=0.69, p=1.86×10^-18), and the SIX1/SIX6 region on chromosome 14q23 (rs10483727, OR=1.32, p=3.87×10^-11). In sub-group analysis two loci were significantly associated with NPG: 9p21 containing the CDKN2BAS gene (rs2157719, OR=0.58, p=1.17×10^-12) and a probable regulatory region on 8q22 (rs284489, OR=0.62, p=8.88×10^-10).
Both NPG loci were also nominally associated with a second type of glaucoma, exfoliation syndrome glaucoma (rs2157719, OR=0.59, p=0.004 and rs284489, OR=0.76, p=0.021), suggesting that these loci might contribute more generally to optic nerve degeneration in glaucoma. Because both loci influence transforming growth factor beta (TGF-beta) signaling, we performed a genomic pathway analysis that showed an association between the TGF-beta pathway and NPG (permuted p=0.009). These results suggest that neuro-protective therapies targeting TGF-beta signaling could be effective for multiple forms of glaucoma. PLoS Genetics 8(4):e1002654, April 26, 2012. 68. Predictive regulatory models in Drosophila melanogaster by integrative inference of transcriptional networks Marbach, Roy, Ay, Meyer, Candeias, Kahveci, Bristow, Kellis Gaining insights on gene regulation from large-scale functional datasets is a grand challenge in systems biology. In this paper, we develop and apply methods for transcriptional regulatory network inference from diverse functional genomics datasets, and demonstrate their value for gene function and gene expression prediction. We formulate the network inference problem in a machine learning framework and use both supervised and unsupervised methods to predict regulatory edges by integrating transcription factor (TF) binding, evolutionarily-conserved sequence motifs, gene expression, and chromatin modification datasets as input features. Applying these methods to Drosophila melanogaster, we predict ~300k regulatory edges in a network of ~600 TFs and 12k target genes. We validate our predictions using known regulatory interactions, gene functional annotations, tissue-specific expression, protein-protein interactions, and three-dimensional maps of chromosome conformation.
We use the inferred network to identify putative functions for hundreds of previously-uncharacterized genes, including many in nervous system development, which are independently confirmed based on their tissue-specific expression patterns. Lastly, we use the regulatory network to predict target gene expression levels as a function of TF expression, and find significantly higher predictive power for integrative networks than for motif or ChIP-based networks. Our work reveals the complementarity between physical evidence of regulatory interactions (TF binding, motif conservation) and functional evidence (coordinated expression or chromatin patterns), and demonstrates the power of data integration for network inference and studies of gene regulation at the systems level. Genome Research, Mar 28, 2012. 67. ChromHMM: automating chromatin-state discovery and characterization Ernst, Kellis Chromatin-state annotation using combinations of chromatin modification patterns has emerged as a powerful approach for discovering regulatory regions and their cell type-specific activity patterns and for interpreting disease-association studies. However, the computational challenge of learning chromatin-state models from large numbers of chromatin modification datasets in multiple cell types still requires extensive bioinformatics expertise. To address this challenge, we developed ChromHMM, an automated computational system for learning chromatin states, characterizing their biological functions and correlations with large-scale functional datasets and visualizing the resulting genome-wide maps of chromatin-state annotations. Nature Methods 9:215-6, Feb 28, 2012. 66. 
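The ChromHMM abstract above describes learning chromatin states from combinations of chromatin modification patterns. As a rough illustration of the decoding step only, here is a minimal sketch that assumes each genomic bin is summarized as a vector of present/absent marks and that each hidden state emits every mark independently with its own probability. This is not ChromHMM's code (which is a separate program), and every parameter value and name below is made up for the example.

```python
import math


def log_emission(observed, probs):
    """Log-probability of a binary mark vector under independent Bernoullis."""
    return sum(math.log(p if x else 1.0 - p) for x, p in zip(observed, probs))


def viterbi(observations, start, trans, emit):
    """Most likely hidden-state path for a sequence of binary mark vectors.

    start[s]     : probability of starting in state s
    trans[r][s]  : probability of moving from state r to state s
    emit[s][m]   : probability that mark m is present in state s
    """
    n_states = len(start)
    score = [math.log(start[s]) + log_emission(observations[0], emit[s])
             for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s])
                         + log_emission(obs, emit[s]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical two-state, two-mark model: state 0 is mark-rich, state 1 mark-poor.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.8], [0.1, 0.1]]
bins = [(1, 1), (1, 1), (0, 0), (0, 0)]
states = viterbi(bins, start, trans, emit)
```

With these toy parameters the first two bins, where both marks are present, are assigned to the mark-rich state and the last two to the mark-poor state. Learning the parameters from data, which the abstract emphasizes, is a separate step not shown here.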
Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay Melnikov, Murugan, Zhang, Tesileanu, Wang, Rogov, Feizi, Gnirke, Callan, Kinney, Kellis, Lander, Mikkelsen Learning to read and write the transcriptional regulatory code is of central importance to progress in genetic analysis and engineering. Here we describe a massively parallel reporter assay (MPRA) that facilitates the systematic dissection of transcriptional regulatory elements. In MPRA, microarray-synthesized DNA regulatory elements and unique sequence tags are cloned into plasmids to generate a library of reporter constructs. These constructs are transfected into cells and tag expression is assayed by high-throughput sequencing. We apply MPRA to compare 27,000 variants of two inducible enhancers in human cells: a synthetic cAMP-regulated enhancer and the virus-inducible interferon-β enhancer. We first show that the resulting data define accurate maps of functional transcription factor binding sites in both enhancers at single-nucleotide resolution. We then use the data to train quantitative sequence-activity models (QSAMs) of the two enhancers. We show that QSAMs from two cellular states can be combined to design enhancer variants that optimize potentially conflicting objectives, such as maximizing induced activity while minimizing basal activity. Nature Biotechnology Feb 26, 2012 65. RNA folding with soft constraints: reconciliation of probing data and thermodynamic secondary structure prediction Washietl, Hofacker, Stadler, Kellis Thermodynamic folding algorithms and structure probing experiments are commonly used to determine the secondary structure of RNAs. Here we propose a formal framework to reconcile information from both prediction algorithms and probing experiments. The thermodynamic energy parameters are adjusted using 'pseudo-energies' to minimize the discrepancy between prediction and experiment.
Our framework differs from related approaches that used pseudo-energies in several key aspects. (i) The energy model is only changed when necessary and no adjustments are made if prediction and experiment are consistent. (ii) Pseudo-energies remain biophysically interpretable and hold positional information where experiment and model disagree. (iii) The whole thermodynamic ensemble of structures is considered thus allowing to reconstruct mixtures of suboptimal structures from seemingly contradicting data. (iv) The noise of the energy model and the experimental data is explicitly modeled leading to an intuitive weighting factor through which the problem can be seen as folding with 'soft' constraints of different strength. We present an efficient algorithm to iteratively calculate pseudo-energies within this framework and demonstrate how this approach can be used in combination with SHAPE chemical probing data to improve secondary structure prediction. We further demonstrate that the pseudo-energies correlate with biophysical effects that are known to affect RNA folding such as chemical nucleotide modifications and protein binding. Nucleic Acids Research, Jan 28, 2012 64. Unified modeling of gene duplication, loss, and coalescence using a locus tree Rasmussen, Kellis Gene phylogenies provide a rich source of information about the way evolution shapes genomes, populations, and phenotypes. In addition to substitutions, evolutionary events such as gene duplication and loss (as well as horizontal transfer) play a major role in gene evolution, and many phylogenetic models have been developed in order to reconstruct and study these events. However, these models typically make the simplifying assumption that population-related effects such as incomplete lineage sorting (ILS) are negligible. 
While this assumption may have been reasonable in some settings, it has become increasingly problematic as increased genome sequencing has led to denser phylogenies, where effects such as ILS are more prominent. To address this challenge, we present a new probabilistic model, DLCoal, that defines gene duplication and loss in a population setting, such that coalescence and ILS can be directly addressed. Interestingly, this model implies that in addition to the usual gene tree and species tree there exists a third tree, the locus tree, which will likely have many applications. Using this model, we develop the first general reconciliation method that accurately infers gene duplications and losses in the presence of ILS, and we show its improved inference of orthologs, paralogs, duplications, and losses for a variety of clades, including flies, fungi, and primates. Also, our simulations show that gene duplications increase the frequency of ILS, further illustrating the importance of a joint model. Going forward, we believe this unified model can offer insights to questions in both phylogenetics and population genetics. Genome Research Jan 23, 2012 63. Combinatorial Patterning of Chromatin Regulators Uncovered by Genome-wide Location Analysis in Human Cells Ram, Goren, Amit, Shoresh, Yosef, Ernst, Kellis, Gymrek, Issner, Coyne, Durham, Zhang, Donaghey, Epstein, Regev, Bernstein Hundreds of chromatin regulators (CRs) control chromatin structure and function by catalyzing and binding histone modifications, yet the rules governing these key processes remain obscure. Here, we present a systematic approach to infer CR function. We developed ChIP-string, a meso-scale assay that combines chromatin immunoprecipitation with a signature readout of 487 representative loci. We applied ChIP-string to screen 145 antibodies, thereby identifying effective reagents, which we used to map the genome-wide binding of 29 CRs in two cell types. 
We found that specific combinations of CRs colocalize in characteristic patterns at distinct chromatin environments, at genes of coherent functions, and at distal regulatory elements. When comparing between cell types, CRs redistribute to different loci but maintain their modular and combinatorial associations. Our work provides a multiplex method that substantially enhances the ability to monitor CR binding, presents a large resource of CR maps, and reveals common principles for combinatorial CR function. Cell 147:1628-39, Dec 23 2011 62. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants Ward, Kellis The resolution of genome-wide association studies (GWAS) is limited by the linkage disequilibrium (LD) structure of the population being studied. Selecting the most likely causal variants within an LD block is relatively straightforward within coding sequence, but is more difficult when all variants are intergenic. Predicting functional non-coding sequence has been recently facilitated by the availability of conservation and epigenomic information. We present HaploReg, a tool for exploring annotations of the non-coding genome among the results of published GWAS or novel sets of variants. Using LD information from the 1000 Genomes Project, linked SNPs and small indels can be visualized along with their predicted chromatin state in nine cell types, conservation across mammals and their effect on regulatory motifs. Sets of SNPs, such as those resulting from GWAS, are analyzed for an enrichment of cell type-specific enhancers. HaploReg will be useful to researchers developing mechanistic hypotheses of the impact of non-coding variants on clinical phenotypes and normal variation. The HaploReg database is available at http://compbio.mit.edu/HaploReg/ Nucleic Acids Research, Nov 7 2011 61. 
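The HaploReg abstract above relies on linkage disequilibrium (LD) to decide which variants are "linked" to a GWAS hit. The abstract does not state which LD statistic or threshold is used, so the sketch below simply shows the standard r² statistic between two biallelic sites, computed from phased haplotypes; the function name and the toy haplotypes are hypothetical.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic sites.

    haplotypes: list of (a, b) pairs, one per phased haplotype, where a and b
    are 0/1 alleles at the first and second site.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n           # allele-1 frequency, site 1
    p_b = sum(b for _, b in haplotypes) / n           # allele-1 frequency, site 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                              # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either site is monomorphic")
    return d * d / denom


# Toy haplotypes: the first pair of sites always co-occurs, the second never correlates.
fully_linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
r2_linked = ld_r_squared(fully_linked)
r2_unlinked = ld_r_squared(unlinked)
```

Variants whose r² with a lead SNP exceeds a chosen cutoff would be treated as one LD block in this kind of analysis; the cutoff itself is a user choice and is not specified in the abstract.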
A high-resolution map of evolutionary constraint in the human genome based on 29 eutherian mammals Lindblad-Toh, Garber, Zuk, Lin, Parker, Washietl, Kheradpour, Ernst, Jordan, Mauceli, Ward, Lowe, Holloway, Clamp, Gnerre, Alfoldi, Beal, Chang, Clawson, Palma, Fitzgerald, Flicek, Guttman, Hubisz, Jaffe, Jungreis, Kostka, Lara, Martins, Massingham, Moltke, Raney, Rasmussen, Stark, Vilella, Wen, Xie, Zody, Worley, Kovar, Muzny, Gibbs, Warren, Mardis, Weinstock, Wilson, Birney, Margulies, Herrero, Green, Haussler, Siepel, Goldman, Pollard, Pedersen, Lander, Kellis The comparison of related genomes has emerged as a powerful lens for genome interpretation. Here we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and locate constrained elements covering 4.2% of the genome. We use evolutionary signatures and comparisons with experimental data sets to suggest candidate functions for 60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements and more than 1,000 primate- and human-accelerated elements. Overlap with disease-associated variants indicates that our findings will be relevant for studies of human biology, health and disease. Nature 478:476-82, Oct 12 2011 60. Evidence of abundant stop codon readthrough in Drosophila and other Metazoa Jungreis, Lin, Spokony, Chan, Negre, Victorsen, White, Kellis While translational stop codon readthrough is often utilized by viral genomes it has been observed for only a handful of eukaryotic genes.
We previously used comparative genomics evidence to recognize protein-coding regions in 12 species of Drosophila and showed that for 149 genes the open reading frame following the stop codon has a protein-coding conservation signature, hinting that stop codon readthrough might be common in Drosophila. We return to this observation armed with deep RNA sequence data from the modENCODE project, an improved higher-resolution comparative genomics metric for detecting protein-coding regions, comparative sequence information from additional species, and directed experimental evidence. We report an expanded set of 283 readthrough candidates, including 16 double readthrough candidates; these were manually curated to rule out alternatives such as A-to-I editing, alternative splicing, dicistronic translation, and selenocysteine incorporation. We report experimental evidence of translation using GFP tagging and mass spectrometry for several readthrough regions. We find that the set of readthrough candidates differs from other genes in length, composition, conservation, stop codon context, and in some cases, conserved stem loops, providing clues about readthrough regulation and potential mechanisms. Lastly, we expand our studies beyond Drosophila and find evidence of abundant readthrough in several other insect species and one crustacean, and several readthrough candidates in nematode and human, suggesting that functionally-important translational stop codon readthrough is significantly more prevalent in Metazoa than previously recognized. Genome Research, Oct 12, 2011 59. New families of human regulatory RNA structures identified by comparative analysis of vertebrate genomes Parker, Moltke, Roth, Washietl, Wen, Kellis, Breaker, Pedersen Regulatory RNA structures are often members of families with multiple paralogous instances across the genome. 
Family members share functional and structural properties, which allow them to be studied as a whole, facilitating both bioinformatic and experimental characterization. We have developed a comparative method, EvoFam, for genome-wide identification of families of regulatory RNA structures, based on primary sequence and secondary structure similarity. We apply EvoFam to a 41-way genomic vertebrate alignment. Genome-wide, we identify 220 human, high-confidence families outside protein-coding regions comprising 725 individual structures, including 48 families with known structural RNA elements. Known families identified include both noncoding RNAs, e.g., miRNAs and the recently identified MALAT1/MEN β lincRNA family; and cis-regulatory structures, e.g., iron-responsive elements. We also identify tens of new families supported by strong evolutionary evidence and other statistical evidence, such as GO term enrichments. For some of these, detailed analysis has led to the formulation of specific functional hypotheses. Examples include two hypothesized auto-regulatory feedback mechanisms: one involving six long hairpins in the 3'-UTR of MAT2A, a key metabolic gene that produces the primary human methyl donor S-adenosylmethionine; the other involving a tRNA-like structure in the intron of the tRNA maturation gene POP1. We experimentally validate the predicted MAT2A structures. Finally, we identify potential new regulatory networks, including large families of short hairpins enriched in immunity-related genes, e.g., TNF, FOS, and CTLA4, which include known transcript destabilizing elements. Our findings exemplify the diversity of posttranscriptional regulation and provide a resource for further characterization of new regulatory mechanisms and families of noncoding RNAs. Genome Research 21:1929-43, Oct 12 2011 58.
Locating protein-coding sequences under purifying selection for additional, overlapping functions in 29 mammalian genomes Lin, Washietl, Kheradpour, Parker, Pedersen, Kellis The degeneracy of the genetic code allows protein-coding DNA and RNA sequences to simultaneously encode additional, overlapping functional elements. A sequence in which both protein-coding and additional overlapping functions have evolved under purifying selection should show increased evolutionary conservation compared to typical protein-coding genes - especially at synonymous sites. In this study, we use genome alignments of 29 placental mammals to systematically locate short regions within human ORFs that show conspicuously low estimated rates of synonymous substitution across these species. The 29-species alignment provides statistical power to locate more than 10,000 such regions with resolution down to nine-codon windows, which are found within more than a quarter of all human protein-coding genes and contain ~2% of their synonymous sites. We collect numerous lines of evidence that the observed synonymous constraint in these regions reflects selection on overlapping functional elements including splicing regulatory elements, dual-coding genes, RNA secondary structures, microRNA target sites, and developmental enhancers. Our results show that overlapping functional elements are common in mammalian genes, despite the vast genomic landscape. Genome Research 21:1916-28, Oct 12 2011. 57. Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration Wu, Cheng, Keller, Ernst, Kumar, Mishra, Morrissey, Dorman, Chen, Drautz, Giardine, Shibata, Song, Pimkin, Crawford, Furey, Kellis, Miller, Taylor, Schuster, Zhang, Chiaromonte, Blobel, Weiss, Hardison.
Interplays among lineage-specific nuclear proteins, chromatin modifying enzymes, and the basal transcription machinery govern cellular differentiation, but their dynamics of action and coordination with transcriptional control are not fully understood. Alterations in chromatin structure appear to establish a permissive state for gene activation at some loci, but they play an integral role in activation at other loci. To determine the predominant roles of chromatin states and factor occupancy in directing gene regulation during differentiation, we mapped chromatin accessibility, histone modifications, and nuclear factor occupancy genome-wide during mouse erythroid differentiation dependent on the master regulatory transcription factor GATA1. Notably, despite extensive changes in gene expression, the chromatin state profiles (proportions of a gene in a chromatin state dominated by activating or repressive histone modifications) and accessibility remain largely unchanged during GATA1-induced erythroid differentiation. In contrast, gene induction and repression are strongly associated with changes in patterns of transcription factor occupancy. Our results indicate that during erythroid differentiation, the broad features of chromatin states are established at the stage of lineage commitment, largely independently of GATA1. These determine permissiveness for expression, with subsequent induction or repression mediated by distinctive combinations of transcription factors. Genome Res. 2011 Oct;21(10):1659-71. Epub 2011 Jul 27. 56. Evolution at the Sub-gene Level: Domain Rearrangements in the Drosophila Phylogeny Wu, Rasmussen, Kellis Although the possibility of gene evolution by domain rearrangements has long been appreciated, current methods for reconstructing and systematically analyzing gene family evolution are limited to events such as duplication, loss, and sometimes, horizontal transfer. 
However, within the Drosophila clade, we find domain rearrangements occur in 35.9% of gene families, and thus, any comprehensive study of gene evolution in these species will need to account for such events. Here, we present a new computational model and algorithm for reconstructing gene evolution at the domain level. We develop a method for detecting homologous domains between genes, and present a phylogenetic algorithm for reconstructing maximum parsimony evolutionary histories that include domain generation, duplication, loss, merge (fusion), and split (fission) events. Using this method, we find that genes involved in fusion and fission are enriched in signaling and development, suggesting that domain rearrangements and reuse may be crucial in these processes. We also find that fusion is more abundant than fission and that fusion and fission events occur predominantly alongside duplication, with 92.5% and 34.3% of fusion and fission events retaining ancestral architectures in the duplicated copies. We provide a catalog of ~9000 genes that undergo domain rearrangement across nine sequenced species, along with possible mechanisms for their formation. These results dramatically expand on evolution at the sub-gene level and offer several insights into how new genes and functions arise between species. Mol Biol Evol. 2011 Sep 7. 55. Three periods of regulatory innovation during vertebrate evolution. Lowe, Kellis, Siepel, Raney, Clamp, Salama, Kingsley, Lindblad-Toh, Haussler The gain, loss, and modification of gene regulatory elements may underlie a substantial proportion of phenotypic changes on animal lineages. To investigate the gain of regulatory elements throughout vertebrate evolution, we identified genome-wide sets of putative regulatory regions for five vertebrates, including humans. These putative regulatory regions are conserved nonexonic elements (CNEEs), which are evolutionarily conserved yet do not overlap any coding or noncoding mature transcript. 
We then inferred the branch on which each CNEE came under selective constraint. Our analysis identified three extended periods in the evolution of gene regulatory elements. Early vertebrate evolution was characterized by regulatory gains near transcription factors and developmental genes, but this trend was replaced by innovations near extracellular signaling genes, and then innovations near posttranslational protein modifiers. Science. 2011 Aug 19;333(6045):1019-24. 54. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions Lin, Jungreis, Kellis As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multispecies nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models. We show that PhyloCSF's classification performance in 12-species Drosophila genome alignments exceeds all other methods we compared in a previous study. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE, and as interest grows in long non-coding RNAs, often initially recognized by their lack of protein coding potential rather than conserved RNA secondary structures. The Objective Caml source code and executables for GNU/Linux and Mac OS X are freely available at http://compbio.mit.edu/PhyloCSF Bioinformatics. 2011 Jul 1;27(13):i275-82. 53. 
A user's guide to the encyclopedia of DNA elements (ENCODE) ENCODE Project Consortium, Myers, Stamatoyannopoulos, Snyder, Dunham, Hardison, Bernstein, Gingeras, Kent, Birney, Wold, Crawford The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome. PLoS Biol. 2011 Apr;9(4):e1001046. Epub 2011 Apr 19. 52. Extensive and coordinated transcription of noncoding RNAs within cell-cycle promoters Hung, Wang, Lin, Koegel, Kotake, Grant, Horlings, Shah, Umbricht, Wang, Wang, Kong, Langerod, Børresen-Dale, Kim, Vijver, Sukumar, Whitfield, Kellis, Xiong, Wong, Chang Transcription of long noncoding RNAs (lncRNAs) within gene regulatory elements can modulate gene activity in response to external stimuli, but the scope and functions of such activity are not known. Here we use an ultrahigh-density array that tiles the promoters of 56 cell-cycle genes to interrogate 108 samples representing diverse perturbations. 
We identify 216 transcribed regions that encode putative lncRNAs, many with RT-PCR-validated periodic expression during the cell cycle, show altered expression in human cancers and are regulated in expression by specific oncogenic stimuli, stem cell differentiation or DNA damage. DNA damage induces five lncRNAs from the CDKN1A promoter, and one such lncRNA, named PANDA, is induced in a p53-dependent manner. PANDA interacts with the transcription factor NF-YA to limit expression of pro-apoptotic genes; PANDA depletion markedly sensitized human fibroblasts to apoptosis by doxorubicin. These findings suggest potentially widespread roles for promoter lncRNAs in cell-growth control. Nature Genetics 43:616-7, Jun 5, 2011. 51. An epigenetic signature for monoallelic olfactory receptor expression Magklara, Yen, Colquitt, Clowney, Allen, Markenscoff-Papadimitriou, Evans, Kheradpour, Mountoufaris, Carey, Barnea, Kellis, Lomvardas Constitutive heterochromatin is traditionally viewed as the static form of heterochromatin that silences pericentromeric and telomeric repeats in a cell cycle- and differentiation-independent manner. Here, we show that, in the mouse olfactory epithelium, olfactory receptor (OR) genes are marked in a highly dynamic fashion with the molecular hallmarks of constitutive heterochromatin, H3K9me3 and H4K20me3. The cell type and developmentally dependent deposition of these marks along the OR clusters are, most likely, reversed during the process of OR choice to allow for monogenic and monoallelic OR expression. In contrast to the current view of OR choice, our data suggest that OR silencing takes place before OR expression, indicating that it is not the product of an OR-elicited feedback signal. Our findings suggest that chromatin-mediated silencing lays a molecular foundation upon which singular and stochastic selection for gene expression can be applied. Cell 145(4):555-70, May 13, 2011. Epub Apr 28, 2011. 50. 
Comparative Functional Genomics of the Fission Yeasts Rhind, Chen, Yassour, Thompson, Haas, Habib, Wapinski, Roy, Lin, Heiman, Young, Furuya, Guo, Pidoux, Chen, Robbertse, Goldberg, Aoki, Bayne, Berlin, Desjardins, Dobbs, Dukaj, Fan, Fitzgerald, French, Gujja, Hansen, Keifenheim, Levin, Mosher, Müller, Pfiffner, Priest, Russ, Smialowska, Swoboda, Sykes, Vaughn, Vengrova, Yoder, Zeng, Allshire, Baulcombe, Birren, Brown, Ekwall, Kellis, Leatherwood, Levin, Margalit, Martienssen, Nieduszynski, Spatafora, Friedman, Dalgaard, Baumann, Niki, Regev, Nusbaum The fission yeast clade, comprising Schizosaccharomyces pombe, S. octosporus, S. cryophilus, and S. japonicus, occupies the basal branch of Ascomycete fungi and is an important model of eukaryote biology. A comparative annotation of these genomes identified a near extinction of transposons and the associated innovation of transposon-free centromeres. Expression analysis established that meiotic genes are subject to antisense transcription during vegetative growth, which suggests a mechanism for their tight regulation. In addition, trans-acting regulators control new genes within the context of expanded functional modules for meiosis and stress response. Differences in gene content and regulation also explain why, unlike the Saccharomycotina, fission yeasts cannot use ethanol as a primary carbon source. These analyses elucidate the genome structure and gene regulation of fission yeast and provide tools for investigation across the Schizosaccharomyces clade. Science 332(6032):930-6, Apr 21, 2011. 49. Mapping and analysis of chromatin state dynamics in nine human cell types Ernst, Kheradpour, Mikkelsen, Shoresh, Ward, Epstein, Zhang, Wang, Issner, Coyne, Ku, Durham, Kellis, Bernstein Chromatin profiling has emerged as a powerful means for annotating genomic elements and detecting regulatory activity. 
Here we generate and analyze a compendium of epigenomic maps for nine chromatin marks across nine cell types, in order to systematically characterize cis-regulatory elements, their cell type-specificities, and their functional interactions. We first identify recurrent combinations of histone modifications and use them to annotate diverse regulatory elements including promoters, enhancers, transcripts and insulators in each cell type. We next characterize the dynamics of these elements, revealing meaningful patterns of activity for promoter states and exquisite cell type-selectivity for enhancer states. We define multi-cell activity profiles that reflect the patterns of enhancer state activity across cell types, as well as analogous profiles for gene expression, regulatory motif enrichments, and expression of the corresponding regulators. We use correlations between these profiles to link candidate enhancers to putative target genes, to infer cell type-specific activators and repressors, and to predict and validate functional regulator binding motifs in specific chromatin states. These functional annotations and regulatory predictions enable us to revisit intergenic single-nucleotide polymorphisms (SNPs) associated with human disease in genome-wide association studies (GWAS). We find that for several diseases, top-scoring SNPs are precisely positioned within enhancer elements specifically active in relevant cell types. In several cases a disease variant affects a motif instance for one of the predicted causal regulators, thus providing a potential mechanistic explanation for the disease association. Our study presents a general framework for applying multi-cell chromatin state analysis to decipher cis-regulatory connections and their role in health and disease. Nature, doi:10.1038/nature09906, Epub ahead of print: March 23, 2011. 48. 
A Cis-Regulatory Map of the Drosophila Genome Negre, Brown, Ma, Bristow, Miller, Kheradpour, Loriaux, Sealfon, Li, Ishii, Spokony, Chen, Hwang, Wagner, Auburn, Domanus, Shah, Morrison, Zieba, Suchy, Senderowicz, Victorsen, Bild, Grundstad, Hanley, Mannervik, Venken, Bellen, White, Russell, Grossman, Ren, Posakony, Kellis, White Following the sequencing of human and model organism genomes, genome-wide annotation of regulatory information has emerged as a major challenge. Here we describe an initial map of the Drosophila melanogaster regulatory genome based on the developmental dynamics of chromatin modifications and chromatin modifying enzymes, on polymerase occupancy of promoters, on the dynamic binding of enhancer-associated proteins such as the transcriptional co-factor CBP, and on the localization of forty-one site-specific transcription factors at different stages of development. The entire dataset provides protein modification and binding annotations across 94% of the genome along with prediction and validation of 4 classes of regulatory elements: insulators, promoters, silencers and enhancers. This regulatory map reveals several newly discovered properties of genome regulation, including the lack of epigenetic marks at promoters of transiently expressed genes, the association of specific Histone Deacetylases (HDACs) with Polycomb Response Elements, the early role of CBP as a marker of enhancers and the occurrence of high-occupancy transcription factor binding sites that correlate with gene expression. Using these data we also generated a combinatorial analysis of transcription factors and DNA sequence motifs that are associated with different sets of developmentally co-expressed genes, providing a database for discovering the sets of regulatory inputs that control regulatory element function. Together, these cis-regulatory annotations serve as a foundation for further detailed analyses of the genomic regulatory code in Drosophila. 
Nature 471:527-531, March 23, 2011. 47. Comprehensive analysis of the Drosophila melanogaster chromatin landscape differentiates among chromosomes, genes, and regulatory elements Kharchenko, Alekseyenko, Schwartz, Minoda, Riddle, Ernst, Sabo, Larschan, Gorchakov, Gu, Linder-Basso, Plachetka, Shanower, Tolstorukov, Bishop, Canfield, Sandstrom, Thurman, Stamatoyannopoulos, Kellis, Elgin, Kuroda, Pirrotta, Karpen We present a genome-wide map of the chromatin landscape for Drosophila melanogaster, based on the distributions of 18 histone modifications and 9 combinatorial patterns identified by computational analysis. Integrative analysis with other genome-wide mapping data (non-histone chromatin proteins, DNaseI hypersensitive sites, GRO-seq, short/long RNA expression) reveals distinct properties of chromosomes, genes, regulatory elements and other functional domains. In addition to highlighting the special identities of the male X and the 4th chromosomes, this analysis identifies distinct chromatin signatures among active genes that are correlated with differences in gene length, exonic structure, regulatory function, and genomic context. It also reveals a diversity of chromatin signatures among Polycomb targets, including a subset with paused RNA polymerase. This systematic profiling and integrative analysis of chromatin signatures provides important insights into the differential packaging of functional elements, and will serve as a valuable resource for future experimental investigations of genome structure and function. Nature 471:480-485, March 23, 2011. Epub ahead of print: Dec 22, 2010. 46. 
Identification of functional elements and regulatory circuits in Drosophila by large-scale data integration The modENCODE Consortium, Roy, Ernst, Kharchenko, Kheradpour, Negre, Eaton, Landolin, Bristow, Ma, Lin, Washietl, Arshinoff, Ay, Meyer, Robine, Washington, Di Stefano, Berezikov, Brown, Brown, Candeias, Carlson, Carr, Jungreis, Marbach, Sealfon, Tolstorukov, Alekseyenko, Artieri, Boley, Booth, Brooks, Dai, Davis, Duff, Feng, Gorchakov, Gu, Henikoff, Kapranov, Li, Li, MacAlpine, Malone, Minoda, Nordman, Okamura, Perry, Powell, Riddle, Sakai, Samsonova, Sandler, Schwartz, Sher, Spokony, Sturgill, van Baren, Will, Wan, Yang, Yu, Feingold, Good, Guyer, Lowdon, Ahmad, Andrews, Berger, Bickel, Brenner, Brent, Cherbas, Elgin, Gingeras, Grossman, Hoskins, Kaufman, Kent, Kuroda, Orr-Weaver, Perrimon, Pirrotta, Posakony, Ren, Russell, Cherbas, Graveley, Lewis, Micklem, Oliver, Park, Celniker, Henikoff, Karpen, Lai, MacAlpine, Stein, White, Kellis Several years after the initial sequencing of the genomes from human and other organisms, the vast majority of each genome remains unannotated, and it is still unclear how to translate genomic information into a functional map of cellular and developmental programs. To address this question, the Drosophila modENCODE project has undertaken a large-scale effort to comprehensively map transcription, regulator binding, chromatin state, replication, and nucleosome properties across a developmental time-course and in multiple cell lines. Here, we report our initial integrative analysis of the first phase of the project, encompassing more than 1000 datasets generated over four years across six production centers. Our integrated annotation enabled the discovery of new protein-coding, non-coding, RNA regulatory, replication, and chromatin elements that more than triple the annotated portion of the genome. 
We study correlated activity patterns of these elements to infer a functional regulatory network, which we use to predict putative functions for new genes, reveal stage-specific and tissue-specific regulators, and infer predictive models of gene expression. Our results provide a reference annotation that can inform directed experimental and computational studies in Drosophila and related species, and provide a model for systematic data integration towards the comprehensive genomic and functional annotation of any genome, including the human. Science, Dec 24, 2010 . 45b. Error and error mitigation in low-coverage genome assemblies Hubisz, Lin, Kellis, Siepel The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ~2X coverage. Here we examine the extent of sequencing error in these 2X assemblies, and its potential impact in downstream analyses. By comparing 2X assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1-4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. 
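One of the ideas mentioned above, that sequencing error is well predicted by quality scores, can be illustrated with a minimal sketch. This is not the authors' SEM method: the function name `mask_low_quality`, the Phred-style integer scores, and the threshold of 20 are all assumptions chosen for illustration.

```python
from typing import Sequence


def mask_low_quality(bases: str, quals: Sequence[int], min_qual: int = 20) -> str:
    """Replace every base whose quality score is below min_qual with 'N'.

    Hypothetical helper: the threshold and the choice of 'N' as the mask
    character are illustrative assumptions, not values from the paper.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")
    return "".join(b if q >= min_qual else "N" for b, q in zip(bases, quals))


# Two low-scoring positions (scores 12 and 8) are masked; the rest are kept.
masked = mask_low_quality("ACGTAC", [40, 35, 12, 38, 8, 30])
print(masked)
```

Masking trades a small amount of over-correction (good bases with low scores are hidden) for fewer spurious differences in downstream comparisons.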
Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download. PLoS ONE 6(2):e17034, Feb 14 2011. 45. SubMAP: Aligning metabolic pathways with subnetwork mappings Ay, Kellis, Kahveci We consider the problem of aligning two metabolic pathways. Unlike traditional approaches, we do not restrict the alignment to one-to-one mappings between the molecules (nodes) of the input pathways (graphs). We follow the observation that in nature different organisms can perform the same or similar functions through different sets of reactions and molecules. The number and the topology of the molecules in these alternative sets often vary from one organism to another. With the motivation that an accurate biological alignment should be able to reveal these functionally similar molecule sets across different species, we develop an algorithm that first measures the similarities between different nodes using a mixture of homology and topological similarity. We combine the two metrics by employing an eigenvalue formulation. We then search for an alignment between the two input pathways that maximizes a similarity score, evaluated as the sum of the similarities of the mapped subsets of size at most a given integer k, and also does not contain any conflicting mappings. Here we prove that this maximization is NP-hard by a reduction from the Maximum Weight Independent Set (MWIS) problem. We then convert our problem into an instance of MWIS and use an efficient vertex-selection strategy to extract the mappings that constitute our alignment. We name our algorithm SubMAP (Subnetwork Mappings in Alignment of Pathways). We evaluate its accuracy and performance on real datasets. 
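The reduction to MWIS described above can be sketched as follows: each candidate mapping becomes a weighted vertex, and an edge joins any two mappings that conflict. The abstract does not say which vertex-selection strategy SubMAP uses, so the greedy rule below (pick the vertex with the largest weight divided by its remaining degree plus one) is an assumption, as are the function name and the toy data.

```python
def greedy_mwis(weights, conflicts):
    """Greedy heuristic for Maximum Weight Independent Set.

    weights:   dict mapping vertex name -> similarity score
    conflicts: iterable of (u, v) pairs that cannot both be selected
    The selection rule is an illustrative assumption, not SubMAP's own.
    """
    neighbors = {v: set() for v in weights}
    for u, v in conflicts:
        neighbors[u].add(v)
        neighbors[v].add(u)

    remaining = set(weights)
    chosen = set()
    while remaining:
        # Score each live vertex by weight / (live degree + 1);
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(
            sorted(remaining),
            key=lambda v: weights[v] / (len(neighbors[v] & remaining) + 1),
        )
        chosen.add(best)
        # Selecting a vertex rules out all of its conflicting neighbors.
        remaining -= neighbors[best] | {best}
    return chosen


# Toy instance: four hypothetical candidate mappings on a conflict path.
scores = {"m1": 0.9, "m2": 0.8, "m3": 0.5, "m4": 0.4}
edges = [("m1", "m2"), ("m2", "m3"), ("m3", "m4")]
print(sorted(greedy_mwis(scores, edges)))
```

A greedy pass gives no optimality guarantee in general, which is consistent with the problem being NP-hard; it only guarantees that the returned set is conflict-free.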
Our empirical results demonstrate that SubMAP can identify biologically-relevant mappings that are missed by traditional alignment methods and that it is scalable for metabolic pathways of arbitrary topology, including searching for a query pathway of size 70 against the complete KEGG database of 1,842 pathways. Journal of Computational Biology, 12 pages, in press. 44. The NIH Roadmap Epigenomics Mapping Consortium Bernstein, Stamatoyannopoulos, Costello, Ren, Milosavljevic, Meissner, Kellis, Marra, Beaudet, Ecker, Farnham, Hirst, Lander, Mikkelsen, Thomson The NIH Roadmap Epigenomics Mapping Consortium aims to produce a public resource of epigenomic maps for stem cells and primary ex vivo tissues selected to represent the normal counterparts of tissues and organ systems frequently involved in human disease. The maps will detail the genomewide landscapes of DNA methylation, histone modifications and related chromatin features. They are intended to provide a reference for studies of the genetic and epigenetic events that underlie human development, diversity and disease. Here we describe the organizational structure, goals and anticipated deliverables of the consortium. Nature Biotechnology, 3 pages, in press. 43. Optimization of parameters for coverage of low molecular weight proteins. Müller, Kohajda, Findeiß, Stadler, Washietl, Kellis, vonBergen, Kalkhof Proteins with molecular weights below 25 kDa are involved in major biological processes such as ribosome formation, stress adaption (e.g., temperature reduction) and cell cycle control. Despite their importance, the coverage of smaller proteins in standard proteome studies is rather sparse. Here we investigated biochemical and mass spectrometric parameters that influence coverage and validity of identification. 
The underrepresentation of low molecular weight (LMW) proteins may be attributed to the low numbers of proteolytic peptides formed by tryptic digestion as well as their tendency to be lost in protein separation and concentration/desalting procedures. In a systematic investigation of the LMW proteome of Escherichia coli, a total of 455 LMW proteins (27% of the 1672 listed in the SwissProt protein database) were identified, corresponding to a coverage of 62% of the known cytosolic LMW proteins. Of these proteins, 93 had not yet been functionally classified, and five had not previously been confirmed at the protein level. In this study, the influences of protein extraction (either urea or TFA), proteolytic digestion (solely, and the combined usage of trypsin and AspN as endoproteases) and protein separation (gel- or non-gel-based) were investigated. Compared to the standard procedure based solely on the use of urea lysis buffer, in-gel separation and tryptic digestion, the complementary use of TFA for extraction or endoprotease AspN for proteolysis permits the identification of an extra 72 (32%) and 51 proteins (23%), respectively. Regarding mass spectrometry analysis with an LTQ Orbitrap mass spectrometer, collision-induced fragmentation (CID and HCD) and electron transfer dissociation using the linear ion trap (IT) or the Orbitrap as the analyzer were compared. IT-CID was found to yield the best identification rate, whereas IT-ETD provided almost comparable results in terms of LMW proteome coverage. The high overlap between the proteins identified with IT-CID and IT-ETD allowed the validation of 75% of the identified proteins using this orthogonal fragmentation technique. Furthermore, a new approach to evaluating and improving the completeness of protein databases that utilizes the program RNAcode was introduced and examined. Anal Bioanal Chem. 2010 Aug 28. . PMID: 20803007 42. 
Discovery and characterization of chromatin states for systematic annotation of the human genome. Ernst, Kellis A plethora of epigenetic modifications have been described in the human genome and shown to play diverse roles in gene regulation, cellular differentiation and the onset of disease. Although individual modifications have been linked to the activity levels of various genetic functional elements, their combinatorial patterns are still unresolved and their potential for systematic de novo genome annotation remains untapped. Here, we use a multivariate Hidden Markov Model to reveal 'chromatin states' in human T cells, based on recurrent and spatially coherent combinations of chromatin marks. We define 51 distinct chromatin states, including promoter-associated, transcription-associated, active intergenic, large-scale repressed and repeat-associated states. Each chromatin state shows specific enrichments in functional annotations, sequence motifs and specific experimentally observed characteristics, suggesting distinct biological roles. This approach provides a complementary functional annotation of the human genome that reveals the genome-wide locations of diverse classes of epigenetic function. Nat Biotechnol. 2010 Aug;28(8):817-25. Epub 2010 Jul 25 . PMCID: PMC2919626 PMID: 20657582 41. A Bayesian Approach for Fast and Accurate Gene Tree Reconstruction. Rasmussen, Kellis Recent sequencing and computing advances have enabled phylogenetic analyses to expand to both entire genomes and large clades, thus requiring more efficient and accurate methods designed specifically for the phylogenomic context. Here we present SPIMAP, an efficient Bayesian method for reconstructing gene trees in the presence of a known species tree. We observe many improvements in reconstruction accuracy, achieved by modeling multiple aspects of evolution, including gene duplication and loss rates, speciation times, and correlated substitution rate variation across both species and loci. 
We have implemented and applied this method on two clades of fully-sequenced species, 12 Drosophila and 16 fungal genomes as well as simulated phylogenies, and find dramatic improvements in reconstruction accuracy as compared to the most popular existing methods, including those that take the species tree into account. We find that reconstruction inaccuracies of traditional phylogenetic methods overestimate the number of duplication and loss events by as much as 2 to 3 fold, while our method achieves significantly higher accuracy. We feel the results and methods presented here will have many important implications for future investigations of gene evolution. Mol Biol Evol. 2010 Jul 25. PMID: 20660489 40. Sequences to systems. Kellis, Rinn From a chemical standpoint, a cell can seem like a densely intertwined mess of molecules, but their emergent systems-level properties exhibit structure, coordination and order, leading to the orchestrated complexity we call life. As the seventh annual meeting on Systems Biology: Global Regulation of Gene Expression at the Cold Spring Harbor Laboratory showcased, systems biology aims to bridge the vast gap of knowledge between the pairwise molecular interactions of individual cellular components and the large-scale behavior of cellular circuits that result from their synergism. The talks given at the meeting exemplified a new definition of systems biology, where genomic technologies can revolutionize our understanding of classical problems by enabling molecular interactions to be linked to systems behavior in a mechanistic way. Genome Biol. 2010;11(5):303. Epub 2010 May 25. PMCID: PMC2898084 PMID: 20500907 39. A comprehensive map of insulator elements for the Drosophila genome. Negre, Brown, Shah, Kheradpour, Morrison, Henikoff, Feng, Ahmad, Russell, White, Stein, Henikoff, Kellis, White Insulators are DNA sequences that control the interactions among genomic regulatory elements and act as chromatin boundaries. 
A thorough understanding of their location and function is necessary to address the complexities of metazoan gene regulation. We studied by ChIP-chip the genome-wide binding sites of 6 insulator-associated proteins (dCTCF, CP190, BEAF-32, Su(Hw), Mod(mdg4), and GAF) to obtain the first comprehensive map of insulator elements in Drosophila embryos. We identify over 14,000 putative insulators, including all classically defined insulators. We find two major classes of insulators defined by dCTCF/CP190/BEAF-32 and Su(Hw), respectively. Distributional analyses of insulators revealed that particular sub-classes of insulator elements are excluded between cis-regulatory elements and their target promoters; divide differentially expressed, alternative, and divergent promoters; act as chromatin boundaries; are associated with chromosomal breakpoints among species; and are embedded within active chromatin domains. Together, these results provide a map demarcating the boundaries of gene regulatory units and a framework for understanding insulator function during the development and evolution of Drosophila. PLoS Genet. 2010 Jan 15;6(1):e1000814. PMCID: PMC2797089 PMID: 20084099 38. The Tasmanian devil transcriptome reveals Schwann cell origins of a clonally transmissible cancer. Murchison, Tovar, Hsu, Bender, Kheradpour, Rebbeck, Obendorf, Conlan, Bahlo, Blizzard, Pyecroft, Kreiss, Kellis, Stark, Harkins, Marshall, Woods, Hannon, Papenfuss The Tasmanian devil, a marsupial carnivore, is endangered because of the emergence of a transmissible cancer known as devil facial tumor disease (DFTD). This fatal cancer is clonally derived and is an allograft transmitted between devils by biting. We performed a large-scale genetic analysis of DFTD with microsatellite genotyping, a mitochondrial genome analysis, and deep sequencing of the DFTD transcriptome and microRNAs. 
These studies confirm that DFTD is a monophyletic clonally transmissible tumor and suggest that the disease is of Schwann cell origin. On the basis of these results, we have generated a diagnostic marker for DFTD and identify a suite of genes relevant to DFTD pathology and transmission. We provide a genomic data set for the Tasmanian devil that is applicable to cancer diagnosis, disease evolution, and conservation biology. Science. 2010 Jan 1;327(5961):84-7. PMID: 20044575 37. Motif Discovery in Physiological Datasets: A Methodology for Inferring Predictive Elements. Syed, Stultz, Kellis, Indyk, Guttag In this article, we propose a methodology for identifying predictive physiological patterns in the absence of prior knowledge. We use the principle of conservation to identify activity that consistently precedes an outcome in patients, and describe a two-stage process that allows us to efficiently search for such patterns in large datasets. This involves first transforming continuous physiological signals from patients into symbolic sequences, and then searching for patterns in these reduced representations that are strongly associated with an outcome. Our strategy of identifying conserved activity that is unlikely to have occurred purely by chance in symbolic data is analogous to the discovery of regulatory motifs in genomic datasets. We build upon existing work in this area, generalizing the notion of a regulatory motif and enhancing current techniques to operate robustly on non-genomic data. We also address two significant considerations associated with motif discovery in general: computational efficiency and robustness in the presence of degeneracy and noise. To deal with these issues, we introduce the concept of active regions and new subset-based techniques such as a two-layer Gibbs sampling algorithm. 
These extensions allow for a framework for information inference, where precursors are identified as approximately conserved activity of arbitrary complexity preceding multiple occurrences of an event. We evaluated our solution on a population of patients who experienced sudden cardiac death and attempted to discover electrocardiographic activity that may be associated with the endpoint of death. To assess the predictive patterns discovered, we compared likelihood scores for motifs in the sudden death population against control populations of normal individuals and those with non-fatal supraventricular arrhythmias. Our results suggest that predictive motif discovery may be able to identify clinically relevant information even in the absence of significant prior knowledge. ACM Trans Knowl Discov Data. 2010 Jan;4(1):2. PMCID: PMC2923403 PMID: 20730037 36. Unlocking the secrets of the genome. Celniker, Dillon, Gerstein, Gunsalus, Henikoff, Karpen, Kellis, Lai, Lieb, MacAlpine, Micklem, Piano, Snyder, Stein, White, Waterston Despite the successes of genomics, little is known about how genetic information produces complex organisms. A look at the crucial functional elements of fly and worm genomes could change that. Nature. 2009 Jun 18;459(7249):927-30. PMCID: PMC2843545 PMID: 19536255 35. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes. Pruitt, Harrow, Harte, Wallin, Diekhans, Maglott, Searle, Farrell, Loveland, Ruef, Hart, Suner, Landrum, Aken, Ayling, Baertsch, Fernandez-Banet, Cherry, Curwen, Dicuccio, Kellis, Lee, Lin, Schuster, Shkeda, Amid, Brown, Dukhanina, Frankish, Hart, Maidak, Mudge, Murphy, Murphy, Rajan, Rajput, Riddick, Snow, Steward, Webb, Weber, Wilming, Wu, Birney, Haussler, Hubbard, Ostell, Durbin, Lipman Effective use of the human and mouse genomes requires reliable identification of genes and their products. 
Although multiple public resources provide annotation, different methods are used that can result in similar but not identical representation of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates on manually reviewing inconsistent protein annotations between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes, while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically-annotated, protein-coding regions. Genome Res. 2009 Jul;19(7):1316-23. Epub 2009 Jun 4 . PMCID: PMC2704439 PMID: 19498102 34. Evolution of pathogenicity and sexual reproduction in eight Candida genomes. Butler, Rasmussen, Lin, Santos, Sakthikumar, Munro, Rheinbay, Grabherr, Forche, Reedy, Agrafioti, Arnaud, Bates, Brown, Brunke, Costanzo, Fitzpatrick, de, Harris, Hoyer, Hube, Klis, Kodira, Lennard, Logue, Martin, Neiman, Nikolaou, Quail, Quinn, Santos, Schmitzberger, Sherlock, Shah, Silverstein, Skrzypek, Soll, Staggs, Stansfield, Stumpf, Sudbery, Srikantha, Zeng, Berman, Berriman, Heitman, Gow, Lorenz, Birren, Kellis, Cuomo Candida species are the most common cause of opportunistic fungal infection worldwide. 
Here we report the genome sequences of six Candida species and compare these and related pathogens and non-pathogens. There are significant expansions of cell wall, secreted and transporter gene families in pathogenic species, suggesting adaptations associated with virulence. Large genomic tracts are homozygous in three diploid species, possibly resulting from recent recombination events. Surprisingly, key components of the mating and meiosis pathways are missing from several species. These include major differences at the mating-type loci (MTL); Lodderomyces elongisporus lacks MTL, and components of the a1/α2 cell identity determinant were lost in other species, raising questions about how mating and cell types are controlled. Analysis of the CUG leucine-to-serine genetic-code change reveals that 99% of ancestral CUG codons were erased and new ones arose elsewhere. Lastly, we revise the Candida albicans gene catalogue, identifying many new genes. Nature. 2009 Jun 4;459(7247):657-62. PMCID: PMC2834264 PMID: 19465905 33. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Heintzman, Hon, Hawkins, Kheradpour, Stark, Harp, Ye, Lee, Stuart, Ching, Ching, Antosiewicz-Bourget, Liu, Zhang, Green, Lobanenkov, Stewart, Thomson, Crawford, Kellis, Ren The human body is composed of diverse cell types with distinct functions. Although it is known that lineage specification depends on cell-specific gene expression, which in turn is driven by promoters, enhancers, insulators and other cis-regulatory DNA sequences for each gene, the relative roles of these regulatory elements in this process are not clear. We have previously developed a chromatin-immunoprecipitation-based microarray method (ChIP-chip) to locate promoters, enhancers and insulators in the human genome. Here we use the same approach to identify these elements in multiple cell types and investigate their roles in cell-type-specific gene expression. 
We observed that the chromatin state at promoters and CTCF-binding at insulators is largely invariant across diverse cell types. In contrast, enhancers are marked with highly cell-type-specific histone modification patterns, strongly correlate to cell-type-specific gene expression programs on a global scale, and are functionally active in a cell-type-specific manner. Our results define over 55,000 potential transcriptional enhancers in the human genome, significantly expanding the current catalogue of human enhancers and highlighting the role of these elements in cell-type-specific gene expression. Nature. 2009 May 7;459(7243):108-12. Epub 2009 Mar 18 . PMCID: PMC2910248 PMID: 19295514 32. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Guttman, Amit, Garber, French, Lin, Feldser, Huarte, Zuk, Carey, Cassady, Cabili, Jaenisch, Mikkelsen, Jacks, Hacohen, Bernstein, Kellis, Regev, Rinn, Lander There is growing recognition that mammalian cells produce many thousands of large intergenic transcripts. However, the functional significance of these transcripts has been particularly controversial. Although there are some well-characterized examples, most (95%) show little evidence of evolutionary conservation and have been suggested to represent transcriptional noise. Here we report a new approach to identifying large non-coding RNAs using chromatin-state maps to discover discrete transcriptional units intervening known protein-coding loci. Our approach identified approximately 1,600 large multi-exonic RNAs across four mouse cell types. In sharp contrast to previous collections, these large intervening non-coding RNAs (lincRNAs) show strong purifying selection in their genomic loci, exonic sequences and promoter regions, with greater than 95% showing clear evolutionary conservation. 
We also developed a functional genomics approach that assigns putative functions to each lincRNA, demonstrating a diverse range of roles for lincRNAs in processes from embryonic stem cell pluripotency to cell proliferation. We obtained independent functional validation for the predictions for over 100 lincRNAs, using cell-based assays. In particular, we demonstrate that specific lincRNAs are transcriptionally regulated by key transcription factors in these processes such as p53, NFkappaB, Sox2, Oct4 (also known as Pou5f1) and Nanog. Together, these results define a unique collection of functional lincRNAs that are highly conserved and implicated in diverse biological processes. Nature. 2009 Mar 12;458(7235):223-7. Epub 2009 Feb 1 . PMCID: PMC2754849 PMID: 19182780 31. An endogenous small interfering RNA pathway in Drosophila. Czech, Malone, Zhou, Stark, Schlingeheyde, Dus, Perrimon, Kellis, Wohlschlegel, Sachidanandam, Hannon, Brennecke Drosophila endogenous small RNAs are categorized according to their mechanisms of biogenesis and the Argonaute protein to which they bind. MicroRNAs are a class of ubiquitously expressed RNAs of approximately 22 nucleotides in length, which arise from structured precursors through the action of Drosha-Pasha and Dicer-1-Loquacious complexes. These join Argonaute-1 to regulate gene expression. A second endogenous small RNA class, the Piwi-interacting RNAs, bind Piwi proteins and suppress transposons. Piwi-interacting RNAs are restricted to the gonad, and at least a subset of these arises by Piwi-catalysed cleavage of single-stranded RNAs. Here we show that Drosophila generates a third small RNA class, endogenous small interfering RNAs, in both gonadal and somatic tissues. Production of these RNAs requires Dicer-2, but a subset depends preferentially on Loquacious rather than the canonical Dicer-2 partner, R2D2 (ref. 14). 
Endogenous small interfering RNAs arise both from convergent transcription units and from structured genomic loci in a tissue-specific fashion. They predominantly join Argonaute-2 and have the capacity, as a class, to target both protein-coding genes and mobile elements. These observations expand the repertoire of small RNAs in Drosophila, adding a class that blurs distinctions based on known biogenesis mechanisms and functional roles. Nature. 2008 Jun 5;453(7196):798-802. Epub 2008 May 7 . PMCID: PMC2895258 PMID: 18463631 30. Genome analysis of the platypus reveals unique signatures of evolution. Warren, Hillier, Marshall, Birney, Ponting, Grützner, Belov, Miller, Clarke, Chinwalla, Yang, Heger, Locke, Miethke, Waters, Veyrunes, Fulton, Fulton, Graves, Wallis, Puente, López-Otín, Ordóñez, Eichler, Chen, Cheng, Deakin, Alsop, Thompson, Kirby, Papenfuss, Wakefield, Olender, Lancet, Huttley, Smit, Pask, Temple-Smith, Batzer, Walker, Konkel, Harris, Whittington, Wong, Gemmell, Buschiazzo, Vargas, Merkel, Schmitz, Zemann, Churakov, Kriegs, Brosius, Murchison, Sachidanandam, Smith, Hannon, Tsend-Ayush, McMillan, Attenborough, Rens, Ferguson-Smith, Lefevre, Sharp, Nicholas, Ray, Kube, Reinhardt, Pringle, Taylor, Jones, Nixon, Dacheux, Niwa, Sekita, Huang, Stark, Kheradpour, Kellis, Flicek, Chen, Webber, Hardison, Nelson, Hallsworth-Pepin, Delehaunty, Markovic, Minx, Feng, Kremitzki, Mitreva, Glasscock, Wylie, Wohldmann, Thiru, Nhan, Pohl, Smith, Hou, Nefedov, de, Renfree, Mardis, Wilson We present a draft genome sequence of the platypus, Ornithorhynchus anatinus. This monotreme exhibits a fascinating combination of reptilian and mammalian characters. For example, platypuses have a coat of fur adapted to an aquatic lifestyle; platypus females lactate, yet lay eggs; and males are equipped with venom similar to that of reptiles. Analysis of the first monotreme genome aligned these features with genetic innovations. 
We find that reptile and platypus venom proteins have been co-opted independently from the same gene families; milk protein genes are conserved despite platypuses laying eggs; and immune gene family expansions are directly related to platypus biology. Expansions of protein, non-protein-coding RNA and microRNA families, as well as repeat elements, are identified. Sequencing of this genome now provides a valuable resource for deep mammalian comparative analyses, as well as for monotreme biology and conservation. Nature. 2008 May 8;453(7192):175-83 . PMCID: PMC2803040 PMID: 18464734 29. Conservation of small RNA pathways in platypus. Murchison, Kheradpour, Sachidanandam, Smith, Hodges, Xuan, Kellis, Grützner, Stark, Hannon Small RNA pathways play evolutionarily conserved roles in gene regulation and defense from parasitic nucleic acids. The character and expression patterns of small RNAs show conservation throughout animal lineages, but specific animal clades also show variations on these recurring themes, including species-specific small RNAs. The monotremes, with only platypus and four species of echidna as extant members, represent the basal branch of the mammalian lineage. Here, we examine the small RNA pathways of monotremes by deep sequencing of six platypus and echidna tissues. We find that highly conserved microRNA species display their signature tissue-specific expression patterns. In addition, we find a large rapidly evolving cluster of microRNAs on platypus chromosome X1, which is unique to monotremes. Platypus and echidna testes contain a robust Piwi-interacting (piRNA) system, which appears to be participating in ongoing transposon defense. Genome Res. 2008 Jun;18(6):995-1004. Epub 2008 May 7 . PMCID: PMC2413167 PMID: 18463306 28. Performance and scalability of discriminative metrics for comparative gene identification in 12 Drosophila genomes. 
Lin, Deoras, Rasmussen, Kellis Comparative genomics of multiple related species is a powerful methodology for the discovery of functional genomic elements, and its power should increase with the number of species compared. Here, we use 12 Drosophila genomes to study the power of comparative genomics metrics to distinguish between protein-coding and non-coding regions. First, we study the relative power of different comparative metrics and their relationship to single-species metrics. We find that even relatively simple multi-species metrics robustly outperform advanced single-species metrics, especially for shorter exons (≤240 nt), which are common in animal genomes. Moreover, the two capture largely independent features of protein-coding genes, with different sensitivity/specificity trade-offs, such that their combinations lead to even greater discriminatory power. In addition, we study how discovery power scales with the number and phylogenetic distance of the genomes compared. We find that species at a broad range of distances are comparably effective informants for pairwise comparative gene identification, but that these are surpassed by multi-species comparisons at similar evolutionary divergence. In particular, while pairwise discovery power plateaued at larger distances and never outperformed the most advanced single-species metrics, multi-species comparisons continued to benefit even from the most distant species with no apparent saturation. Last, we find that genes in functional categories typically considered fast-evolving can nonetheless be recovered at very high rates using comparative methods. Our results have implications for comparative genomics analyses in any species, including the human. PLoS Comput Biol. 2008 Apr 18;4(4):e1000067 . PMCID: PMC2291194 PMID: 18421375 27. The evolutionary dynamics of the Saccharomyces cerevisiae protein interaction network after duplication. 
Presser, Elowitz, Kellis, Kishony Gene duplication is an important mechanism in the evolution of protein interaction networks. Duplications are followed by the gain and loss of interactions, rewiring the network at some unknown rate. Because rewiring is likely to change the distribution of network motifs within the duplicated interaction set, it should be possible to study network rewiring by tracking the evolution of these motifs. We have developed a mathematical framework that, together with duplication data from comparative genomic and proteomic studies, allows us to infer the connectivity of the preduplication network and the changes in connectivity over time. We focused on the whole-genome duplication (WGD) event in Saccharomyces cerevisiae. The model allowed us to predict the frequency of intergene interaction before WGD and the post duplication probabilities of interaction gain and loss. We find that the predicted frequency of self-interactions in the preduplication network is significantly higher than that observed in today's network. This could suggest a structural difference between the modern and ancestral networks, preferential addition or retention of interactions between ohnologs, or selective pressure to preserve duplicates of self-interacting proteins. Proc Natl Acad Sci U S A. 2008 Jan 22;105(3):950-4. Epub 2008 Jan 16 . PMCID: PMC2242688 PMID: 18199840 26. A single Hox locus in Drosophila produces functional microRNAs from opposite DNA strands. Stark, Bushati, Jan, Kheradpour, Hodges, Brennecke, Bartel, Cohen, Kellis MicroRNAs (miRNAs) are approximately 22-nucleotide RNAs that are processed from characteristic precursor hairpins and pair to sites in messages of protein-coding genes to direct post-transcriptional repression. Here, we report that the miRNA iab-4 locus in the Drosophila Hox cluster is transcribed convergently from both DNA strands, giving rise to two distinct functional miRNAs. 
Both sense and antisense miRNA products target neighboring Hox genes via highly conserved sites, leading to homeotic transformations when ectopically expressed. We also report sense/antisense miRNAs in mouse and find antisense transcripts close to many miRNAs in both flies and mammals, suggesting that additional sense/antisense pairs exist. Genes Dev. 2008 Jan 1;22(1):8-13 . PMCID: PMC2151017 PMID: 18172160 25. Distinguishing protein-coding and noncoding genes in the human genome. Clamp, Fry, Kamal, Xie, Cuff, Lin, Kellis, Lindblad-Toh, Lander Although the Human Genome Project was completed 4 years ago, the catalog of human protein-coding genes remains a matter of controversy. Current catalogs list a total of approximately 24,500 putative protein-coding genes. It is broadly suspected that a large fraction of these entries are functionally meaningless ORFs present by chance in RNA transcripts, because they show no evidence of evolutionary conservation with mouse or dog. However, there is currently no scientific justification for excluding ORFs simply because they fail to show evolutionary conservation: the alternative hypothesis is that most of these ORFs are actually valid human genes that reflect gene innovation in the primate lineage or gene loss in the other lineages. Here, we reject this hypothesis by carefully analyzing the nonconserved ORFs, specifically their properties in other primates. We show that the vast majority of these ORFs are random occurrences. The analysis yields, as a by-product, a major revision of the current human catalogs, cutting the number of protein-coding genes to approximately 20,500. Specifically, it suggests that nonconserved ORFs should be added to the human gene catalog only if there is clear evidence of an encoded protein. It also provides a principled methodology for evaluating future proposed additions to the human gene catalog. 
Finally, the results indicate that there has been relatively little true innovation in mammalian protein-coding genes. Proc Natl Acad Sci U S A. 2007 Dec 4;104(49):19428-33. Epub 2007 Nov 26 . PMCID: PMC2148306 PMID: 18040051 24. RNA polymerase stalling at developmental control genes in the Drosophila melanogaster embryo. Zeitlinger, Stark, Kellis, Hong, Nechaev, Adelman, Levine, Young It is widely assumed that the key rate-limiting step in gene activation is the recruitment of RNA polymerase II (Pol II) to the core promoter. Although there are well-documented examples in which Pol II is recruited to a gene but stalls, a general role for Pol II stalling in development has not been established. We have carried out comprehensive Pol II chromatin immunoprecipitation microarray (ChIP-chip) assays in Drosophila embryos and identified three distinct Pol II binding behaviors: active (uniform binding across the entire transcription unit), no binding, and stalled (binding at the transcription start site). The notable feature of the approximately 10% genes that are stalled is that they are highly enriched for developmental control genes, which are either repressed or poised for activation during later stages of embryogenesis. We propose that Pol II stalling facilitates rapid temporal and spatial changes in gene activity during development. Nat Genet. 2007 Dec;39(12):1512-6. Epub 2007 Nov 11 . PMCID: PMC2824921 PMID: 17994019 23. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. 
Stark, Lin, Kheradpour, Pedersen, Parts, Carlson, Crosby, Rasmussen, Roy, Deoras, Ruby, Brennecke, Hodges, Hinrichs, Caspi, Paten, Park, Han, Maeder, Polansky, Robson, Aerts, van, Hassan, Gilbert, Eastman, Rice, Weir, Hahn, Park, Dewey, Pachter, Kent, Haussler, Lai, Bartel, Hannon, Kaufman, Eisen, Clark, Smith, Celniker, Gelbart, Kellis Sequencing of multiple related species followed by comparative genomics analysis constitutes a powerful approach for the systematic understanding of any genome. Here, we use the genomes of 12 Drosophila species for the de novo discovery of functional elements in the fly. Each type of functional element shows characteristic patterns of change, or 'evolutionary signatures', dictated by its precise selective constraints. Such signatures enable recognition of new protein-coding genes and exons, spurious and incorrect gene annotations, and numerous unusual gene structures, including abundant stop-codon readthrough. Similarly, we predict non-protein-coding RNA genes and structures, and new microRNA (miRNA) genes. We provide evidence of miRNA processing and functionality from both hairpin arms and both DNA strands. We identify several classes of pre- and post-transcriptional regulatory motifs, and predict individual motif instances with high confidence. We also study how discovery power scales with the divergence and number of species compared, and we provide general guidelines for comparative studies. Nature. 2007 Nov 8;450(7167):219-32 . PMCID: PMC2474711 PMID: 17994088 22. Evolution of genes and genomes on the Drosophila phylogeny. 
Drosophila, Clark, Eisen, Smith, Bergman, Oliver, Markow, Kaufman, Kellis, Gelbart, Iyer, Pollard, Sackton, Larracuente, Singh, Abad, Abt, Adryan, Aguade, Akashi, Anderson, Aquadro, Ardell, Arguello, Artieri, Barbash, Barker, Barsanti, Batterham, Batzoglou, Begun, Bhutkar, Blanco, Bosak, Bradley, Brand, Brent, Brooks, Brown, Butlin, Caggese, Calvi, Bernardo, Caspi, Castrezana, Celniker, Chang, Chapple, Chatterji, Chinwalla, Civetta, Clifton, Comeron, Costello, Coyne, Daub, David, Delcher, Delehaunty, Do, Ebling, Edwards, Eickbush, Evans, Filipski, Findeiss, Freyhult, Fulton, Fulton, Garcia, Gardiner, Garfield, Garvin, Gibson, Gilbert, Gnerre, Godfrey, Good, Gotea, Gravely, Greenberg, Griffiths-Jones, Gross, Guigo, Gustafson, Haerty, Hahn, Halligan, Halpern, Halter, Han, Heger, Hillier, Hinrichs, Holmes, Hoskins, Hubisz, Hultmark, Huntley, Jaffe, Jagadeeshan, Jeck, Johnson, Jones, Jordan, Karpen, Kataoka, Keightley, Kheradpour, Kirkness, Koerich, Kristiansen, Kudrna, Kulathinal, Kumar, Kwok, Lander, Langley, Lapoint, Lazzaro, Lee, Levesque, Li, Lin, Lin, Lindblad-Toh, Llopart, Long, Low, Lozovsky, Lu, Luo, Machado, Makalowski, Marzo, Matsuda, Matzkin, McAllister, McBride, McKernan, McKernan, Mendez-Lago, Minx, Mollenhauer, Montooth, Mount, Mu, Myers, Negre, Newfeld, Nielsen, Noor, O'Grady, Pachter, Papaceit, Parisi, Parisi, Parts, Pedersen, Pesole, Phillippy, Ponting, Pop, Porcelli, Powell, Prohaska, Pruitt, Puig, Quesneville, Ram, Rand, Rasmussen, Reed, Reenan, Reily, Remington, Rieger, Ritchie, Robin, Rogers, Rohde, Rozas, Rubenfield, Ruiz, Russo, Salzberg, Sanchez-Gracia, Saranga, Sato, Schaeffer, Schatz, Schlenke, Schwartz, Segarra, Singh, Sirot, Sirota, Sisneros, Smith, Smith, Spieth, Stage, Stark, Stephan, Strausberg, Strempel, Sturgill, Sutton, Sutton, Tao, Teichmann, Tobari, Tomimura, Tsolas, Valente, Venter, Venter, Vicario, Vieira, Vilella, Villasante, Walenz, Wang, Wasserman, Watts, Wilson, Wilson, Wing, Wolfner, Wong, Wong, Wu, Wu, Yamamoto, Yang, Yang, 
Yorke, Yoshida, Zdobnov, Zhang, Zhang, Zimin, Baldwin, Abdouelleil, Abdulkadir, Abebe, Abera, Abreu, Acer, Aftuck, Alexander, An, Anderson, Anderson, Arachi, Azer, Bachantsang, Barry, Bayul, Berlin, Bessette, Bloom, Blye, Boguslavskiy, Bonnet, Boukhgalter, Bourzgui, Brown, Cahill, Channer, Cheshatsang, Chuda, Citroen, Collymore, Cooke, Costello, D'Aco, Daza, De, DeGray, DeMaso, Dhargay, Dooley, Dooley, Doricent, Dorje, Dorjee, Dupes, Elong, Falk, Farina, Faro, Ferguson, Fisher, Foley, Franke, Friedrich, Gadbois, Gearin, Gearin, Giannoukos, Goode, Graham, Grandbois, Grewal, Gyaltsen, Hafez, Hagos, Hall, Henson, Hollinger, Honan, Huard, Hughes, Hurhula, Husby, Kamat, Kanga, Kashin, Khazanovich, Kisner, Lance, Lara, Lee, Lennon, Letendre, LeVine, Lipovsky, Liu, Liu, Liu, Lokyitsang, Lokyitsang, Lubonja, Lui, MacDonald, Magnisalis, Maru, Matthews, McCusker, McDonough, Mehta, Meldrim, Meneus, Mihai, Mihalev, Mihova, Mittelman, Mlenga, Montmayeur, Mulrain, Navidi, Naylor, Negash, Nguyen, Nguyen, Nicol, Norbu, Norbu, Novod, O'Neill, Osman, Markiewicz, Oyono, Patti, Phunkhang, Pierre, Priest, Raghuraman, Rege, Reyes, Rise, Rogov, Ross, Ryan, Settipalli, Shea, Sherpa, Shi, Shih, Sparrow, Spaulding, Stalker, Stange-Thomann, Stavropoulos, Stone, Strader, Tesfaye, Thomson, Thoulutsang, Thoulutsang, Topham, Topping, Tsamla, Vassiliev, Vo, Wangchuk, Wangdi, Weiand, Wilkinson, Wilson, Yadav, Young, Yu, Zembek, Zhong, Zimmer, Zwirko, Jaffe, Alvarez, Brockman, Butler, Chin, Gnerre, Grabherr, Kleber, Mauceli, MacCallum Comparative analysis of multiple genomes in a phylogenetic framework dramatically improves the precision and sensitivity of evolutionary inference, producing more robust results than single-genome analyses can provide. 
The genomes of 12 Drosophila species, ten of which are presented here for the first time (sechellia, simulans, yakuba, erecta, ananassae, persimilis, willistoni, mojavensis, virilis and grimshawi), illustrate how rates and patterns of sequence divergence across taxa can illuminate evolutionary processes on a genomic scale. These genome sequences augment the formidable genetic tools that have made Drosophila melanogaster a pre-eminent model for animal genetics, and will further catalyse fundamental research on mechanisms of development, cell biology, genetics, disease, neurobiology, behaviour, physiology and evolution. Despite remarkable similarities among these Drosophila species, we identified many putatively non-neutral changes in protein-coding genes, non-coding RNA genes, and cis-regulatory regions. These may prove to underlie differences in the ecology and behaviour of these diverse species. Nature. 2007 Nov 8;450(7167):203-18 . PMCID: PMC2919768 PMID: 17994087 21. Accurate gene-tree reconstruction by learning gene- and species-specific substitution rates across multiple complete genomes. Rasmussen, Kellis Comparative genomics provides a general methodology for discovering functional DNA elements and understanding their evolution. The availability of many related genomes enables more powerful analyses, but requires rigorous phylogenetic methods to resolve orthologous genes and regions. Here, we use 12 recently sequenced Drosophila genomes and nine fungal genomes to address the problem of accurate gene-tree reconstruction across many complete genomes. We show that existing phylogenetic methods that treat each gene tree in isolation show large-scale inaccuracies, largely due to insufficient phylogenetic information in individual genes. However, we find that gene trees exhibit common properties that can be exploited for evolutionary studies and accurate phylogenetic reconstruction. 
Evolutionary rates can be decoupled into gene-specific and species-specific components, which can be learned across complete genomes. We develop a phylogenetic reconstruction methodology that exploits these properties and achieves significantly higher accuracy, addressing the species-level heterotachy and enabling studies of gene evolution in the context of species evolution. Genome Res. 2007 Dec;17(12):1932-42. Epub 2007 Nov 7 . PMCID: PMC2099600 PMID: 17989260 20. Reliable prediction of regulator targets using 12 Drosophila genomes. Kheradpour, Stark, Roy, Kellis Gene expression is regulated pre- and post-transcriptionally via cis-regulatory DNA and RNA motifs. Identification of individual functional instances of such motifs in genome sequences is a major goal for inferring regulatory networks yet has been hampered due to the motifs' short lengths that lead to many chance matches and poor signal-to-noise ratios. In this paper, we develop a general methodology for the comparative identification of functional motif instances across many related species, using a phylogenetic framework that accounts for the evolutionary relationships between species, allows for motif movements, and is robust against missing data due to artifacts in sequencing, assembly, or alignment. We also provide a robust statistical framework for evaluating motif confidence, which enables us to translate evolutionary conservation into a confidence measure for each motif instance, correcting for varying motif length, composition, and background conservation of the target regions. We predict targets of fly transcription factors and miRNAs in alignments of 12 recently sequenced Drosophila species. When compared to extensive genome-wide experimental data, predicted targets are of high quality, matching and surpassing ChIP-chip microarrays and recovering miRNA targets with high sensitivity. 
The resulting regulatory network suggests significant redundancy between pre- and post-transcriptional regulation of gene expression. Genome Res. 2007 Dec;17(12):1919-31. Epub 2007 Nov 7 . PMCID: PMC2099599 PMID: 17989251 19. Systematic discovery and characterization of fly microRNAs using 12 Drosophila genomes. Stark, Kheradpour, Parts, Brennecke, Hodges, Hannon, Kellis MicroRNAs (miRNAs) are short regulatory RNAs that inhibit target genes by complementary binding in 3' untranslated regions (3' UTRs). They are one of the most abundant classes of regulators, targeting a large fraction of all genes, making their comprehensive study a requirement for understanding regulation and development. Here we use 12 Drosophila genomes to define structural and evolutionary signatures of miRNA hairpins, which we use for their de novo discovery. We predict 41 novel miRNA genes, which encompass many unique families, and 28 of which are validated experimentally. We also define signals for the precise start position of mature miRNAs, which suggest corrections of previously known miRNAs, often leading to drastic changes in their predicted target spectrum. We show that miRNA discovery power scales with the number and divergence of species compared, suggesting that such approaches can be successful in human as dozens of mammalian genomes become available. Interestingly, for some miRNAs sense and anti-sense hairpins score highly and mature miRNAs from both strands can indeed be found in vivo. Similarly, miRNAs with weak 5' end predictions show increased in vivo processing of multiple alternate 5' ends and have fewer predicted targets. Lastly, we show that several miRNA star sequences score highly and are likely functional. For mir-10 in particular, both arms show abundant processing, and both show highly conserved target sites in Hox genes, suggesting a possible cooperation of the two arms, and their role as a master Hox regulator. Genome Res. 2007 Dec;17(12):1865-79. Epub 2007 Nov 7 . 
PMCID: PMC2099594 PMID: 17989255 18. Evolution, biogenesis, expression, and target predictions of a substantially expanded set of Drosophila microRNAs. Ruby, Stark, Johnston, Kellis, Bartel, Lai MicroRNA (miRNA) genes give rise to small regulatory RNAs in a wide variety of organisms. We used computational methods to predict miRNAs conserved among Drosophila species and large-scale sequencing of small RNAs from Drosophila melanogaster to experimentally confirm and complement these predictions. In addition to validating 20 of our top 45 predictions for novel miRNA loci, the large-scale sequencing identified many miRNAs that had not been predicted. In total, 59 novel genes were identified, increasing our tally of confirmed fly miRNAs to 148. The large-scale sequencing also refined the identities of previously known miRNAs and provided insights into their biogenesis and expression. Many miRNAs were expressed in particular developmental contexts, with a large cohort of miRNAs expressed primarily in imaginal discs. Conserved miRNAs typically were expressed more broadly and robustly than were nonconserved miRNAs, and those conserved miRNAs with more restricted expression tended to have fewer predicted targets than those expressed more broadly. Predicted targets for the expanded set of microRNAs substantially increased and revised the miRNA-target relationships that appear conserved among the fly species. Insights were also provided into miRNA gene evolution, including evidence for emergent regulatory function deriving from the opposite arm of the miRNA hairpin, exemplified by mir-10, and even the opposite strand of the DNA, exemplified by mir-iab-4. Genome Res. 2007 Dec;17(12):1850-64. Epub 2007 Nov 7 . PMCID: PMC2099593 PMID: 17989254 17. Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes. 
Lin, Carlson, Crosby, Matthews, Yu, Park, Wan, Schroeder, Gramates, St, Roark, Wiley, Kulathinal, Zhang, Myrick, Antone, Celniker, Gelbart, Kellis The availability of sequenced genomes from 12 Drosophila species has enabled the use of comparative genomics for the systematic discovery of functional elements conserved within this genus. We have developed quantitative metrics for the evolutionary signatures specific to protein-coding regions and applied them genome-wide, resulting in 1193 candidate new protein-coding exons in the D. melanogaster genome. We have reviewed these predictions by manual curation and validated a subset by directed cDNA screening and sequencing, revealing both new genes and new alternative splice forms of known genes. We also used these evolutionary signatures to evaluate existing gene annotations, resulting in the validation of 87% of genes lacking descriptive names and identifying 414 poorly conserved genes that are likely to be spurious predictions, noncoding, or species-specific genes. Furthermore, our methods suggest a variety of refinements to hundreds of existing gene models, such as modifications to translation start codons and exon splice boundaries. Finally, we performed directed genome-wide searches for unusual protein-coding structures, discovering 149 possible examples of stop codon readthrough, 125 new candidate ORFs of polycistronic mRNAs, and several candidate translational frameshifts. These results affect 10% of annotated fly genes and demonstrate the power of comparative genomics to enhance our understanding of genome organization, even in a model organism as intensively studied as Drosophila melanogaster. Genome Res. 2007 Dec;17(12):1823-36. Epub 2007 Nov 7 . PMCID: PMC2099591 PMID: 17989253 16. Systematic discovery of regulatory motifs in conserved regions of the human genome, including thousands of CTCF insulator sites. 
Xie, Mikkelsen, Gnirke, Lindblad-Toh, Kellis, Lander Conserved noncoding elements (CNEs) constitute the majority of sequences under purifying selection in the human genome, yet their function remains largely unknown. Experimental evidence suggests that many of these elements play regulatory roles, but little is known about regulatory motifs contained within them. Here we describe a systematic approach to discover and characterize regulatory motifs within mammalian CNEs by searching for long motifs (12-22 nt) with significant enrichment in CNEs and studying their biochemical and genomic properties. Our analysis identifies 233 long motifs (LMs), matching a total of approximately 60,000 conserved instances across the human genome. These motifs include 16 previously known regulatory elements, such as the histone 3'-UTR motif and the neuron-restrictive silencer element, as well as striking examples of novel functional elements. The most highly enriched motif (LM1) corresponds to the X-box motif known from yeast and nematode. We show that it is bound by the RFX1 protein and identify thousands of conserved motif instances, suggesting a broad role for the RFX family in gene regulation. A second group of motifs (LM2*) does not match any previously known motif. We demonstrate by biochemical and computational methods that it defines a binding site for the CTCF protein, which is involved in insulator function to limit the spread of gene activation. We identify nearly 15,000 conserved sites that likely serve as insulators, and we show that nearby genes separated by predicted CTCF sites show markedly reduced correlation in gene expression. These sites may thus partition the human genome into domains of expression. Proc Natl Acad Sci U S A. 2007 Apr 24;104(17):7145-50. Epub 2007 Apr 18 . PMCID: PMC1852749 PMID: 17442748 15. Discrete small RNA-generating loci as master regulators of transposon activity in Drosophila. 
Brennecke, Aravin, Stark, Dus, Kellis, Sachidanandam, Hannon Drosophila Piwi-family proteins have been implicated in transposon control. Here, we examine piwi-interacting RNAs (piRNAs) associated with each Drosophila Piwi protein and find that Piwi and Aubergine bind RNAs that are predominantly antisense to transposons, whereas Ago3 complexes contain predominantly sense piRNAs. As in mammals, the majority of Drosophila piRNAs are derived from discrete genomic loci. These loci comprise mainly defective transposon sequences, and some have previously been identified as master regulators of transposon activity. Our data suggest that heterochromatic piRNA loci interact with potentially active, euchromatic transposons to form an adaptive system for transposon control. Complementary relationships between sense and antisense piRNA populations suggest an amplification loop wherein each piRNA-directed cleavage event generates the 5' end of a new piRNA. Thus, sense piRNAs, formed following cleavage of transposon mRNAs may enhance production of antisense piRNAs, complementary to active elements, by directing cleavage of transcripts from master control loci. Cell. 2007 Mar 23;128(6):1089-103. Epub 2007 Mar 8 . PMID: 17346786 14. Whole-genome ChIP-chip analysis of Dorsal, Twist, and Snail suggests integration of diverse patterning processes in the Drosophila embryo. Zeitlinger, Zinzen, Stark, Kellis, Zhang, Young, Levine Genetic studies have identified numerous sequence-specific transcription factors that control development, yet little is known about their in vivo distribution across animal genomes. We determined the genome-wide occupancy of the dorsoventral (DV) determinants Dorsal, Twist, and Snail in the Drosophila embryo using chromatin immunoprecipitation coupled with microarray analysis (ChIP-chip). The in vivo binding of these proteins correlates tightly with the limits of known enhancers. 
Our analysis predicts substantially more target genes than previous estimates, and includes Dpp signaling components and anteroposterior (AP) segmentation determinants. Thus, the ChIP-chip data uncover a much larger than expected regulatory network, which integrates diverse patterning processes during development. Genes Dev. 2007 Feb 15;21(4):385-90 . PMCID: PMC1804326 PMID: 17322397 13. Network motif discovery using motif enumeration and symmetry conditions Grochow, Kellis The study of biological networks and network motifs can yield significant new insights into systems biology. Previous methods of discovering network motifs - network-centric subgraph enumeration and sampling - have been limited to motifs of 6 to 8 nodes, revealing only the smallest network components. New methods are necessary to identify larger network sub-structures and functional motifs. Here we present a novel algorithm for discovering large network motifs that achieves these goals, based on a novel symmetry-breaking technique, which eliminates repeated isomorphism testing, leading to an exponential speed-up over previous methods. This technique is made possible by reversing the traditional network-based search at the heart of the algorithm to a motif-based search, which also eliminates the need to store all motifs of a given size and enables parallelization and scaling. Additionally, our method enables us to study the clustering properties of discovered motifs, revealing even larger network elements. We apply this algorithm to the protein-protein interaction network and transcription regulatory network of S. cerevisiae, and discover several large network motifs, which were previously inaccessible to existing methods, including a 29-node cluster of 15-node motifs corresponding to the key transcription machinery of S. cerevisiae. Lecture Notes in Bioinformatics, 2007 . 12. Genome sequence, comparative analysis and haplotype structure of the domestic dog. 
Lindblad-Toh, Wade, Mikkelsen, Karlsson, Jaffe, Kamal, Clamp, Chang, Kulbokas, Zody, Mauceli, Xie, Breen, Wayne, Ostrander, Ponting, Galibert, Smith, DeJong, Kirkness, Alvarez, Biagi, Brockman, Butler, Chin, Cook, Cuff, Daly, DeCaprio, Gnerre, Grabherr, Kellis, Kleber, Bardeleben, Goodstadt, Heger, Hitte, Kim, Koepfli, Parker, Pollinger, Searle, Sutter, Thomas, Webber, Baldwin, Abebe, Abouelleil, Aftuck, Ait-Zahra, Aldredge, Allen, An, Anderson, Antoine, Arachchi, Aslam, Ayotte, Bachantsang, Barry, Bayul, Benamara, Berlin, Bessette, Blitshteyn, Bloom, Blye, Boguslavskiy, Bonnet, Boukhgalter, Brown, Cahill, Calixte, Camarata, Cheshatsang, Chu, Citroen, Collymore, Cooke, Dawoe, Daza, Decktor, DeGray, Dhargay, Dooley, Dooley, Dorje, Dorjee, Dorris, Duffey, Dupes, Egbiremolen, Elong, Falk, Farina, Faro, Ferguson, Ferreira, Fisher, FitzGerald, Foley, Foley, Franke, Friedrich, Gage, Garber, Gearin, Giannoukos, Goode, Goyette, Graham, Grandbois, Gyaltsen, Hafez, Hagopian, Hagos, Hall, Healy, Hegarty, Honan, Horn, Houde, Hughes, Hunnicutt, Husby, Jester, Jones, Kamat, Kanga, Kells, Khazanovich, Kieu, Kisner, Kumar, Lance, Landers, Lara, Lee, Leger, Lennon, Leuper, LeVine, Liu, Liu, Lokyitsang, Lokyitsang, Lui, Macdonald, Major, Marabella, Maru, Matthews, McDonough, Mehta, Meldrim, Melnikov, Meneus, Mihalev, Mihova, Miller, Mittelman, Mlenga, Mulrain, Munson, Navidi, Naylor, Nguyen, Nguyen, Nguyen, Nguyen, Nicol, Norbu, Norbu, Novod, Nyima, Olandt, O'Neill, O'Neill, Osman, Oyono, Patti, Perrin, Phunkhang, Pierre, Priest, Rachupka, Raghuraman, Rameau, Ray, Raymond, Rege, Rise, Rogers, Rogov, Sahalie, Settipalli, Sharpe, Shea, Sheehan, Sherpa, Shi, Shih, Sloan, Smith, Sparrow, Stalker, Stange-Thomann, Stavropoulos, Stone, Stone, Sykes, Tchuinga, Tenzing, Tesfaye, Thoulutsang, Thoulutsang, Topham, Topping, Tsamla, Vassiliev, Venkataraman, Vo, Wangchuk, Wangdi, Weiand, Wilkinson, Wilson, Yadav, Yang, Yang, Young, Yu, Zainoun, Zembek, Zimmer, Lander Here we report a high-quality 
draft genome sequence of the domestic dog (Canis familiaris), together with a dense map of single nucleotide polymorphisms (SNPs) across breeds. The dog is of particular interest because it provides important evolutionary information and because existing breeds show great phenotypic diversity for morphological, physiological and behavioural traits. We use sequence comparison with the primate and rodent lineages to shed light on the structure and evolution of genomes and genes. Notably, the majority of the most highly conserved non-coding sequences in mammalian genomes are clustered near a small subset of genes with important roles in development. Analysis of SNPs reveals long-range haplotypes across the entire dog genome, and defines the nature of genetic diversity within and across breeds. The current SNP map now makes it possible for genome-wide association studies to identify genes responsible for diseases and traits, with important consequences for human and companion animal health. Nature. 2005 Dec 8;438(7069):803-19 . PMID: 16341006 11. Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals. Xie, Lu, Kulbokas, Golub, Mootha, Lindblad-Toh, Lander, Kellis Comprehensive identification of all functional elements encoded in the human genome is a fundamental need in biomedical research. Here, we present a comparative analysis of the human, mouse, rat and dog genomes to create a systematic catalogue of common regulatory motifs in promoters and 3' untranslated regions (3' UTRs). The promoter analysis yields 174 candidate motifs, including most previously known transcription-factor binding sites and 105 new motifs. The 3'-UTR analysis yields 106 motifs likely to be involved in post-transcriptional regulation. Nearly one-half are associated with microRNAs (miRNAs), leading to the discovery of many new miRNA genes and their likely target genes. 
Our results suggest that previous estimates of the number of human miRNA genes were low, and that miRNAs regulate at least 20% of human genes. The overall results provide a systematic view of gene regulation in the human, which will be refined as additional mammalian genomes become available. Nature. 2005 Mar 17;434(7031):338-45. Epub 2005 Feb 27 . PMCID: PMC2923337 PMID: 15735639 10. Large-scale discovery and validation of functional elements in the human genome. Bernstein, Kellis Computational and experimental genomics researchers convened at Cold Spring Harbor Laboratory at the end of 2004 to address the ambitious goal of identifying all the functional elements in the human genome. The functional elements discussed at the meeting included protein-coding genes, regulatory elements, RNA genes and DNA sequences that dictate chromosome structure or replication. The presentations described diverse approaches to the problem, ranging from innovative comparative genomic methods to high-throughput functional assays designed to identify and validate such elements. Genome Biol. 2005;6(3):312. Epub 2005 Mar 1 . PMCID: PMC1088940 PMID: 15774039 9. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Jaillon, Aury, Brunet, Petit, Stange-Thomann, Mauceli, Bouneau, Fischer, Ozouf-Costaz, Bernot, Nicaud, Jaffe, Fisher, Lutfalla, Dossat, Segurens, Dasilva, Salanoubat, Levy, Boudet, Castellano, Anthouard, Jubin, Castelli, Katinka, Vacherie, Biémont, Skalli, Cattolico, Poulain, De, Cruaud, Duprat, Brottier, Coutanceau, Gouzy, Parra, Lardier, Chapple, McKernan, McEwan, Bosak, Kellis, Volff, Guigó, Zody, Mesirov, Lindblad-Toh, Birren, Nusbaum, Kahn, Robinson-Rechavi, Laudet, Schachter, Quétier, Saurin, Scarpelli, Wincker, Lander, Weissenbach, Roest Tetraodon nigroviridis is a freshwater puffer fish with the smallest known vertebrate genome. 
Here, we report a draft genome sequence with long-range linkage and substantial anchoring to the 21 Tetraodon chromosomes. Genome analysis provides a greatly improved fish gene catalogue, including identifying key genes previously thought to be absent in fish. Comparison with other vertebrates and a urochordate indicates that fish proteins have diverged markedly faster than their mammalian homologues. Comparison with the human genome suggests approximately 900 previously unannotated human genes. Analysis of the Tetraodon and human genomes shows that whole-genome duplication occurred in the teleost fish lineage, subsequent to its divergence from mammals. The analysis also makes it possible to infer the basic structure of the ancestral bony vertebrate genome, which was composed of 12 chromosomes, and to reconstruct much of the evolutionary history of ancient and recent chromosome rearrangements leading to the modern human karyotype. Nature. 2004 Oct 21;431(7011):946-57 . PMID: 15496914 8. Transcriptional regulatory code of a eukaryotic genome. Harbison, Gordon, Lee, Rinaldi, Macisaac, Danford, Hannett, Tagne, Reynolds, Yoo, Jennings, Zeitlinger, Pokholok, Kellis, Rolfe, Takusagawa, Lander, Gifford, Fraenkel, Young DNA-binding transcriptional regulators interpret the genome's regulatory code by binding to specific sequences to induce or repress gene expression. Comparative genomics has recently been used to identify potential cis-regulatory sequences within the yeast genome on the basis of phylogenetic conservation, but this information alone does not reveal if or when transcriptional regulators occupy these binding sites. We have constructed an initial map of yeast's transcriptional regulatory code by identifying the sequence elements that are bound by regulators under various conditions and that are conserved among Saccharomyces species. 
The organization of regulatory elements in promoters and the environment-dependent use of these elements by regulators are discussed. We find that environment-specific use of regulatory elements predicts mechanistic models for the function of a large population of yeast's transcriptional regulators. Nature. 2004 Sep 2;431(7004):99-104 . PMID: 15343339 7. The changing face of genomics. Kellis The annual meeting on Advances in Genome Biology and Technology was very different this year - in contrast to previous years, only a handful of talks covered the latest large-scale sequencing projects and the next species to be sequenced. This meeting took for granted that we can sequence, assemble and align complete genomes - achievements that only a few years ago seemed daunting, if not unthinkable. The focus of the meeting has instead shifted towards the new challenges in genomics, particularly in the areas of gene regulation, cell dynamics and genome evolution. Genome Biol. 2004;5(5):324. Epub 2004 Apr 30 . 6. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Kellis, Birren, Lander Whole-genome duplication followed by massive gene loss and specialization has long been postulated as a powerful mechanism of evolutionary innovation. Recently, it has become possible to test this notion by searching complete genome sequence for signs of ancient duplication. Here, we show that the yeast Saccharomyces cerevisiae arose from ancient whole-genome duplication, by sequencing and analysing Kluyveromyces waltii, a related yeast species that diverged before the duplication. The two genomes are related by a 1:2 mapping, with each region of K. waltii corresponding to two regions of S. cerevisiae, as expected for whole-genome duplication. This resolves the long-standing controversy on the ancestry of the yeast genome, and makes it possible to study the fate of duplicated genes directly. 
Strikingly, 95% of cases of accelerated evolution involve only one member of a gene pair, providing strong support for a specific model of evolution, and allowing us to distinguish ancestral and derived functions. Nature. 2004 Apr 8;428(6983):617-24. Epub 2004 Mar 7 . PMID: 15004568 5. Methods in comparative genomics: genome correspondence, gene identification and regulatory motif discovery. Kellis, Patterson, Birren, Berger, Lander In Kellis et al. (2003), we reported the genome sequences of S. paradoxus, S. mikatae, and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genomewide comparative analysis allowed the identification of functionally important sequences, both coding and noncoding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. (1) We present methods for the automatic determination of genome correspondence. The algorithms enabled the automatic identification of orthologs for more than 90% of genes and intergenic regions across the four species despite the large number of duplicated genes in the yeast genome. The remaining ambiguities in the gene correspondence revealed recent gene family expansions in regions of rapid genomic change. (2) We present methods for the identification of protein-coding genes based on their patterns of nucleotide conservation across related species. We observed the pressure to conserve the reading frame of functional proteins and developed a test for gene identification with high sensitivity and specificity. We used this test to revisit the genome of S. cerevisiae, reducing the overall gene count by 500 genes (10% of previously annotated genes) and refining the gene structure of hundreds of genes. (3) We present novel methods for the systematic de novo identification of regulatory motifs. 
The methods do not rely on previous knowledge of gene function and in that way differ from the current literature on computational motif discovery. Based on genomewide conservation patterns of known motifs, we developed three conservation criteria that we used to discover novel motifs. We used an enumeration approach to select strongly conserved motif cores, which we extended and collapsed into a small number of candidate regulatory motifs. These include most previously known regulatory motifs as well as several noteworthy novel motifs. The majority of discovered motifs are enriched in functionally related genes, allowing us to infer a candidate function for novel motifs. Our results demonstrate the power of comparative genomics to further our understanding of any species. Our methods are validated by the extensive experimental knowledge in yeast and will be invaluable in the study of complex genomes like that of the human. J Comput Biol. 2004;11(2-3):319-55 . PMID: 15285895 4. Position specific variation in the rate of evolution in transcription factor binding sites. Moses, Chiang, Kellis, Lander, Eisen The binding sites of sequence specific transcription factors are an important and relatively well-understood class of functional non-coding DNAs. Although a wide variety of experimental and computational methods have been developed to characterize transcription factor binding sites, they remain difficult to identify. Comparison of non-coding DNA from related species has shown considerable promise in identifying these functional non-coding sequences, even though relatively little is known about their evolution. Here we analyse the genome sequences of the budding yeasts Saccharomyces cerevisiae, S. bayanus, S. paradoxus and S. mikatae to study the evolution of transcription factor binding sites. 
As expected, we find that both experimentally characterized and computationally predicted binding sites evolve slower than surrounding sequence, consistent with the hypothesis that they are under purifying selection. We also observe position-specific variation in the rate of evolution within binding sites. We find that the position-specific rate of evolution is positively correlated with degeneracy among binding sites within S. cerevisiae. We test theoretical predictions for the rate of evolution at positions where the base frequencies deviate from background due to purifying selection and find reasonable agreement with the observed rates of evolution. Finally, we show how the evolutionary characteristics of real binding motifs can be used to distinguish them from artefacts of computational motif finding algorithms. CONCLUSION: As has been observed for protein sequences, the rate of evolution in transcription factor binding sites varies with position, suggesting that some regions are under stronger functional constraint than others. This variation likely reflects the varying importance of different positions in the formation of the protein-DNA complex. The characterization of the pattern of evolution in known binding sites will likely contribute to the effective use of comparative sequence data in the identification of transcription factor binding sites and is an important step toward understanding the evolution of functional non-coding DNA. BMC Evol Biol. 2003 Aug 28;3:19. Epub 2003 Aug 28 . PMCID: PMC212491 PMID: 12946282 3. Phylogenetically and spatially conserved word pairs associated with gene-expression changes in yeasts. Chiang, Moses, Kellis, Lander, Eisen Transcriptional regulation in eukaryotes often involves multiple transcription factors binding to the same transcription control region, and to understand the regulatory content of eukaryotic genomes it is necessary to consider the co-occurrence and spatial relationships of individual binding sites. 
The determination of conserved sequences (often known as phylogenetic footprinting) has identified individual transcription factor binding sites. We extend this concept of functional conservation to higher-order features of transcription control regions. We used the genome sequences of four yeast species of the genus Saccharomyces to identify sequences potentially involved in multifactorial control of gene expression. We found 989 potential regulatory 'templates': pairs of hexameric sequences that are jointly conserved in transcription regulatory regions and also exhibit non-random relative spacing. Many of the individual sequences in these templates correspond to known transcription factor binding sites, and the sets of genes containing a particular template in their transcription control regions tend to be differentially expressed in conditions where the corresponding transcription factors are known to be active. The incorporation of word pairs to define sequence features yields more specific predictions of average expression profiles and more informative regression models for genome-wide expression data than considering sequence conservation alone. CONCLUSIONS: The incorporation of both joint conservation and spacing constraints of sequence pairs predicts groups of target genes that are specific for common patterns of gene expression. Our work suggests that positional information, especially the relative spacing between transcription factor binding sites, may represent a common organizing principle of transcription control regions. Genome Biol. 2003;4(7):R43. Epub 2003 Jun 26 . PMCID: PMC193630 PMID: 12844359 2. Sequencing and comparison of yeast species to identify genes and regulatory elements. Kellis, Patterson, Endrizzi, Birren, Lander Identifying the functional elements encoded in a genome is one of the principal challenges in modern biology. Comparative genomics should offer a powerful, general approach. 
Here, we present a comparative analysis of the yeast Saccharomyces cerevisiae based on high-quality draft sequences of three related species (S. paradoxus, S. mikatae and S. bayanus). We first aligned the genomes and characterized their evolution, defining the regions and mechanisms of change. We then developed methods for direct identification of genes and regulatory motifs. The gene analysis yielded a major revision to the yeast gene catalogue, affecting approximately 15% of all genes and reducing the total count by about 500 genes. The motif analysis automatically identified 72 genome-wide elements, including most known regulatory motifs and numerous new motifs. We inferred a putative function for most of these motifs, and provided insights into their combinatorial interactions. The results have implications for genome analysis of diverse organisms, including the human. Nature. 2003 May 15;423(6937):241-54 . PMID: 12748633 1. The genome sequence of the filamentous fungus Neurospora crassa. Galagan, Calvo, Borkovich, Selker, Read, Jaffe, FitzHugh, Ma, Smirnov, Purcell, Rehman, Elkins, Engels, Wang, Nielsen, Butler, Endrizzi, Qui, Ianakiev, Bell-Pedersen, Nelson, Werner-Washburne, Selitrennikoff, Kinsey, Braun, Zelter, Schulte, Kothe, Jedd, Mewes, Staben, Marcotte, Greenberg, Roy, Foley, Naylor, Stange-Thomann, Barrett, Gnerre, Kamal, Kamvysselis, Mauceli, Bielke, Rudd, Frishman, Krystofova, Rasmussen, Metzenberg, Perkins, Kroken, Cogoni, Macino, Catcheside, Li, Pratt, Osmani, DeSouza, Glass, Orbach, Berglund, Voelker, Yarden, Plamann, Seiler, Dunlap, Radford, Aramayo, Natvig, Alex, Mannhaupt, Ebbole, Freitag, Paulsen, Sachs, Lander, Nusbaum, Birren Neurospora crassa is a central organism in the history of twentieth-century genetics, biochemistry and molecular biology. Here, we report a high-quality draft sequence of the N. crassa genome. 
The approximately 40-megabase genome encodes about 10,000 protein-coding genes--more than twice as many as in the fission yeast Schizosaccharomyces pombe and only about 25% fewer than in the fruitfly Drosophila melanogaster. Analysis of the gene set yields insights into unexpected aspects of Neurospora biology including the identification of genes potentially associated with red light photobiology, genes implicated in secondary metabolism, and important differences in Ca2+ signalling as compared with plants and animals. Neurospora possesses the widest array of genome defence mechanisms known for any eukaryotic organism, including a process unique to fungi called repeat-induced point mutation (RIP). Genome analysis suggests that RIP has had a profound impact on genome evolution, greatly slowing the creation of new genes through genomic duplication and resulting in a genome with an unusually low proportion of closely related genes. Nature. 2003 Apr 24;422(6934):859-68 . PMID: 12712197 Conference Proceedings C01. Crust: A new Voronoi-Based Surface Reconstruction Algorithm Amenta, Bern, Kellis We describe our experience with a new algorithm for the reconstruction of surfaces from unorganized sample points in ℝ³. The algorithm is the first for this problem with provable guarantees. Given a "good sample" from a smooth surface, the output is guaranteed to be topologically correct and convergent to the original surface as the sampling density increases. The definition of a good sample is itself interesting: the required sampling density varies locally, rigorously capturing the intuitive notion that featureless areas can be reconstructed from fewer samples. The output mesh interpolates, rather than approximates, the input points. Our algorithm is based on the three-dimensional Voronoi diagram. Given a good program for this fundamental subroutine, the algorithm is quite easy to implement. ACM SIGGRAPH, v. 32, p. 415-421, Jul 19, 1998 . C02. 
Whole-Genome Comparative Annotation and Motif Discovery in Multiple Yeast Species Kellis, Patterson, Birren, Berger, Lander In Kellis et al. (2003) we reported the genome sequences of S. paradoxus, S. mikatae and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genome-wide comparative analysis allowed the identification of functionally important sequences, both coding and non-coding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. We developed methods for the automatic comparative annotation of the four species and the determination of orthologous genes and intergenic regions. The algorithms enabled the automatic identification of orthologs for more than 90% of genes despite the large number of duplicated genes in the yeast genome, and the discovery of recent gene family expansions and genome rearrangements. We also developed a test to validate computationally predicted protein-coding genes based on their patterns of nucleotide conservation. The method has high specificity and sensitivity, and enabled us to revisit the current annotation of S. cerevisiae with important biological implications. We developed statistical methods for the systematic de-novo identification of regulatory motifs. Without making use of coregulated gene sets, we discovered virtually all previously known DNA regulatory motifs as well as several noteworthy novel motifs. With the additional use of gene ontology information, expression clusters and transcription factor binding profiles, we assigned candidate functions to the novel motifs discovered. Our results demonstrate that entirely automatic genome-wide annotation, gene validation, and discovery of regulatory motifs is possible. Our findings are validated by the extensive experimental knowledge in yeast, confirming their applicability to other genomes. ACM RECOMB, p. 157-166, Apr 13, 2003 . C03. 
Phylogenetically and Spatially Conserved Word Pairs Associated with Gene-Expression Changes in Yeasts Chiang, Moses, Kellis, Lander, Eisen Transcriptional regulation in eukaryotes is often multifactorial, involving multiple transcription factors binding to the same transcription control region (e.g., upstream activating sequences and enhancers), and to understand the regulatory content of eukaryotic genomes it is necessary to consider the co-occurrence and spatial relationships of individual binding sites. The identification of sequences conserved among related species (often known as phylogenetic footprinting) has been successfully used to identify individual transcription factor binding sites. Here, we extend this concept of functional conservation to higher-order features of transcription control regions involved in the multifactorial control of gene expression. We used the genome sequences of four yeast species of the genus Saccharomyces to identify sequences potentially involved in multifactorial control of gene expression. We found 1,117 potential regulatory "templates": pairs of hexameric sequences that are jointly conserved in transcription regulatory regions and also exhibit non-random relative spacing. Many of the individual sequences in these templates correspond to known transcription factor binding sites, and the sets of genes containing a particular template in their transcription control regions tend to be differentially expressed in conditions where the corresponding transcription factors are known to be active. The incorporation of both joint conservation and spacing constraints of sequence pairs predicts groups of target genes that are specific for common patterns of gene expression. Our work suggests that positional information, especially the relative spacing between transcription factor binding sites, may represent a common organizing principle of transcription control regions. ACM RECOMB, p. 84-93, Apr 13, 2003 . C04. 
Network motif discovery using subgraph enumeration and symmetry breaking Grochow, Kellis The study of biological networks and network motifs can yield significant new insights into systems biology. Previous methods of discovering network motifs - network-centric subgraph enumeration and sampling - have been limited to motifs of 6 to 8 nodes, revealing only the smallest network components. New methods are necessary to identify larger network sub-structures and functional motifs. Here we present a novel algorithm for discovering large network motifs that achieves these goals, based on a novel symmetry-breaking technique, which eliminates repeated isomorphism testing, leading to an exponential speed-up over previous methods. This technique is made possible by reversing the traditional network-based search at the heart of the algorithm to a motif-based search, which also eliminates the need to store all motifs of a given size and enables parallelization and scaling. Additionally, our method enables us to study the clustering properties of discovered motifs, revealing even larger network elements. We apply this algorithm to the protein-protein interaction network and transcription regulatory network of S. cerevisiae, and discover several large network motifs, which were previously inaccessible to existing methods, including a 29-node cluster of 15-node motifs corresponding to the key transcription machinery of S. cerevisiae. ACM RECOMB, p. 92-106, Apr 21, 2007 . C05. Information-Theoretic Inference of Regulatory Networks Using Backward Elimination Meyer, Marbach, Roy, Kellis Unraveling transcriptional regulatory networks is essential for understanding and predicting cellular responses in different developmental and environmental contexts. Information-theoretic methods of network inference have been shown to produce high-quality reconstructions because of their ability to infer both linear and non-linear dependencies between regulators and targets. 
In this paper, we introduce MRNETB, an improved version of the previous information-theoretic algorithm, MRNET, which has competitive performance with state-of-the-art algorithms. MRNET infers a network by using a forward selection strategy to identify a maximally-independent set of neighbors for every variable. However, a known limitation of algorithms based on forward selection is that the quality of the selected subset strongly depends on the first variable selected. In this paper, we present MRNETB, an improved version of MRNET that overcomes this limitation by using a backward selection strategy followed by a sequential replacement. Our new variable selection procedure can be implemented with the same computational cost as the forward selection strategy. MRNETB was benchmarked against MRNET and two other information-theoretic algorithms, CLR and ARACNE. Our benchmark comprised 15 datasets generated from two regulatory network simulators, 10 of which are from the DREAM4 challenge, which was recently used to compare over 30 network inference methods. To assess stability of our results, each method was implemented with two estimators of mutual information. Our results show that MRNETB has significantly better performance than MRNET, irrespective of the mutual information estimation method. MRNETB also performs comparably to CLR and significantly better than ARACNE, indicating that our new variable selection strategy can successfully infer high-quality networks. BioComp10, 12 pages, July 13, 2010 . Book Chapters B01. Gene finding using multiple related species: a classification approach Kellis Ideally, we should be able to systematically discover all the functional genes in a newly sequenced genome from its sequence alone. Computational discovery methods rely both on the direct signals used by the cell to guide transcription, splicing, and translation, and also on indirect signals such as evolutionary conservation. 
In this paper, we summarize the principles of a classification-based approach for systematic gene identification, on the basis of comparative sequence information from multiple, closely related species. We first frame gene identification as a classification problem of distinguishing real genes from spurious gene predictions. We then present the Reading Frame Conservation (RFC) test, a new computational method implementing such a classification approach, on the basis of the patterns of nucleotide changes in the alignment of orthologous regions. We finally summarize our results of applying this method to reannotate the yeast genome, and the challenges of using related methods to discover all functional genes in the human genome. Encyclopedia of Genetics, Genomics, Proteomics, John Wiley & Sons, 7 pages, July 15, 2005 . B02. A Phylogenomic Approach to the Evolutionary Dynamics of Gene Duplication in Birds Organ, Rasmussen, Baldwin, Kellis, Edwards Gene duplication is a fundamental aspect of genome evolution that produces large and small gene families of (usually) related function. We perform a phylogenomic analysis of gene duplication in the chicken (Gallus gallus) to characterize the dynamics and evolution of gene duplication on the evolutionary line to birds. In Gallus, the distribution of the number of paralogs per gene family is heavily skewed towards small families. This finding is in accord with other studies that find gene family size typically follows a power-law distribution in animals, a pattern thought to be produced by differential rates of pseudogenization among families. We also test for within-family evolutionary rate variation in Gallus, finding that the vast majority of gene families exhibit substantial rate variation among lineages. 
This rate variation probably stems from two sources: natural deviations in the clock as commonly found, for example, in phylogenetic analyses of different species; and bursts of adaptive evolution among newly evolved gene family members. The ages of gene duplications in Gallus are distributed exponentially, with most duplications occurring recently, a pattern consistent with analyses on other eukaryotes. Taken together, these results begin to reveal the dynamics of gene family evolution in birds, the most speciose group of living amniotes, though whole genome data are required from more bird and reptile species to fully understand patterns of gene gain and loss in this group.
Evolution after Gene Duplication, Wiley-Blackwell, 17 pages, October 13, 2010.

Theses and Internal Reports

M01. Imagina: Sketch-based Image Retrieval using Cognitive Abstraction
Kellis, Marina
As digital media become more popular, corporations and individuals gather an increasingly large number of digital images. As a collection grows to more than a few hundred images, the need for search becomes crucial. This thesis addresses the problem of retrieving from a small database a particular image previously seen by the user. It combines current findings in cognitive science with the knowledge of previous image retrieval systems to present a novel approach to content-based image retrieval and indexing. We focus on algorithms which abstract away information from images in the same terms that a viewer abstracts information from an image. The focus in Imagina is on the matching of regions, instead of the matching of global measures. Multiple representations, focusing on shape and color, are used for every region. The matches of individual regions are combined using a saliency metric that accounts for differences in the distributions of metrics. Region matching along with configuration determines the overall match between a query and an image.
Master's Thesis, MIT Libraries, 157 pages, May 20, 1999.

M02. Computational Comparative Genomics: Genes, Regulation, Evolution
Kellis
Understanding the biological signals encoded in a genome is a key challenge of computational biology. These signals are encoded in the four-nucleotide alphabet of DNA and are responsible for all molecular processes in the cell. In particular, the genome contains the blueprint of all protein-coding genes and the regulatory motifs used to coordinate the expression of these genes. Comparative genome analysis of related species provides a general approach for identifying these functional elements, by virtue of their stronger conservation across evolutionary time. In this thesis we address key issues in the comparative analysis of multiple species. We present novel computational methods in four areas: (1) the automatic comparative annotation of multiple species and the determination of orthologous genes and intergenic regions; (2) the validation of computationally predicted protein-coding genes; (3) the systematic de-novo identification of regulatory motifs; (4) the determination of combinatorial interactions between regulatory motifs. We applied these methods to the comparative analysis of four yeast genomes, including the best-studied eukaryote, Saccharomyces cerevisiae or baker's yeast. Our results show that nearly a tenth of currently annotated yeast genes are not real, and have refined the structure of hundreds of genes. Additionally, we have automatically discovered a dictionary of regulatory motifs without any previous biological knowledge. These include most previously known regulatory motifs, and a number of novel motifs. We have automatically assigned candidate functions to the majority of motifs discovered, and defined biologically meaningful combinatorial interactions between them. Finally, we defined the regions and mechanisms of rapid evolution, with important biological implications.
Our results demonstrate the central role of computational tools in modern biology. The analyses presented in this thesis have revealed biological findings that could not have been discovered by traditional genetic methods, regardless of the time or effort spent. The methods presented are general and may present a new paradigm for understanding the genome of any single species. They are currently being applied to a kingdom-wide exploration of fungal genomes, and the comparative analysis of the human genome with that of the mouse and other mammals.
Ph.D. Thesis, MIT Libraries, 100 pages, May 25, 2003.

I01. WebBot: Constraint Model for a Web Robot
Kellis, Nielsen
The W3C robot's purpose in crawling the web is not to accumulate data from web sites, to generate a database or an index, to update documents, or to verify links, even though all of this functionality can easily be added without changing the basic skeleton of the robot as it is today. The primary purpose of the robot is to understand the relations between web pages, and not the content of the pages themselves. It asks questions such as: how many paths can lead to a given page, what tree structures can be generated by following links under a certain constraint, how many degrees of separation lie between any two pages, what is the shortest path to reach a site, and how can intelligent navigation be automated? The robot will also be used as a test tool interacting with the w3c-libwww-99. The robot code is written in Tcl, the graphic interface to the user is written in Tk, and the library code used to provide all the network functionality is written in C. All functionality is therefore portable to almost any system with network access. The most elementary web robot can fetch a web document, extract the links from that document, follow the URLs, fetch the documents, and restart for every document. This leads to an exponentially growing number of documents to follow every time you reach a new depth.
Therefore, for a robot to be of any use, we need a way of specifying a Constraint Model that will allow the robot to choose which documents to follow and which not to. The Constraint Model is basically a set of rules, which the user specifies or which the program itself can formulate, to follow a crawling model specified by the user at the beginning of, or during, the crawl. The next section describes the constituent blocks of the rules (or Rule Atoms) and the ways to combine those rules.
World Wide Web Consortium, Aug 1996.

I02. EciMorph: Curve Morphing in Extended Gaussian Space
Kellis
Morphing techniques that have been developed to progressively transform one two-dimensional image to another are usually pixel based and morph a source to a target by interpolating pixel values based on constraints specified by the user. In this work, we use a model-based approach to propose a solution to morphing one two-dimensional polygon into another. Our model of an object will be its Extended Circular Image (ECI) representation, the two-dimensional equivalent of the Extended Gaussian Image. This approach will work only for convex polygons, for which the ECI is unique. To morph non-convex polygons, one would have to separate them into convex components and morph each separately, or use an extended version of the ECI that would apply to non-convex polygons. This morphing method can be applied to convex polyhedra using the Extended Gaussian Image, and similarly be extended to non-convex polyhedra.
MIT Machine Vision Group, Dec 1997.

I03. 3DMorph: Polygon Model Morphing
Kellis, Gajos, Blum, Vassef
Morphing is an interpolation technique used to create from two objects a series of intermediate objects that change continuously to make a smooth transition from the source to the target. Morphing has been done in two dimensions by varying the values of the pixels of one image to make a different image, or in three dimensions by varying the values of three-dimensional pixels.
We present here a new type of morphing, which transforms the geometry of three-dimensional models, creating intermediate objects which are all clearly defined three-dimensional objects that can be translated, rotated, scaled, and zoomed into.
MIT Lab for Computer Science, Dec 1997.

I04. RoboLogo: Programming Environment for Interactive Robots
Kellis, Lueck, Rohrs
RoboLogo is a system that enables children to program interactive robots. Children can program a robotic truck that interacts with the environment without having to deal with low-level implementation details. Using iLogo, a high-level interactive language, the children can easily describe the robot's actions in the environment and just as easily program the robot's reactions to external events, such as encountering an obstacle. For example, to program a robot that bounces between two walls, the user need only write: do.while (or readsensor fl readsensor fr) do.unless (or readsensor bl readsensor br) ] TRUE This iLogo program is parsed and translated to assembly code. The resulting A51 file is linked with the low-level RoboLogo library routines and compiled into machine code. The Intel HEX file thus generated is downloaded onto the RoboLogo truck via the serial port. Unaware of the details, the child can now disconnect the programmed robot, watch it run, and interact with it. The goal of our 6.115 project was to enable this transition from the abstract description of the robot's behavior to an actual moving robot. This involved managing the mechanics and sensors of a robotic truck, laying out the electronics to control it, writing the assembly routines that coordinate its actions and sensors, developing the iLogo language, and implementing the compiler that translates iLogo to assembly code.
MIT Microprocessor Lab, Dec 1998.

I05. Mood: Music Classification using Patterns of Attentional State
Kellis, Marina, Vassef, Carter
To understand a pattern, you have to be in the right mood, the right attentional state. We constructed a pattern recognition system that modelled the state transitions of the human brain, as described by Seymour Ullman and implemented by Sajit Rao for the visual cortex. We used the same states that Sajit used for vision, namely:
Select Location
Focus attention on new location
Establish Properties at the new attentional state
Learn low-level patterns and patterns of state changes.
What makes the system able to generalize from one music piece to an entire category is the one thing that remains constant within a class of music compositions: namely, the pattern of changes in attentional state.
MIT AI Lab, May 1998.

I06. Invest: Applications of AI to Stock Market Prediction
Kellis, Mehrotra, Warshawsky
This paper describes our attempt at implementing a stock price prediction system using various AI techniques. We found relatively poor performance across the gamut of techniques attempted. We conclude that stock prices are a random walk across the general mean of the system. Therefore, we propose that it is virtually impossible to predict with a high degree of certainty how a stock price will perform in the future.
AI in Practice Project, May 1998.

I07. Handwritten character recognition using wavelets
Kellis
This paper presents our experience with a curvature space for online character recognition. The (x,y) coordinate data along the character curve is transformed to local curvature data. The peaks of this curvature represent the sharp turns made by the pen in constructing the character. The height and timestamp of these peaks are used as features. The number of peaks for a single character rarely exceeds five or six. Thus, we have transformed the 256-dimensional data of off-line recognition on a 16x16 grid to less than 10-dimensional data.
This aggressive reduction in dimensionality is certainly lossy. However, this paper makes the claim that the curvature peaks preserve the information most important to character structure. To demonstrate the claim, we construct a classifier for handwritten digits that bases its decisions solely on the few peaks in curvature space. The results are promising: 200 digits of each character were correctly classified based on an HMM mixture model constructed from 300 examples of each digit. Unsupervised learning techniques are also envisioned. The character data plotted in the lower-dimensional space suggest that distinct clusters emerge, which could separate and group trained symbols as well as recognize new symbols as categories.
MIT AI Lab, Dec 1999.
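The feature extraction described in I07 (pen coordinates, then local curvature, then peak height and timestamp) can be illustrated with a minimal Python sketch. This is not the report's code: it assumes the turning angle between consecutive pen segments as a stand-in for local curvature and the sample index as the timestamp, and the names `turning_angles`, `curvature_peaks`, and `min_height` are hypothetical. The HMM classifier is not shown.

```python
import math

def turning_angles(points):
    """Signed turning angle (radians) at each interior sample of a pen stroke."""
    angles = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        a_in = math.atan2(y1 - y0, x1 - x0)    # direction of incoming segment
        a_out = math.atan2(y2 - y1, x2 - x1)   # direction of outgoing segment
        # Wrap the difference into [-pi, pi)
        angles.append((a_out - a_in + math.pi) % (2 * math.pi) - math.pi)
    return angles

def curvature_peaks(points, min_height=0.5):
    """Return (sample_index, height) for local maxima of |turning angle|.

    min_height is an assumed threshold that suppresses small wiggles.
    """
    k = [abs(a) for a in turning_angles(points)]
    peaks = []
    for i, v in enumerate(k):
        left = k[i - 1] if i > 0 else 0.0
        right = k[i + 1] if i + 1 < len(k) else 0.0
        if v >= min_height and v > left and v >= right:
            peaks.append((i + 1, v))  # i + 1 is the index into `points`
    return peaks

# An "L"-shaped stroke: straight down, then straight right (one sharp corner).
stroke = [(0, 3), (0, 2), (0, 1), (0, 0), (1, 0), (2, 0), (3, 0)]
features = curvature_peaks(stroke)
print(features)
```

The single peak falls at the corner sample, so a seven-point stroke is reduced to one (index, height) pair, which is the kind of dimensionality reduction the abstract describes.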
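Category: ENCODE | 4022 views | 1 comment
[Repost] MATLAB-related
enthueric 2012-8-20 12:41
Saving figures in MATLAB: http://s400.blog.163.com/blog/static/174290893201121795724998/
0 comments
[Repost] Online article plagiarized; a DXY (丁香园) forum user asks for help
windlight 2012-7-25 17:14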
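http://www.dxy.cn/bbs/thread/23502103#23502103

happyjyl wrote:

My post "The Role of the Registration Department in Drug R&D" (《注册部门在药物研发中的作用》), first published in July 2010, was plagiarized by Zhu Jing (朱晶), Chen Jing (陈静), and Bi Shuang (毕爽) of Shenyang Xingqi Pharmaceutical Co., Ltd. (沈阳兴齐制药有限公司). After superficial rewording it was published under the title "The Role of Drug Registration in New Drug R&D" (《药品注册在新药研发中的作用》) in 《北方药学》, 2011, Vol. 8, No. 4, and was claimed to have been written "on the basis of bits of experience and insight accumulated over many years of new drug R&D and registration work."

Comparing "The Role of Drug Registration in New Drug R&D" with my original post, 60% of the content is extremely similar, and the last paragraph is the most obvious case. My original text reads: "As registration departments take part more and more in early-stage drug development, registration staff play an increasingly important role in it. Although different companies place different demands on registration staff because of differences in scale and product line, in general, doing registration work well requires rich and deep professional knowledge, smooth and tactful communication skills, and a dedicated and responsible attitude to work, which is true of any job. A job can be very simple or very complex. This often depends not on what kind of company or position you are in, but on what level you hope to reach and how much effort you are willing to put in. Registration is a job with a low threshold and high demands; it is very challenging, but it also offers broad room for growth. Colleagues in pharmacy who are interested are welcome to join the ranks of drug registration!" The part in bold was copied verbatim by Zhu Jing, Chen Jing, and Bi Shuang.

My original post "The Role of the Registration Department in Drug R&D" was first entered in the third original-content competition of 仪器信息网 (instrument.com.cn) and was published there in July 2010. After the competition ended, in November 2010 I posted it to the DXY "New Drugs and Information" board and to the SFDAIED forum "Drug Application and Registration" board. On DXY it was recommended as a featured post, and on the SFDAIED forum the moderator awarded it 3 prestige points. In my forum posts I have always followed the principle of learning and exchange: I never set a points restriction on my posts, so newcomers without points can read them too. I never charge points for the material I upload; the hundred-plus documents I uploaded to 仪器信息网 have always been free to download, and the Excel sheet "Professional English" (《专业英语》), which I have compiled continuously over nearly seven years of drug registration work, has been downloaded more than 5000 times on the SFDAIED forum alone, also for free. I do this simply because I was once a newcomer too, and once had to rack my brains padding posts to earn points for downloads. I understand people who, to save trouble, repost without crediting the original author, and I understand people who, to earn points, ignore the original author's request and set a reposted file as paid material, but plagiarism this blatant is something I never expected. If this can be tolerated, what cannot?

If even a publicly published academic paper can be faked, how credible can the registration dossiers written by these people be? I would rather believe that Zhu Jing, Chen Jing, and Bi Shuang are merely black sheep at Shenyang Xingqi Pharmaceutical, and that the fraud is limited to this published paper.

I attach the addresses of the original post and the reworded article by Zhu Jing, Chen Jing, and Bi Shuang for comparison. I respectfully ask the editors of 《北方药学》 to establish the facts and to publish a statement in 《北方药学》 clarifying them. Thank you!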
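Addresses of the original post "The Role of the Registration Department in Drug R&D" (《注册部门在药物研发中的作用》):
仪器信息网: http://bbs.instrument.com.cn/shtml/20100718/2668966/
DXY (丁香园): http://www.dxy.cn/bbs/topic/18687632
SFDAIED forum: http://bbs.sdatc.com/read.php?tid=207108

The reworded article "The Role of Drug Registration in New Drug R&D" (《药品注册在新药研发中的作用》) by Zhu Jing, Chen Jing, and Bi Shuang (a PDF of this article has been uploaded as an attachment to the three forums above):
http://www.cnki.com.cn/Article/CJFDTotal-BFYX201104046.htm

Appendix: the following are technical discussion posts I have published on various forums. Comparing "The Role of the Registration Department in Drug R&D" with them confirms, in both style and writing, that I am its author.
An attempt to interpret "Principles and Procedures for the Technical Review of Drugs": http://www.dxy.cn/bbs/topic/19730492
An illustrated guide to the CDE e-publication "A Brief Discussion of Pharmaceutical Research on Chiral Drugs": http://www.dxy.cn/bbs/topic/21760578
"Provisions on the Administration of Registration for Drug Technology Transfer", translation and discussion: http://bbs.instrument.com.cn/shtml/20090906/2098176/
EP 6.0 mesalazine monograph, translation and discussion: http://bbs.instrument.com.cn/shtml/20081104/1566105/
"Application Form for Imported Drug Registration", translation and discussion: http://bbs.instrument.com.cn/shtml/20080903/1460891/
"Reference Requirements for On-site Inspection and Sampling", translation and discussion: http://bbs.instrument.com.cn/shtml/20071203/1079444/
Drug certificate, general guidance, and responsible pharmacist's undertaking, translation and discussion: http://bbs.instrument.com.cn/shtml/20070708/902917/
"FDA Draft Guidance on Orally Disintegrating Tablets and Discussion", translation and discussion: http://bbs.instrument.com.cn/shtml/20070601/860826/
Summary of common sentence patterns for HPLC: http://bbs.instrument.com.cn/shtml/20070328/785502/
"Thoughts on Research into Changing the Dosage Form of Traditional Chinese Medicines (II)", translation and discussion: http://bbs.instrument.com.cn/shtml/20070311/764023/
"Notice on Further Strengthening the Administration of Compound Preparations Containing Ephedrine", translation and discussion: http://bbs.sdatc.com/read.php?tid=134622
Growing through the pain: reflections on eight years of pharmacy work: http://www.dxy.cn/bbs/topic/21519624
A new starting point, a new beginning: impressions of five years of registration work: http://www.dxy.cn/bbs/topic/18687574
Working hard in the present, keeping ambition for the future: reflections on four years of registration work: http://bbs.instrument.com.cn/Topic.asp?threadid=2199718
Nothing is trivial in registration: a review of three years of registration work: http://bbs.instrument.com.cn/Topic.asp?threadid=1573903
Summary of thoughts after six months in project management: http://www.dxy.cn/bbs/topic/22220881
Summary of thoughts after four months in project management: http://www.dxy.cn/bbs/topic/21896724
Summary of thoughts after three months in project management: http://www.dxy.cn/bbs/topic/21896709

Posts with free material I have uploaded:
Pharmaceutical English I accumulated in drug registration work: http://bbs.sdatc.com/read.php?tid=121538
FDA guidance translations, first batch: http://bbs.instrument.com.cn/shtml/20090925/2126918/
FDA guidance translations, second batch: http://bbs.instrument.com.cn/shtml/20100610/2604964/
Download the material I uploaded for zero points: http://bbs.instrument.com.cn/shtml/20070906/971296/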
Category: Uncategorized | 3505 views | 0 comments
Poisson random numbers
huangyanxin356 2012-4-12 13:49
poissrnd - Poisson random numbers

Syntax

    R = poissrnd(lambda)
    R = poissrnd(lambda,m,n,...)
    R = poissrnd(lambda,[m,n,...])

Description

R = poissrnd(lambda) generates random numbers from the Poisson distribution with mean parameter lambda. lambda can be a vector, a matrix, or a multidimensional array. The size of R is the size of lambda.

R = poissrnd(lambda,m,n,...) or R = poissrnd(lambda,[m,n,...]) generates an m-by-n-by-... array. The lambda parameter can be a scalar or an array of the same size as R.

Examples

Generate a random sample of 10 pseudo-observations from a Poisson distribution with λ = 2.

    lambda = 2;

    random_sample1 = poissrnd(lambda,1,10)
    random_sample1 =
         1     0     1     2     1     3     4     2     0     0

    random_sample2 = poissrnd(lambda,[1 10])
    random_sample2 =
         1     1     1     5     0     3     2     2     3     4

    random_sample3 = poissrnd(lambda(ones(1,10)))
    random_sample3 =
         3     2     1     1     0     0     4     0     2     0
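For readers without MATLAB, a rough Python analogue of the poissrnd(lambda,m,n) call can be sketched with the standard library alone. This is an illustration, not MATLAB's implementation: it assumes Knuth's multiplication method (adequate for small lambda), handles a scalar lambda only, and the names `poisson_sample` and `poissrnd_like` are hypothetical.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    # Multiply uniforms until the product drops to exp(-lam) or below;
    # the number of extra factors needed is the Poisson count.
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def poissrnd_like(lam, m, n, seed=None):
    """Return an m-by-n nested list of Poisson(lam) draws (scalar lam only)."""
    rng = random.Random(seed)
    return [[poisson_sample(lam, rng) for _ in range(n)] for _ in range(m)]

# Ten pseudo-observations with lambda = 2, as in the example above.
sample = poissrnd_like(2, 1, 10, seed=0)
print(sample)
```

The draws will not match the MATLAB output above, since the generators differ; only the distribution is the same.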
Category: Technical | 161 views | 0 comments
[Repost] Young Scientists in Love [Science]
yhy188 2012-2-20 10:13
http://www.sciencemag.org/content/335/6070/780.1.full.pdf

Lonely Chinese researchers isolated by shyness and long lab hours now have an online dating service designed just for them. Building 88 (http://www.sciencedate.cn/) aims to become a “soul harbor” for young Chinese scientists, its Web site explains. Spearheaded by Science Times Media Group, which also runs the popular Chinese-language news portal Sciencenet.cn, the service takes its name from a fabled Beijing dormitory that became a meet-up spot for Chinese Academy of Sciences researchers in the 1990s.

Today’s scientists lack such gathering places, says site coordinator Wu Hao, and thus have a “more intense need” for social interaction than Chinese pursuing other careers. “The social circles of young Chinese scientists are often limited to other people in their field,” Wu explains.

That may be no accident. A questionnaire Science Times distributed to 1243 young scientists revealed roughly 70% of highly educated respondents suffer from social anxiety, Wu says. Building 88 broadens the pool of potential paramours for introverts to include Chinese researchers both within China and all over the world.

Since its launch in January, the site has grown to 1000 users, most of them between the ages of 20 and 35. Whether an online portal can help draw them out of their shells is still an experiment in progress; many scientists have unusually high standards, Wu notes. And as one 31-year-old Beijing scientist puts it on his Building 88 profile: “Dating is just like scientific research: Only when you’re excited about it do you get results.”
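2318 views | 0 comments
Interesting homepages
wjwzl 2012-2-2 14:37
A record of the interesting personal homepages I have collected over time, to make them easier to recall and to spark ideas. http://people.sc.fsu.edu/~jburkardt/
Category: geeker favorate | 1779 views | 0 comments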
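Local bond average 1: elasticity and vibration of nanostructured titanium oxide under varying temperature, pressure, and size
Popularity 1 ecqsun 2011-11-6 05:17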
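1. Dimensional analysis shows that the elastic modulus of a solid is proportional to the energy density (in Pa): $B \sim E/d^{3}$;
2. A Taylor expansion of the crystal potential shows that the Raman shift depends on the bond order $z$, bond length $d$, bond energy $E$, and reduced mass $m$: $\Delta\omega \sim (z/d)\,(E/m)^{0.5}$;
3. For a given material, the elastic modulus and the Raman shift are related through $(Bd)^{0.5} \propto \Delta\omega$, that is, the square root of the stiffness;
4. Changes in temperature, pressure, and solid size directly affect the bond length and bond energy, and thereby change the elastic modulus and the phonon frequency;
5. Varying the temperature allows the Debye temperature and the single-atom cohesive energy to be measured;
6. Varying the pressure allows the compressibility and the energy density to be measured;
7. Varying the particle size allows the relevant property parameters and the corresponding bulk values to be measured;
8. The results show that the size-reduction-induced blue shift of the 141 cm$^{-1}$ mode is a consequence of atom-pair vibration, whereas the red shift of the other vibration modes arises from the interaction with all nearest neighbors.

Strain engineering of the elasticity and the Raman shift of nanostructured TiO2
http://www3.ntu.edu.sg/home/ecqsun/RTF/JAP-TiO.pdf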
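The scaling relations in points 1 to 3 can be checked numerically with a short Python sketch. It only evaluates the proportionalities as stated in the post, in ratio form so that unknown prefactors cancel; the function names and the example ratios are hypothetical and are not taken from the linked paper.

```python
def relative_modulus(d_ratio, e_ratio):
    """B/B0 under B ~ E / d**3, given d/d0 and E/E0."""
    return e_ratio / d_ratio ** 3

def relative_raman_shift(d_ratio, e_ratio, z_ratio=1.0, m_ratio=1.0):
    """Shift ratio under delta_omega ~ (z/d) * sqrt(E/m)."""
    return (z_ratio / d_ratio) * (e_ratio / m_ratio) ** 0.5

# Illustrative numbers only: the bond contracts by 2% and its energy rises by 5%.
d_ratio, e_ratio = 0.98, 1.05
b = relative_modulus(d_ratio, e_ratio)
w = relative_raman_shift(d_ratio, e_ratio)

# Point 3: with z and m fixed, the shift scales as sqrt(B * d).
consistent = abs(w - (b * d_ratio) ** 0.5) < 1e-12
print(f"B/B0 = {b:.4f}, shift ratio = {w:.4f}, consistent = {consistent}")
```

Working in ratios makes the consistency of relation 3 with relations 1 and 2 explicit: sqrt(B d) reduces to sqrt(E)/d, which is the shift formula with z and m held constant.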
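Category: Broken bonds and nonbonds | 3786 views | 2 comments
Common flow cytometry data analysis software on the market compared with FlowJo
Popularity 1 FlowJo 2011-3-8 03:36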
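Why FlowJo?

Many people ask: the instrument already comes with its own software, so why would I need third-party software? To help readers understand the differences between packages and what each can do, the following compares several common instrument-bundled packages on the market with FlowJo.

1. Operating flexibility

FlowJo can be used on almost every computer operating system: the various Windows systems on a PC (WinXP, Win 7, Vista), the Apple Macintosh system, and even Unix.

By contrast, among the other software on the market, Becton Dickinson's CellQuest can be used only on Apple machines, which shuts out most PC users. Another instrument-bundled package, FACSDiva, can be used only on the specially configured computer attached to the instrument. FlowJo can be installed on the user's own computer and runs on any machine with at least 512 MB of memory and 10 GB of free disk space. FlowJo users can copy data from the cytometer to their own computer, get away from the busy flow cytometry lab, and analyze data in whatever working environment they like, over a coffee or tea, instead of being confined to a lab space with strict rules on food and clothing. Data analysis that used to be tedious can be this simple and relaxed!

2. Language environment

The Chinese edition of FlowJo lets flow cytometry users analyze data in their native language. You need not puzzle over obscure English terms or hesitate over what an English window means. You can easily switch FlowJo's interface from English to Chinese without reinstalling the software. FlowJo is currently the only flow analysis software on the market with a Chinese interface.

FlowJo provides a Chinese user manual, Chinese video tutorials, and Chinese technical support. This removes many language barriers, especially for users who do not speak English as a second language.

3. Feature advantage: easy to learn

Many flow cytometry users want their analysis results quickly rather than spending a lot of time learning a new tool. FlowJo meets exactly this need. Whether or not you have any background in flow cytometry, and whatever your background or age, once you open FlowJo you are guaranteed to be up and running within 10 minutes, gating your experimental data and exporting the results! We are confident that even an 80-year-old grandmother could learn and master FlowJo! FlowJo's distinctive workflow and design make data analysis as natural and simple as breathing and eating, something other software cannot match.

4. Feature advantage: multiple specialized analysis models

Besides the usual analyses such as immunophenotyping, FlowJo includes several specialized analysis models, such as cell cycle analysis, kinetics analysis (calcium flux), and proliferation models. These are all built in and need not be purchased separately. Instrument-bundled software has none of these functions.

5. Feature advantage: post-acquisition fluorescence compensation

FlowJo can apply fluorescence compensation to data after acquisition. BD's CellQuest does not have this function. Post-acquisition compensation means adjusting the fluorescence compensation in software after the experimental data and the single-stained samples have been acquired. Many people prefer to compensate during acquisition and mistakenly believe that this is the most correct way to do it, which is a serious misconception. On analog-signal instruments, compensation is often judged by eye, following the commonly accepted "level and upright" (横平竖直) criterion. Adjusting compensation by eye can introduce large errors without the user noticing. The correct method is to compute the median fluorescence intensity (MFI) of the positive and negative populations and derive the optimal compensation from those values. Diva offers automatic calculation, but does not make it easy to adjust the result afterwards. FlowJo provides both automatic calculation and readjustment after compensation. In FlowJo you can fine-tune your compensation settings again, so a compensation mistake is no longer a cause for frustration.

6. Feature advantage: thoughtful design

✦ In FlowJo you can name gates yourself, for example "淋巴门" (lymphocyte gate) or "FoxP3阳性门" (FoxP3-positive gate), or in English "lymphocytes", "FoxP3+cells", and so on. Gates are no longer dry codes but precisely named cell populations.
✦ Overlays: FlowJo's overlay function is very easy to use; just drag two plots together. Other software may require complicated steps to do this, or may not offer the function at all. Overlaid plots can also be rendered as three-dimensional plots.
✦ Direct export to Excel! FlowJo can easily export data to an Excel file, saving a great deal of copying by hand. On many older cytometers, the only way to get results out was to print the final analysis or write it down by hand. Welcome to the 21st century! FlowJo exports the data for all samples to an Excel file in a single step.
✦ 3D views.
✦ Animated data export: plot data can be exported as an animated movie.

7. Feature advantage: powerful analysis functions

Do you know what data concatenation (Concatenate) is?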
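As the name suggests, it merges two or more FCS files into a single FCS file. Concatenation can solve an urgent problem, or take your data analysis to a new level, in the following situations:

✦ The same tube was accidentally acquired in two runs, producing two FCS files. If you are analyzing a cell population with very low expression and a single FCS file does not contain enough cells, you can use concatenation to merge the files into one.
✦ In antibody titration experiments, drug treatments at different time points, or other experiments in which only one treatment condition changes while all other parameters stay the same, you can concatenate the files into one and display all the data points in a single plot window, so as to observe how antibody expression changes gradually as the condition changes.

Do you know what calibration, derived parameter axes, and population comparison are? Perhaps you have never heard of these functions, but that does not matter. For more details, visit our website; you can also download FlowJo and start a 30-day free trial to experience its power for yourself! You can also call our hotline at 400-680-5527, or email contact@flowjochina.com, to obtain free instructions and a learning DVD.

www.flowjochina.com
www.flowjo.com

Attachment 1. Feature comparison table of FlowJo and other software
Attachment 2. Citation data for FlowJo in major academic journals

FlowJo is the most widely used flow cytometry data analysis software, and the great majority of flow cytometry figures published in high-impact journals were analyzed with FlowJo. For example, of every 10 articles in Nature Immunology, 7.8 used FlowJo for data analysis; the remaining 2.2 were analyzed with instrument-bundled software.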
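The MFI-based compensation mentioned in point 5 of the post can be illustrated with a two-colour Python sketch. It is not FlowJo's or Diva's algorithm: it assumes the common textbook approach of estimating a spillover coefficient from the medians of single-stained controls and then inverting a 2x2 spillover matrix, and all function names and numbers are hypothetical.

```python
from statistics import median

def spillover(pos_primary, neg_primary, pos_other, neg_other):
    """Fraction of a dye's signal that appears in another detector.

    Each argument is a list of intensities from a single-stained control:
    positive and negative populations, read in the dye's own (primary)
    detector and in the other detector. Medians are used as the MFI.
    """
    return (median(pos_other) - median(neg_other)) / (
        median(pos_primary) - median(neg_primary))

def compensate(event, s_ab, s_ba):
    """Undo spillover for one event (signal in detector A, signal in detector B).

    Assumes observed = true * [[1, s_ab], [s_ba, 1]], where s_ab is the
    spill of dye A into detector B and s_ba the spill of dye B into detector A.
    """
    a_obs, b_obs = event
    det = 1.0 - s_ab * s_ba
    a_true = (a_obs - s_ba * b_obs) / det
    b_true = (b_obs - s_ab * a_obs) / det
    return a_true, b_true

# Made-up single-stained controls for dye A and dye B.
s_ab = spillover([1000, 1010, 1020], [9, 10, 11], [105, 110, 115], [9, 10, 11])
s_ba = spillover([2010, 2020, 2030], [19, 20, 21], [115, 120, 125], [19, 20, 21])

# An event carrying only dye A (true signal 500) shows 50 units in detector B.
a_true, b_true = compensate((500.0, 50.0), s_ab, s_ba)
print(s_ab, s_ba, a_true, b_true)
```

Because the coefficients come from medians rather than from judging plots by eye, the same controls always yield the same compensation, and it can be recomputed or adjusted after acquisition.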
Category: Using FlowJo | 29943 views | 4 comments
