HapMap Part II: 3.1 million SNPs and natural selection 
Posted by Dan Koboldt on Monday, October 29, 2007, 01:52 PM
Two publications in Nature earlier this month marked the completion of Phase II of the International HapMap Project, an epic quest to understand the pattern of genetic variation in humans.



The first publication, "A second generation human haplotype map," describes the addition of 2.1 million SNPs to an already-dense Phase I map of human SNPs. The HapMap now has one polymorphic SNP every 875 bp on average, and 98.6% of the assembled genome lies within 5 kb of a polymorphic SNP. Interestingly, because SNP selection in Phase II did not consider SNP spacing or known minor allele frequency (MAF), the results offer a better view of rare variation in the human genome.

So How Many Tag SNPs Are There?
One key finding from Phase II is that up to 1% of common variants are untaggable because they lie in recombination hotspots. For the "taggable" majority, however, it takes 552,853 tag SNPs to capture common (MAF >= 0.05) variation at r^2 of at least 0.8 in the European-derived population. As expected, this number is slightly lower for Asian-derived populations (520,111 tag SNPs) and substantially higher (1.09 million tag SNPs) for African-derived populations. It should be noted that tagging for African populations was dramatically improved by the Phase II data.
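
For readers unfamiliar with the r^2 threshold used in tagging, here is a minimal Python sketch of how pairwise r^2 between two biallelic SNPs can be computed from phased haplotypes. The function names, the 0/1 allele coding, and the toy haplotypes are illustrative assumptions; this is not the HapMap consortium's tag-selection pipeline.

```python
def r_squared(haplotypes):
    """Pairwise LD (r^2) between two biallelic SNPs.

    haplotypes: list of (a, b) tuples, one per chromosome, where a and b
    are 0/1 alleles at SNP A and SNP B (hypothetical coding for this sketch).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n    # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n    # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                       # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom


def is_tagged(haplotypes, threshold=0.8):
    """True if SNP A captures SNP B at the given r^2 threshold."""
    return r_squared(haplotypes) >= threshold


if __name__ == "__main__":
    # Toy sample of 10 chromosomes: one recombinant haplotype breaks perfect LD.
    toy = [(1, 1)] * 4 + [(0, 0)] * 5 + [(1, 0)]
    print(r_squared(toy), is_tagged(toy))
```

A real tag SNP count also depends on how SNPs are grouped and chosen (greedy selection, multi-marker tags, and so on), which this pairwise check does not attempt to model.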

What Did We Learn About Selection?
Of the 56,789 nonsynonymous SNPs in dbSNP release 125, the HapMap attempted genotyping for 36,777 (64.76%) and got QC-passed, polymorphic results for 17,427 (47.39% of genotyped). That's a fairly dismal validation rate compared to the rest of the genome. Relative to synonymous SNPs, nonsynonymous SNPs in the HapMap exhibited an excess of rare variation and a paucity of common variation, consistent with widespread purifying selection against protein mutations. Additionally, the patterns of selection appeared stronger in the YRI panel, suggesting a reduced efficacy/strength of selection among non-African populations.
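
The percentages above are easy to reproduce. The short Python sketch below recomputes them from the three counts quoted in this post; the variable names are my own, and the overall yield in the last line is a derived figure that the paper summary does not state directly.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Counts quoted in the post (nonsynonymous SNPs, dbSNP release 125).
NS_SNPS_IN_DBSNP = 56_789
ATTEMPTED = 36_777
PASSED_POLYMORPHIC = 17_427

attempted_rate = percent(ATTEMPTED, NS_SNPS_IN_DBSNP)            # share of nsSNPs genotyped
validation_rate = percent(PASSED_POLYMORPHIC, ATTEMPTED)         # QC-passed and polymorphic
overall_yield = percent(PASSED_POLYMORPHIC, NS_SNPS_IN_DBSNP)    # derived, not in the post

print(f"attempted: {attempted_rate:.2f}%")
print(f"validated: {validation_rate:.2f}%")
print(f"overall:   {overall_yield:.2f}%")
```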

The second HapMap paper focused on positive selection in human populations. Using a modified extended-haplotype homozygosity test, Sabeti et al. identified 26 nsSNPs with regional evidence of positive selection. The candidate loci contain genes involved in Lassa virus infection (in Africa), skin pigmentation (in Europe), and hair follicle development (in Asia).

Functional Elements of the Genome: The ENCODE Pilot Project 
Posted by Dan Koboldt on Friday, September 14, 2007, 12:12 PM


In June, a landmark publication in Nature and dozens of companion articles in other journals heralded the completion of the ENCODE pilot project. Funded by the NHGRI in 2003, the Encyclopedia Of DNA Elements is a major undertaking by researchers at some 50 institutions to characterize functional elements in the human genome. In the pilot phase, 44 regions representing about 1% of the genome were examined. With an elegant combination of computational and experimental techniques, this effort represents a major step forward in our understanding of the function and architecture of our genetic code.

The main paper and several companion/commentary articles are listed below, but here is a quick summary of some of the key findings:
  • The human genome is pervasively transcribed. Some 74% of the bases were represented in transcripts identified by at least two different technologies, and a substantial number of the transcripts are from noncoding regions.
  • Transcription start sites (TSSs) are far more numerous than previously believed. Some 4,591 TSS clusters (many of them novel) were detected in ENCODE regions; that's almost 10 times the number of established protein-coding genes.
  • Chromatin architecture and histone modifications predict transcriptional activity. Support vector machine (SVM) modeling of histone modification data proved capable of predicting gene expression status (transcribed or not transcribed) with >90% accuracy. Transcriptionally-active regions also correlated highly with the presence of TSSs.
  • DNA replication timing is correlated with chromatin structure. "Active domains", associated with early replication, were enriched for TSSs, CpG islands, and Alu elements. Conversely, "repressed domains", associated with late replication, were enriched for LINE1 and LTR transposons.
  • Some 5% of bases in the genome are selectively constrained in mammals. Around 40% of constrained bases overlap coding exons and their UTRs, while 20% cover noncoding functional elements supported by experimental data. That leaves 40% of constrained bases whose function, if any, remains unknown. Perhaps even more surprising, many of the experimentally-identified functional elements do not appear to be constrained across mammalian evolution.

The findings of the ENCODE consortium are an achievement of collaboration and technological innovation, and offer an unparalleled view into the complexity underlying the human genome. To read more:

The ENCODE Paper
Birney, E., et al., Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 2007. 447(7146): p. 799-816.

Related Commentaries
Weinstock, G.M., ENCODE: more genomic empowerment. Genome Res, 2007. 17(6): p. 667-8.
Gerstein, M.B., et al., What is a gene, post-ENCODE? History and updated definition. Genome Res, 2007. 17(6): p. 669-81.
Henikoff, S., ENCODE and our very busy genome. Nat Genet, 2007. 39(7): p. 817-8.

Select Companion Articles
Margulies, E.H., et al., Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res, 2007. 17(6): p. 760-74.
Tress, M.L., et al., The implications of alternative splicing in the ENCODE protein complement. Proc Natl Acad Sci U S A, 2007. 104(13): p. 5495-500.
Zheng, D., et al., Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolution. Genome Res, 2007. 17(6): p. 839-51.
King, D.C., et al., Finding cis-regulatory elements using comparative genomics: some lessons from ENCODE data. Genome Res, 2007. 17(6): p. 775-86.
Washietl, S., et al., Structured RNAs in the ENCODE selected regions of the human genome. Genome Res, 2007. 17(6): p. 852-64.
Thurman, R.E., et al., Identification of higher-order functional domains in the human ENCODE regions. Genome Res, 2007. 17(6): p. 917-27.

Applications
Elnitski, L.L., et al., The ENCODEdb portal: simplified access to ENCODE Consortium data. Genome Res, 2007. 17(6): p. 954-9.
Thomas, D.J., et al., The ENCODE Project at UC Santa Cruz. Nucleic Acids Res, 2007. 35(Database issue): p. D663-7.

Medros, Inc. - Ross Cagan's Legacy at WashU 
Posted by Dan Koboldt on Tuesday, June 5, 2007, 11:08 AM
"We need more focus on taking what we study to the clinics." This was the opener given at last week's Genetics seminar by Ross Cagan, a 14-year WashU veteran perhaps best known for Medros, Inc., the drug discovery company he founded with fellow professor Thomas Baranski. The talk began with an overview of eye development in Drosophila, particularly the role of adhesion proteins "Hibris" and "Roughest" (homologs of Nephrin/Neph in humans) whose complementary expression directs correct geometric arrangement of ommatidia in the fly eye. Mutant phenotypes in eye development are easy to observe in flies, because they create a visible "ripple effect" among ommatidia. Cagan and his colleagues also hoped that our extensive knowledge of epithelia in flies might offer a different approach to studying complex diseases like cancer and diabetes.

The breakthrough in using flies to model human disease arose from work on Multiple Endocrine Neoplasia Type 2 (MEN2), a cancer syndrome caused by Ret mutations. Its spontaneous form, which accounts for 75% of cases, was untreatable. The introduction of oncogenic Ret into fly embryos causes a distinct eye-overgrowth phenotype, offering a nice model system to study the underlying cause of the human disease. Cagan and colleagues developed a high-throughput, fly-based drug screening technology which, long story short, isolated an AstraZeneca compound that rescued the overgrowth phenotype in flies and eventually proved effective against MEN2 in humans. They also identified 140 genetic modifiers in flies (enhancers and suppressors) whose homologs are likely resistance and susceptibility genes (respectively) in humans. Double knockdowns of one modifier, Csk, have enabled a fly model for oncogenesis and metastasis in Ret/MEN2 cancers.

Building fly models of human diseases allows Medros to screen drug candidates for toxicity, efficacy, bio-availability, etc. in a high-throughput and cost-effective manner. Models for cancers other than MEN2, oncogenic "cooperation", and diabetes are all in the works. Sadly, Ross Cagan is leaving St. Louis for a post at Mount Sinai School of Medicine.

Highlights of the HUGO Meeting 
Posted by Dan Koboldt on Wednesday, May 30, 2007, 04:51 PM

Last week I attended the HGM2007 meeting hosted by HUGO in Montreal, Canada. Here are some of the presentation highlights.

David Altshuler gave a talk entitled "Genomic variation and Inheritance of Common Disease" in which he mainly discussed a genome-wide association study of Type 2 Diabetes (T2D). His group looked for genetic association of T2D and 18 additional complex traits (cardiovascular phenotypes, etc.) in a sizeable sample population (1,464 cases and 1,467 controls). Using the Affy 500K chip and applying strict QC cutoffs, they tested 386,731 markers covering around 78% of common SNPs (CEU MAF >5%) in the genome. Interestingly, at least two of the associated SNPs he mentioned were in noncoding regions: a common SNP 125kb upstream of CDKN2B, and another in an intron of TCF7L2. Bottom line: Altshuler et al. identified 8 common genetic risk factors for T2D, all of which had frequencies of 26-85% in European populations. Each factor, however, had a very modest effect on risk (11-34%) - a good example of the genetics of common disease.

Another esteemed speaker was Sarah Tishkoff, who presented work on genetic structure and adaptation in African populations. Her group published some work late last year on a mutation common to East African populations that confers the ability to digest milk. The mutation appears to have arisen independently of the lactose-tolerance variant in Europeans, but affects the same gene (LCT). More interestingly, the African variant is over 14kb upstream of the LCT gene, in an intron of a different gene altogether. Dr. Tishkoff showed some results from a luciferase-expression assay that they developed to test the effect of ancestral and derived haplotypes on LCT expression. Even within Africa there appear to be at least four genetic sub-populations, suggesting that variation across ethnic groups is quite extensive on the continent.

The Autism Genome Project was well represented by Stephen Scherer, whose talk about chromosomal rearrangements in autism spectrum disorder (ASD) fell under the "Structural Variation" symposium. Evidently three main categories of clinical symptoms in ASD (behavior, social interaction, and communication/play) make up the so-called "triad of symptoms" for autism diagnosis. It has long been known that chromosomal aberrations are implicated in ASD, as 7.4% of patients have cytogenetically-visible chromosome rearrangements. Scherer et al. identified some 3,443 copy number variations (CNVs) across 111 or so genomic regions, and found that 16% of their autism cases had chromosomal or CNV aberrations. Unsurprisingly, ASD is proving to be a complex disorder, with "many genetic ways to get there."

Next up, "454 Does Jim Watson" - the Roche company seminar by Bruce Taillon described a 3.5X coverage sequencing of Jim Watson's genome. Their project is interesting because it offers a broad picture of sequence variation in a single individual. Around 1.3 million reads did not match the current human genome sequence (NCBI b36). Of these, 20% matched the Celera assembly and 65% were repetitive sequence - clearly the human genome sequence remains a "draft assembly". Jim had 1,942,500 substitutions (SNPs), 67.8% of which matched known variants from dbSNP. That left 625,238 novel variants - at least 400,000 of these have 2+ reads and are thus likely to be real. Among known SNPs, some 50 variants he carries are listed in databases of known phenotypes (e.g. OMIM), but Roche didn't say which ones.

Breakpoint read work presented at MC-GARD meeting 
Posted by Dan Koboldt on Wednesday, May 9, 2007, 01:03 PM

Last week I attended the first MC-GARD (Marie Curie - Genome Architecture in Relation to Disease) meeting, titled "Molecular Profiling of the Genome". The conference, chaired by Bauke Ylstra, was hosted at the VU University Medical Center in Amsterdam, the Netherlands. Lars Feuk (Hospital for Sick Children in Toronto) set the tone for the meeting with his keynote lecture titled "Discovery of Structural Variation: New insights into disease". He described the Database of Genomic Variants hosted at SickKids as well as some work on CNVs and autism.

Many of the talks discussed copy number variants (CNVs) and their relationship to human cancers. Copy number analysis with array CGH (aCGH) was a popular topic. On Friday (4 May), presentations opened with a very nice talk by Erwin Schurr (McGill Univ.) on the host genetics of tuberculosis and leprosy. He won points for discussing the common disease common variant (CDCV) hypothesis and the HapMap Project. Their work, however, focuses on finding "major gene effects" modulating the immunogenetics of infection. Counterintuitively, the major gene they found exhibited a dominant (not recessive) effect in TB/leprosy.

The invited speaker for my section was Matthew Hurles (Sanger), who described a very nice body of work to build a comprehensive map of copy number variation in the human genome. Their consortium has used high-resolution aCGH analysis of the HapMap samples to construct a dense map of CNVs >500 bp in size, a valuable addition to the haplotype map. In addition, they collaborated with E. Dermitzakis on correlating CNVs with heritable changes in gene expression.

The data Mat showed were impressive and made an excellent lead-in to my talk concerning high-throughput identification of structural variations from sequence trace data, which will be covered in another posting.

Average DAF of dbSNP Functional Classes 
Posted by Dan Koboldt on Thursday, April 19, 2007, 11:13 AM
I have calculated derived allele frequency (DAF) values for 2,539,864 SNPs characterized by the International HapMap Project in four populations of different ancestry. Although dbSNP offers only limited annotations of functional relevance for SNPs, I thought it might be interesting to plot the average DAF values for each of their functional classes (nonsynonymous coding, splice site, synonymous coding, mRNA-UTR, intron, locus, and unknown).

Differences in average derived allele frequency are apparent both between HapMap panels and between dbSNP functional classes. The increased relative DAF values for Europeans and Asians are consistent with a population bottleneck in recent human history. Looking at the functional classes, splice-site SNPs were the scarcest and also exhibited the lowest DAFs on average. Unsurprisingly, DAF values were notably decreased for nonsynonymous coding SNPs. Perhaps more surprising are synonymous SNPs, which allegedly have no functional relevance, but nevertheless have lower derived allele frequencies than UTR, intronic, and intergenic SNPs. It would be useful if dbSNP offered some further classifications (e.g. "promoter") of putative functional relevance.

fxn_class      snps         YRI      CEU      CHB      JPT
unknown        1,464,372    0.2521   0.2825   0.2817   0.2816
locus             71,912    0.2439   0.2738   0.2733   0.2731
intron           835,761    0.2386   0.2690   0.2691   0.2690
mrna-utr         139,046    0.2360   0.2657   0.2652   0.2654
coding-syn        13,447    0.2124   0.2365   0.2408   0.2419
coding-non        15,061    0.1665   0.1931   0.1966   0.1970
splice-site          265    0.0371   0.0395   0.0492   0.0500
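
As a rough illustration of the calculation behind these averages, the Python sketch below computes DAF for a SNP from allele counts and an assigned ancestral allele, then averages by functional class. The input layout, function names, and toy counts are assumptions made for the example; this post does not describe how ancestral alleles were assigned, and this is not the code used to produce the numbers above.

```python
from collections import defaultdict


def derived_allele_frequency(allele_counts, ancestral):
    """DAF = 1 - (ancestral allele count / total allele count).

    allele_counts: dict such as {"A": 90, "G": 30} for one population.
    Returns None when the SNP has no data or the ancestral allele is not
    one of the observed alleles (treated as unresolved in this sketch).
    """
    total = sum(allele_counts.values())
    if total == 0 or ancestral not in allele_counts:
        return None
    return 1 - allele_counts[ancestral] / total


def mean_daf_by_class(snps):
    """Average DAF per functional class.

    snps: iterable of (fxn_class, allele_counts, ancestral) tuples.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for fxn_class, allele_counts, ancestral in snps:
        daf = derived_allele_frequency(allele_counts, ancestral)
        if daf is None:
            continue  # skip SNPs whose DAF cannot be computed
        sums[fxn_class] += daf
        counts[fxn_class] += 1
    return {c: sums[c] / counts[c] for c in sums}


if __name__ == "__main__":
    # Toy data only; real input would be per-panel HapMap allele counts.
    toy_snps = [
        ("intron", {"A": 90, "G": 30}, "A"),
        ("intron", {"C": 60, "T": 60}, "C"),
        ("coding-non", {"A": 108, "G": 12}, "A"),
    ]
    for fxn_class, mean_daf in mean_daf_by_class(toy_snps).items():
        print(f"{fxn_class}\t{mean_daf:.4f}")
```

Running the same function once per panel (YRI, CEU, CHB, JPT) would give one column of the table per run.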


SNPs in human cytochrome P450 genes 
Posted by Dan Koboldt on Friday, April 6, 2007, 04:00 PM
I recently used our SNPseek web tool to perform an analysis of genetic (SNP) variation across cytochrome P450 genes in humans. A simple search by product name retrieved 53 CYP genes from the UCSC Known Genes (hg18). I plugged these gene symbols into SNPseek's analysis tool to retrieve a comprehensive report of the SNPs in CYP 450 genes according to dbSNP (b126).

SNPseek reported some 6,658 SNPs across the 53 CYP loci; of the 2,120 variants that were characterized by the HapMap Project, just over half (1,093) were polymorphic.

Because nonsynonymous SNPs are of particular interest, I retrieved the 346 'coding-nonsynon' variants in CYP genes from our modified database of amino acid variants (coming soon). SNPseek had data for 328 of these; more than half (189) were also classified as either human-rodent or human-vertebrate conserved. Looking at the 105 nsSNPs for which HapMap data was available, it appears that amino acid variants in CYP genes have relatively high rates of monomorphism (42.35%) and population-specificity (24.71%) compared to nsSNPs overall. Even more striking was the low incidence of "common link" SNPs (5.88%).

To me, these patterns fit nicely with the expectation of purifying selection acting on the coding sequences of genes for CYP 450 enzymes.



Modeling the genetics of complex traits in mice 
Posted by Dan Koboldt on Friday, March 30, 2007, 02:07 PM
A fascinating talk was given by Joseph Nadeau (Case Western) at this week's Genetics departmental seminar. He described a long-term collaborative project with Eric Lander in which they studied metabolic traits in mice, particularly resistance to diet-induced obesity. They put two established mouse strains (A/J and BL/6) on a high-fat, high-sugar diet (the rodent equivalent of a Big Mac and large Coke every day). Despite the fact that A/J mice ate more and were less active, they stayed thin while BL/6 mice developed obesity, hypertension, insulin resistance - your complete cardiovascular disease package. Over the course of 7 years (starting in '96) they genotyped 17,000 mice and developed a panel of 22 Chromosome Substitution Strains (CSSs) that you can get from Jackson Labs. They made all kinds of interesting observations; here are five of the most striking ones:
1. Complexity. Whereas previous mouse obesity mapping studies came up with 2-4 loci per trait, Nadeau and Lander's system makes it possible to find more QTLs (8 CSSs per trait on average) with small effects.
2. Effect Size. They expected to find many QTLs with small effects. Instead, they found many QTLs with large (51% on average) effects. For example, 8 CSSs account for 99.8% of variation in cholesterol levels among mice on the "diet".
3. Fractal Genetics. They observed a similarly large number of large effects on resistance to diet-induced obesity at the chromosome, congenic, and sub-congenic levels.
4. Epistasis. Some 20 genes conferred resistance to diet-induced obesity, but all of their effects were non-additive.
5. Alternating Stable States. In fact, the effect of having more genes reversed the phenotype completely. One gene = obese, 2 genes = lean, 3 genes = obese, 4 genes = lean.

I think that everyone left the room with a new appreciation for "systems biology" in mice to study the genetic architecture of complex traits.

