Honors & Awards
CAREER, NSF (2003-08)
Postdoctoral, Stanford, Genetics (2000)
PhD, Stanford, Statistics (1998)
BS & MS, Bocconi University, Statistics and Economics (1993)
Statistical models and reasoning are key to our understanding of the genetic basis of human traits. Modern high-throughput technology presents us with new opportunities and challenges. We develop statistical approaches for high-dimensional data in an attempt to improve our understanding of the molecular basis of health-related traits.
Tourette's syndrome (TS) is a developmental disorder that has one of the highest familial recurrence rates among neuropsychiatric diseases with complex inheritance. However, the identification of definitive TS susceptibility genes remains elusive. Here, we report the first genome-wide association study (GWAS) of TS in 1285 cases and 4964 ancestry-matched controls of European ancestry, including two European-derived population isolates, Ashkenazi Jews from North America and Israel and French Canadians from Quebec, Canada. In a primary meta-analysis of GWAS data from these European ancestry samples, no markers achieved a genome-wide threshold of significance (P<5 × 10(-8)); the top signal was found in rs7868992 on chromosome 9q32 within COL27A1 (P=1.85 × 10(-6)). A secondary analysis including an additional 211 cases and 285 controls from two closely related Latin American population isolates from the Central Valley of Costa Rica and Antioquia, Colombia also identified rs7868992 as the top signal (P=3.6 × 10(-7) for the combined sample of 1496 cases and 5249 controls following imputation with 1000 Genomes data). This study lays the groundwork for the eventual identification of common TS susceptibility variants in larger cohorts and helps to provide a more complete understanding of the full genetic architecture of this disorder.
View details for DOI 10.1038/mp.2012.69
View details for Web of Science ID 000319451600015
View details for PubMedID 22889924
Genomic copy number variations (CNVs) and increased parental age are both associated with the risk to develop a variety of clinical neuropsychiatric disorders such as autism, schizophrenia and bipolar disorder. At the same time, it has been shown that the rate of transmitted de novo single nucleotide mutations is increased with paternal age. To address whether paternal age also affects the burden of structural genomic deletions and duplications, we examined various types of CNV burden in a large population sample from the Netherlands. Healthy participants with parental age information (n = 6,773) were collected at different University Medical Centers. CNVs were called with the PennCNV algorithm using Illumina genome-wide SNP array data. We observed no evidence in support of a paternal age effect on CNV load in the offspring. Our results were negative for global measures as well as several proxies for de novo CNV events in this unique sample. While recent studies suggest de novo single nucleotide mutation rate to be dominated by the age of the father at conception, our results strongly suggest that at the level of global CNV burden there is no influence of increased paternal age. While it remains possible that local genomic effects may exist for specific phenotypes, this study indicates that global CNV burden and increased father's age may be independent disease risk factors.
View details for DOI 10.1007/s00439-012-1261-4
View details for Web of Science ID 000316345400008
View details for PubMedID 23315237
Variations in DNA copy number carry information on the modalities of genome evolution and mis-regulation of DNA replication in cancer cells. Their study can help localize tumor suppressor genes, distinguish different populations of cancerous cells, and identify genomic variations responsible for disease phenotypes. A number of different high throughput technologies can be used to identify copy number variable sites, and the literature documents multiple effective algorithms. We focus here on the specific problem of detecting regions where variation in copy number is relatively common in the sample at hand. This problem encompasses the cases of copy number polymorphisms, related samples, technical replicates, and cancerous sub-populations from the same individual. We present a segmentation method named generalized fused lasso (GFL) to reconstruct copy number variant regions. GFL is based on penalized estimation and is capable of processing multiple signals jointly. Our approach is computationally very attractive and leads to sensitivity and specificity levels comparable to those of state-of-the-art specialized methodologies. We illustrate its applicability with simulated and real data sets. The flexibility of our framework makes it applicable to data obtained with a wide range of technology. Its versatility and speed make GFL particularly useful in the initial screening stages of large data sets.
View details for DOI 10.1186/1471-2105-13-205
View details for Web of Science ID 000313262200001
View details for PubMedID 22897923
Temperament has a strongly heritable component, yet multiple independent genome-wide studies have failed to identify significant genetic associations. We have assembled the largest sample to date of persons with genome-wide genotype data, who have been assessed with Cloninger's Temperament and Character Inventory. Sum scores for novelty seeking, harm avoidance, reward dependence and persistence have been measured in over 11,000 persons collected in four different cohorts. Our study had >80% power to identify genome-wide significant loci (P<1.25 × 10(-8), with correction for testing four scales) accounting for ≥0.4% of the phenotypic variance in temperament scales. Using meta-analysis techniques, gene-based tests and pathway analysis we have tested over 1.2 million single-nucleotide polymorphisms (SNPs) for association to each of the four temperament dimensions. We did not discover any SNPs, genes, or pathways to be significantly related to the four temperament dimensions, after correcting for multiple testing. Less than 1% of the variability in any temperament dimension appears to be accounted for by a risk score derived from the SNPs showing strongest association to the temperament dimensions. Elucidation of genetic loci significantly influencing temperament and personality will require potentially very large samples, and/or a more refined phenotype. Item response theory methodology may be a way to incorporate data from cohorts assessed with multiple personality instruments, and might be a method by which a large sample of a more refined phenotype could be acquired.
View details for DOI 10.1038/tp.2012.37
View details for Web of Science ID 000312895700009
View details for PubMedID 22832960
Since 2008, multiple studies have reported on copy number variations (CNVs) in schizophrenia. However, many regions are unique events with minimal overlap between studies. This makes it difficult to gain a comprehensive overview of all CNVs involved in the etiology of schizophrenia. We performed a systematic CNV study on the basis of a homogeneous genome-wide dataset aiming at all CNVs ≥ 50 kilobase pairs. We complemented this analysis with a review of cytogenetic and chromosomal abnormalities for schizophrenia reported in the literature with the purpose of combining classical genetic findings and our current understanding of genomic variation. We investigated 834 Dutch schizophrenia patients and 672 Dutch control subjects. The CNVs were included if they were detected by QuantiSNP (http://www.well.ox.ac.uk/QuantiSNP/) as well as PennCNV (http://www.neurogenome.org/cnv/penncnv/) and contain known protein coding genes. The integrated identification of CNV regions and cytogenetic loci indicates regions of interest (cytogenetic regions of interest [CROIs]). In total, 2437 CNVs were identified with an average number of 2.1 CNVs/subject for both cases and control subjects. We observed significantly more deletions but not duplications in schizophrenia cases versus control subjects. The CNVs identified coincide with loci previously reported in the literature, confirming well-established schizophrenia CROIs 1q42 and 22q11.2 as well as indicating a potentially novel CROI on chromosome 5q35.1. Chromosomal deletions are more prevalent in schizophrenia patients than in healthy subjects and therefore confer a risk factor for pathogenicity. The combination of our CNV data with previously reported cytogenetic abnormalities in schizophrenia provides an overview of potentially interesting regions for positional candidate genes.
View details for DOI 10.1016/j.biopsych.2011.02.015
View details for Web of Science ID 000295595800009
View details for PubMedID 21489405
Glioblastoma (GBM) is among the most lethal of all cancers. GBM consist of a heterogeneous population of tumor cells among which a tumor-initiating and treatment-resistant subpopulation, here termed GBM stem cells, have been identified as primary therapeutic targets. Here, we describe a high-throughput small molecule screening approach that enables the identification and characterization of chemical compounds that are effective against GBM stem cells. The paradigm uses a tissue culture model to enrich for GBM stem cells derived from human GBM resections and combines a phenotype-based screen with gene target-specific screens for compound identification. We used 31,624 small molecules from 7 chemical libraries that we characterized and ranked based on their effect on a panel of GBM stem cell-enriched cultures and their effect on the expression of a module of genes whose expression negatively correlates with clinical outcome: MELK, ASPM, TOP2A, and FOXM1b. Of the 11 compounds meeting criteria for exerting differential effects across cell types used, 4 compounds showed selectivity by inhibiting multiple GBM stem cells-enriched cultures compared with nonenriched cultures: emetine, n-arachidonoyl dopamine, n-oleoyldopamine (OLDA), and n-palmitoyl dopamine. ChemBridge compounds #5560509 and #5256360 inhibited the expression of the 4 mitotic module genes. OLDA, emetine, and compounds #5560509 and #5256360 were chosen for more detailed study and inhibited GBM stem cells in self-renewal assays in vitro and in a xenograft model in vivo. These studies show that our screening strategy provides potential candidates and a blueprint for lead compound identification in larger scale screens or screens involving other cancer types.
View details for DOI 10.1158/1535-7163.MCT-11-0268
View details for Web of Science ID 000295968200006
View details for PubMedID 21859839
Phenotype mining is a novel approach for elucidating the genetic basis of complex phenotypic variation. It involves a search of rich phenotype databases for measures correlated with genetic variation, as identified in genome-wide genotyping or sequencing studies. An initial implementation of phenotype mining in a prospective unselected population cohort, the Northern Finland 1966 Birth Cohort (NFBC1966), identifies neurodevelopment-related traits (intellectual deficits, poor school performance and hearing abnormalities) that are more frequent among individuals with large (>500 kb) deletions than among other cohort members. Observation of extensive shared single nucleotide polymorphism haplotypes around deletions suggests an opportunity to expand phenotype mining from cohort samples to the populations from which they derive.
View details for DOI 10.1093/hmg/ddr162
View details for Web of Science ID 000291527000018
View details for PubMedID 21505072
Plasma concentrations of total cholesterol, low-density lipoprotein cholesterol, high-density lipoprotein cholesterol and triglycerides are among the most important risk factors for coronary artery disease (CAD) and are targets for therapeutic intervention. We screened the genome for common variants associated with plasma lipids in >100,000 individuals of European ancestry. Here we report 95 significantly associated loci (P < 5 x 10(-8)), with 59 showing genome-wide significant association with lipid traits for the first time. The newly reported associations include single nucleotide polymorphisms (SNPs) near known lipid regulators (for example, CYP7A1, NPC1L1 and SCARB1) as well as in scores of loci not previously implicated in lipoprotein metabolism. The 95 loci contribute not only to normal variation in lipid traits but also to extreme lipid phenotypes and have an impact on lipid traits in three non-European populations (East Asians, South Asians and African Americans). Our results identify several novel loci associated with plasma lipids that are also associated with CAD. Finally, we validated three of the novel genes (GALNT2, PPP1R3B and TTC39B) with experiments in mouse models. Taken together, our findings provide the foundation to develop a broader biological understanding of lipoprotein metabolism and to identify new therapeutic opportunities for the prevention of CAD.
View details for DOI 10.1038/nature09270
View details for Web of Science ID 000280562500029
View details for PubMedID 20686565
Although genome-wide association studies (GWASs) have identified numerous loci associated with complex traits, imprecise modeling of the genetic relatedness within study samples may cause substantial inflation of test statistics and possibly spurious associations. Variance component approaches, such as efficient mixed-model association (EMMA), can correct for a wide range of sample structures by explicitly accounting for pairwise relatedness between individuals, using high-density markers to model the phenotype distribution; but such approaches are computationally impractical. We report here a variance component approach implemented in publicly available software, EMMA eXpedited (EMMAX), that reduces the computational time for analyzing large GWAS data sets from years to hours. We apply this method to two human GWAS data sets, performing association analysis for ten quantitative traits from the Northern Finland Birth Cohort and seven common diseases from the Wellcome Trust Case Control Consortium. We find that EMMAX outperforms both principal component analysis and genomic control in correcting for sample structure.
View details for DOI 10.1038/ng.548
View details for Web of Science ID 000276150500016
View details for PubMedID 20208533
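The EMMAX abstract above compares the method against genomic control, one of the simpler corrections for inflated test statistics. As a rough illustration of what that baseline computes (not of EMMAX itself), the sketch below estimates the inflation factor as the median observed 1-df chi-square statistic divided by the null median and then rescales the statistics. The function names and the simulated statistics are assumptions made for this example; nothing here comes from the paper's software.

```python
import numpy as np

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Return lambda_GC: the observed median statistic over the null median."""
    stats = np.asarray(chi2_stats, dtype=float)
    return float(np.median(stats) / CHI2_1DF_MEDIAN)


def genomic_control(chi2_stats):
    """Divide statistics by lambda_GC; leave them alone when lambda_GC < 1."""
    stats = np.asarray(chi2_stats, dtype=float)
    lam = max(genomic_inflation(stats), 1.0)
    return stats / lam


# Simulated statistics (hypothetical, not GWAS data): null values scaled by 1.2
# to mimic uniform inflation caused by unmodeled sample structure.
rng = np.random.default_rng(0)
null_stats = rng.chisquare(df=1, size=200_000)
inflated_stats = 1.2 * null_stats
corrected_stats = genomic_control(inflated_stats)

print("lambda before:", round(genomic_inflation(inflated_stats), 3))
print("lambda after: ", round(genomic_inflation(corrected_stats), 3))
```

Genomic control applies one global scaling factor to every marker, which is one reason approaches that model pairwise relatedness explicitly can behave differently.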
Previous studies have implicated DTNBP1 as a schizophrenia susceptibility gene and its encoded protein, dysbindin, as a potential regulator of synaptic vesicle physiology. In this study, we found that endogenous levels of the dysbindin protein in the mouse brain are developmentally regulated, with higher levels observed during embryonic and early postnatal ages than in young adulthood. We obtained biochemical evidence indicating that the bulk of dysbindin from brain exists as a stable component of biogenesis of lysosome-related organelles complex-1 (BLOC-1), a multi-subunit protein complex involved in intracellular membrane trafficking and organelle biogenesis. Selective biochemical interaction between brain BLOC-1 and a few members of the SNARE (soluble N-ethylmaleimide-sensitive factor attachment protein receptor) superfamily of proteins that control membrane fusion, including SNAP-25 and syntaxin 13, was demonstrated. Furthermore, primary hippocampal neurons deficient in BLOC-1 displayed neurite outgrowth defects. Taken together, these observations suggest a novel role for the dysbindin-containing complex, BLOC-1, in neurodevelopment, and provide a framework for considering potential effects of allelic variants in DTNBP1--or in other genes encoding BLOC-1 subunits--in the context of the developmental model of schizophrenia pathogenesis.
View details for DOI 10.1038/mp.2009.58
View details for PubMedID 19546860
Recent advances in genomics have underscored the surprising ubiquity of DNA copy number variation (CNV). Fortunately, modern genotyping platforms also detect CNVs with fairly high reliability. Hidden Markov models and algorithms have played a dominant role in the interpretation of CNV data. Here we explore CNV reconstruction via estimation with a fused-lasso penalty as suggested by Tibshirani and Wang [Biostatistics 9 (2008) 18-29]. We mount a fresh attack on this difficult optimization problem by the following: (a) changing the penalty terms slightly by substituting a smooth approximation to the absolute value function, (b) designing and implementing a new MM (majorization-minimization) algorithm, and (c) applying a fast version of Newton's method to jointly update all model parameters. Together these changes enable us to minimize the fused-lasso criterion in a highly effective way. We also reframe the reconstruction problem in terms of imputation via discrete optimization. This approach is easier and more accurate than parameter estimation because it relies on the fact that only a handful of possible copy number states exist at each SNP. The dynamic programming framework has the added bonus of exploiting information that the current fused-lasso approach ignores. The accuracy of our imputations is comparable to that of hidden Markov models at a substantially lower computational cost.
View details for PubMedID 21572975
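Steps (a) and (b) of the fused-lasso abstract above can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: it replaces |x| with sqrt(x^2 + eps), majorizes each smoothed term by a quadratic at the current iterate, and minimizes the resulting surrogate exactly with a dense linear solve. The abstract's fast Newton update and the dynamic-programming imputation are not reproduced. All names, penalty weights and the simulated signal are hypothetical.

```python
import numpy as np


def smooth_abs(x, eps):
    """Smooth approximation to |x|: sqrt(x^2 + eps)."""
    return np.sqrt(x * x + eps)


def objective(beta, y, lam1, lam2, eps):
    """Smoothed fused-lasso criterion: fit + sparsity + fusion penalties."""
    fit = 0.5 * np.sum((y - beta) ** 2)
    sparsity = lam1 * np.sum(smooth_abs(beta, eps))
    fusion = lam2 * np.sum(smooth_abs(np.diff(beta), eps))
    return fit + sparsity + fusion


def fused_lasso_mm(y, lam1=0.1, lam2=1.0, eps=1e-6, n_iter=50):
    """MM iterations; each step minimizes a quadratic majorizer exactly."""
    y = np.asarray(y, dtype=float)
    n = y.size
    D = np.diff(np.eye(n), axis=0)  # (n-1) x n first-difference matrix
    beta = y.copy()
    history = [objective(beta, y, lam1, lam2, eps)]
    for _ in range(n_iter):
        # sqrt(x^2 + eps) <= const + x^2 / (2 * sqrt(x_k^2 + eps)), so the
        # surrogate is quadratic with these weights from the current iterate.
        w = 1.0 / smooth_abs(beta, eps)
        v = 1.0 / smooth_abs(D @ beta, eps)
        A = np.eye(n) + lam1 * np.diag(w) + lam2 * (D.T * v) @ D
        beta = np.linalg.solve(A, y)
        history.append(objective(beta, y, lam1, lam2, eps))
    return beta, history


# Simulated intensity signal (hypothetical): a raised segment plus noise.
rng = np.random.default_rng(1)
truth = np.concatenate([np.zeros(30), np.ones(20), np.zeros(30)])
y_sim = truth + 0.2 * rng.standard_normal(truth.size)
beta_hat, history = fused_lasso_mm(y_sim)
print("objective: start", round(history[0], 3), "end", round(history[-1], 3))
```

Because each surrogate touches the objective at the current iterate and lies above it elsewhere, the objective cannot increase from one iteration to the next. The dense solve is chosen for brevity; the system matrix is tridiagonal, so a banded solver would be the natural choice for long signals.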
In many organisms the expression levels of each gene are controlled by the activation levels of known "Transcription Factors" (TF). A problem of considerable interest is that of estimating the "Transcription Regulation Networks" (TRN) relating the TFs and genes. While the expression levels of genes can be observed, the activation levels of the corresponding TFs are usually unknown, greatly increasing the difficulty of the problem. Based on previous experimental work, it is often the case that partial information about the TRN is available. For example, certain TFs may be known to regulate a given gene or in other cases a connection may be predicted with a certain probability. In general, the biology of the problem indicates there will be very few connections between TFs and genes. Several methods have been proposed for estimating TRNs. However, they all suffer from problems such as unrealistic assumptions about prior knowledge of the network structure or computational limitations. We propose a new approach that can directly utilize prior information about the network structure in conjunction with observed gene expression data to estimate the TRN. Our approach uses L(1) penalties on the network to ensure a sparse structure. This has the advantage of being computationally efficient as well as making many fewer assumptions about the network structure. We use our methodology to construct the TRN for E. coli and show that the estimate is biologically sensible and compares favorably with previous estimates.
View details for PubMedID 21625366
We previously reported linkage of bipolar disorder to 5q33-q34 in families from two closely related population isolates, the Central Valley of Costa Rica (CVCR) and Antioquia, Colombia (CO). Here we present follow-up results from fine-scale mapping in large CVCR and CO families segregating severe bipolar disorder, BP-I, and in 343 population trios/duos from CVCR and CO. Employing densely spaced SNPs to fine map the prior linkage peak region increases linkage evidence and clarifies the position of the putative BP-I locus. We performed two-point linkage analysis with 1134 SNPs in an approximately 9 Mb region between markers D5S410 and D5S422. Combining pedigrees from CVCR and CO yields a LOD score of 4.9 at SNP rs10035961. Two other SNPs (rs7721142 and rs1422795) within the same 94 kb region also displayed LOD scores greater than 4. This linkage peak coincides with our prior microsatellite results and suggests a narrowed BP-I susceptibility region in these families. To investigate if the locus implicated in the familial form of BP-I also contributes to disease risk in the population, we followed up the family results with association analysis in duo and trio samples, obtaining signals within 2 Mb of the peak linkage signal in the pedigrees; rs12523547 and rs267015 (P = 0.00004 and 0.00016, respectively) in the CO sample and rs244960 in the CVCR sample and the combined sample, with P = 0.00032 and 0.00016, respectively. It remains unclear whether these association results reflect the same locus contributing to BP susceptibility within the extended pedigrees.
View details for DOI 10.1002/ajmg.b.30956
View details for Web of Science ID 000270441100014
View details for PubMedID 19319892
Down Syndrome cell adhesion molecule (Dscam) genes encode neuronal cell recognition proteins of the immunoglobulin superfamily. In Drosophila, Dscam1 generates 19,008 different ectodomains by alternative splicing of three exon clusters, each encoding half or a complete variable immunoglobulin domain. Identical isoforms bind to each other, but rarely to isoforms differing at any one of the variable immunoglobulin domains. Binding between isoforms on opposing membranes promotes repulsion. Isoform diversity provides the molecular basis for neurite self-avoidance. Self-avoidance refers to the tendency of branches from the same neuron (self-branches) to selectively avoid one another. To ensure that repulsion is restricted to self-branches, different neurons express different sets of isoforms in a biased stochastic fashion. Genetic studies demonstrated that Dscam1 diversity has a profound role in wiring the fly brain. Here we show how many isoforms are required to provide an identification system that prevents non-self branches from inappropriately recognizing each other. Using homologous recombination, we generated mutant animals encoding 12, 24, 576 and 1,152 potential isoforms. Mutant animals with deletions encoding 4,752 and 14,256 isoforms were also analysed. Branching phenotypes were assessed in three classes of neurons. Branching patterns improved as the potential number of isoforms increased, and this was independent of the identity of the isoforms. Although branching defects in animals with 1,152 potential isoforms remained substantial, animals with 4,752 isoforms were indistinguishable from wild-type controls. Mathematical modelling studies were consistent with the experimental results that thousands of isoforms are necessary to ensure acquisition of unique Dscam1 identities in many neurons. 
We conclude that thousands of isoforms are essential to provide neurons with a robust discrimination mechanism to distinguish between self and non-self during self-avoidance.
View details for DOI 10.1038/nature08431
View details for Web of Science ID 000270302600039
View details for PubMedID 19794492
Deletions within the neurexin 1 gene (NRXN1; 2p16.3) are associated with autism and have also been reported in two families with schizophrenia. We examined NRXN1, and the closely related NRXN2 and NRXN3 genes, for copy number variants (CNVs) in 2977 schizophrenia patients and 33 746 controls from seven European populations (Iceland, Finland, Norway, Germany, The Netherlands, Italy and UK) using microarray data. We found 66 deletions and 5 duplications in NRXN1, including a de novo deletion: 12 deletions and 2 duplications occurred in schizophrenia cases (0.47%) compared to 49 and 3 (0.15%) in controls. There was no common breakpoint and the CNVs varied from 18 to 420 kb. No CNVs were found in NRXN2 or NRXN3. We performed a Cochran-Mantel-Haenszel exact test to estimate association between all CNVs and schizophrenia (P = 0.13; OR = 1.73; 95% CI 0.81-3.50). Because the penetrance of NRXN1 CNVs may vary according to the level of functional impact on the gene, we next restricted the association analysis to CNVs that disrupt exons (0.24% of cases and 0.015% of controls). These were significantly associated with a high odds ratio (P = 0.0027; OR 8.97, 95% CI 1.8-51.9). We conclude that NRXN1 deletions affecting exons confer risk of schizophrenia.
View details for DOI 10.1093/hmg/ddn351
View details for Web of Science ID 000263409100017
View details for PubMedID 18945720
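The carrier frequencies quoted in the NRXN1 abstract above follow directly from its counts, which the short check below reproduces. It also computes a crude odds ratio that pools all seven populations. That pooled value is not expected to match the odds ratio reported in the abstract, which comes from a Cochran-Mantel-Haenszel test, a stratified analysis; the abstract gives no per-stratum counts, so the stratified estimate cannot be reproduced here. Function and variable names are illustrative.

```python
# Carrier counts quoted in the abstract (deletions + duplications in NRXN1).
case_carriers, n_cases = 12 + 2, 2977
control_carriers, n_controls = 49 + 3, 33746


def carrier_rate(carriers, total):
    """Fraction of individuals carrying a CNV."""
    return carriers / total


def crude_odds_ratio(a, n1, c, n0):
    """Unstratified odds ratio from carrier counts and group sizes."""
    return (a / (n1 - a)) / (c / (n0 - c))


case_rate = carrier_rate(case_carriers, n_cases)
control_rate = carrier_rate(control_carriers, n_controls)
pooled_or = crude_odds_ratio(case_carriers, n_cases, control_carriers, n_controls)

print(f"cases: {100 * case_rate:.2f}%  controls: {100 * control_rate:.2f}%")
print(f"crude (pooled) odds ratio: {pooled_or:.2f}")
```

A gap between a pooled and a stratified estimate is what one would see if carrier frequency and case proportion both vary across the contributing populations, which is the situation a stratified test is designed to handle.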
Genome-wide association studies (GWAS) of longitudinal birth cohorts enable joint investigation of environmental and genetic influences on complex traits. We report GWAS results for nine quantitative metabolic traits (triglycerides, high-density lipoprotein, low-density lipoprotein, glucose, insulin, C-reactive protein, body mass index, and systolic and diastolic blood pressure) in the Northern Finland Birth Cohort 1966 (NFBC1966), drawn from the most genetically isolated Finnish regions. We replicate most previously reported associations for these traits and identify nine new associations, several of which highlight genes with metabolic functions: high-density lipoprotein with NR1H3 (LXRA), low-density lipoprotein with AR and FADS1-FADS2, glucose with MTNR1B, and insulin with PANK1. Two of these new associations emerged after adjustment of results for body mass index. Gene-environment interaction analyses suggested additional associations, which will require validation in larger samples. The currently identified loci, together with quantified environmental exposures, explain little of the trait variation in NFBC1966. The association observed between low-density lipoprotein and an infrequent variant in AR suggests the potential of such a cohort for identifying associations with both common, low-impact and rarer, high-impact quantitative trait loci.
View details for DOI 10.1038/ng.271
View details for Web of Science ID 000262085300014
View details for PubMedID 19060910
Recent genome-wide association (GWA) studies of lipids have been conducted in samples ascertained for other phenotypes, particularly diabetes. Here we report the first GWA analysis of loci affecting total cholesterol (TC), low-density lipoprotein (LDL) cholesterol, high-density lipoprotein (HDL) cholesterol and triglycerides sampled randomly from 16 population-based cohorts and genotyped using mainly the Illumina HumanHap300-Duo platform. Our study included a total of 17,797-22,562 persons, aged 18-104 years and from geographic regions spanning from the Nordic countries to Southern Europe. We established 22 loci associated with serum lipid levels at a genome-wide significance level (P < 5 x 10(-8)), including 16 loci that were identified by previous GWA studies. The six newly identified loci in our cohort samples are ABCG5 (TC, P = 1.5 x 10(-11); LDL, P = 2.6 x 10(-10)), TMEM57 (TC, P = 5.4 x 10(-10)), CTCF-PRMT8 region (HDL, P = 8.3 x 10(-16)), DNAH11 (LDL, P = 6.1 x 10(-9)), FADS3-FADS2 (TC, P = 1.5 x 10(-10); LDL, P = 4.4 x 10(-13)) and MADD-FOLH1 region (HDL, P = 6 x 10(-11)). For three loci, effect sizes differed significantly by sex. Genetic risk scores based on lipid loci explain up to 4.8% of variation in lipids and were also associated with increased intima media thickness (P = 0.001) and coronary heart disease incidence (P = 0.04). The genetic risk score improves the screening of high-risk groups of dyslipidemia over classical risk factors.
View details for DOI 10.1038/ng.269
View details for Web of Science ID 000262085300015
View details for PubMedID 19060911
Illumina genotyping arrays provide information on DNA copy number. Current methodology for their analysis assumes linkage equilibrium across adjacent markers. This is unrealistic, given the markers' high density, and can result in reduced specificity. Another limitation of current methods is that they cannot be directly applied to the analysis of multiple samples with the goal of detecting copy number polymorphisms and their association with traits of interest. We propose a new Hidden Markov Model for Illumina genotype data that takes into account linkage disequilibrium between adjacent loci. Our framework also allows for location specific deletion/duplication rates. When multiple samples are available, we describe a methodology for their analysis that simultaneously reconstructs the copy number states in each sample and identifies genomic locations with increased variability in copy number in the population. This approach can be extended to test association between copy number variants and a disease trait. We show that taking into account linkage disequilibrium between adjacent markers can increase the specificity of a HMM in reconstructing copy number variants, especially single copy deletions. Our multisample approach is computationally practical and can increase the power of association studies.
View details for DOI 10.1159/000210445
View details for Web of Science ID 000265122300001
View details for PubMedID 19339782
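For orientation, the kind of baseline hidden Markov model that the abstract above extends can be sketched as follows. This is a deliberately plain version: three copy-number states, Gaussian emissions on a single intensity signal, one shared transition matrix, and Viterbi decoding. It does not model linkage disequilibrium between adjacent markers or location-specific deletion/duplication rates, which are the paper's contributions. The state means, noise level and transition probability are assumed values chosen for illustration.

```python
import numpy as np

# Assumed mean intensity (e.g. a log-ratio) per state:
# index 0 = deletion, 1 = normal copy number, 2 = duplication.
STATE_MEANS = np.array([-0.5, 0.0, 0.4])
SIGMA = 0.15   # assumed emission standard deviation, shared by all states
STAY = 0.99    # assumed probability of keeping the same state at the next marker


def viterbi_copy_number(intensity, means=STATE_MEANS, sigma=SIGMA, stay=STAY):
    """Most likely state path for a non-empty intensity sequence."""
    x = np.asarray(intensity, dtype=float)
    k = means.size
    log_trans = np.full((k, k), np.log((1.0 - stay) / (k - 1)))
    np.fill_diagonal(log_trans, np.log(stay))
    # Gaussian log-likelihoods up to a constant shared by all states.
    log_emit = -0.5 * ((x[:, None] - means[None, :]) / sigma) ** 2
    score = np.log(np.full(k, 1.0 / k)) + log_emit[0]
    back = np.zeros((x.size, k), dtype=int)
    for t in range(1, x.size):
        # cand[i, j]: best score of a path in state i at t-1 moving to j at t.
        cand = score[:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(k)] + log_emit[t]
    path = np.empty(x.size, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(x.size - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


# Noise-free toy signal (hypothetical): a deletion and a duplication segment.
true_states = np.array([1] * 20 + [0] * 10 + [1] * 20 + [2] * 10 + [1] * 10)
toy_signal = STATE_MEANS[true_states]
decoded = viterbi_copy_number(toy_signal)
print("segments recovered:", bool(np.array_equal(decoded, true_states)))
```

In this plain model every marker shares the same emission and transition parameters, so the genotype correlation between neighboring markers that the abstract exploits is ignored.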
Schizophrenia is a severe psychiatric disease with complex etiology, affecting approximately 1% of the general population. Most genetics studies so far have focused on disease association with common genetic variation, such as single-nucleotide polymorphisms (SNPs), but it has recently become apparent that large-scale genomic copy-number variants (CNVs) are involved in disease development as well. To assess the role of rare CNVs in schizophrenia, we screened 54 patients with deficit schizophrenia using Affymetrix's GeneChip 250K SNP arrays. We identified 90 CNVs in total, 77 of which have been reported previously in unaffected control cohorts. Among the genes disrupted by the remaining rare CNVs are MYT1L, CTNND2, NRXN1, and ASTN2, genes that play an important role in neuronal functioning but--except for NRXN1--have not been associated with schizophrenia before. We studied the occurrence of CNVs at these four loci in an additional cohort of 752 patients and 706 normal controls from The Netherlands. We identified eight additional CNVs, of which the four that affect coding sequences were found only in the patient cohort. Our study supports a role for rare CNVs in schizophrenia susceptibility and identifies at least three candidate genes for this complex disorder.
View details for DOI 10.1016/j.ajhg.2008.09.011
View details for Web of Science ID 000260239200008
View details for PubMedID 18940311
Reduced fecundity, associated with severe mental disorders, places negative selection pressure on risk alleles and may explain, in part, why common variants have not been found that confer risk of disorders such as autism, schizophrenia and mental retardation. Thus, rare variants may account for a larger fraction of the overall genetic risk than previously assumed. In contrast to rare single nucleotide mutations, rare copy number variations (CNVs) can be detected using genome-wide single nucleotide polymorphism arrays. This has led to the identification of CNVs associated with mental retardation and autism. In a genome-wide search for CNVs associating with schizophrenia, we used a population-based sample to identify de novo CNVs by analysing 9,878 transmissions from parents to offspring. The 66 de novo CNVs identified were tested for association in a sample of 1,433 schizophrenia cases and 33,250 controls. Three deletions at 1q21.1, 15q11.2 and 15q13.3 showing nominal association with schizophrenia in the first sample (phase I) were followed up in a second sample of 3,285 cases and 7,951 controls (phase II). All three deletions significantly associate with schizophrenia and related psychoses in the combined sample. The identification of these rare, recurrent risk variants, having occurred independently in multiple founders and being subject to negative selection, is important in itself. CNV analysis may also point the way to the identification of additional and more prevalent risk variants in genes and pathways involved in schizophrenia.
View details for DOI 10.1038/nature07229
View details for Web of Science ID 000259090800049
View details for PubMedID 18668039
To investigate the clinical features and natural history of mal de debarquement (MdD). Retrospective case review with follow-up questionnaire and telephone interviews. University Neurotology Clinic. Patients seen between 1980 and 2006 who developed a persistent sensation of rocking or swaying for at least 3 days after exposure to passive motion. Clinical features, diagnostic testing, and questionnaire responses. Of 64 patients (75% women) identified with MdD, 34 completed follow-up questionnaires and interviews in 2006. Most patients had normal neurological exams, ENGs and brain MRIs. The average age of the first MdD episode was 39+/-13 years. A total of 206 episodes were experienced by 64 patients. Of these, 104 episodes (51%) lasted >1 month; 18%, >1 year; 15%, >2 years; 12%, >4 years, and 11%, >5 years. Eighteen patients (28%) subsequently developed spontaneous episodes of MdD-like symptoms after the initial MdD episode. There was a much higher rate of migraine in patients who went on to develop spontaneous episodes (73%) than in those who did not (22%). Subsequent episodes were longer than earlier ones in most patients who had multiple episodes. Re-exposure to passive motion temporarily decreased symptoms in most patients (66%). Subjective intolerance to visual motion increased (10% to 66%) but self-motion sensitivity did not (37% to 50%) with onset of MdD. The majority of MdD episodes lasting longer than 3 days resolve in less than one year but the probability of resolution declines each year. Many patients experience multiple MdD episodes. Some patients develop spontaneous episodes after the initial motion-triggered episode with migraine being a risk factor.
View details for DOI 10.1007/s00415-008-0837-3
View details for Web of Science ID 000258025000014
View details for PubMedID 18500497
Affymetrix's SNP (single-nucleotide polymorphism) genotyping chips have increased the scope and decreased the cost of gene-mapping studies. Because each SNP is queried by multiple DNA probes, the chips present interesting challenges in genotype calling. Traditional clustering methods distinguish the three genotypes of an SNP fairly well given a large enough sample of unrelated individuals or a training sample of known genotypes. This article describes our attempt to improve genotype calling by constructing Gaussian mixture models with empirically derived priors. The priors stabilize parameter estimation and borrow information collectively gathered on tens of thousands of SNPs. When data from related family members are available, our models capture the correlations in signals between relatives. With these advantages in mind, we apply the models to Affymetrix probe intensity data on 10,000 SNPs gathered on 63 genotyped individuals spread over eight pedigrees. We integrate the genotype-calling model with pedigree analysis and examine a sequence of symmetry hypotheses involving the correlated probe signals. The symmetry hypotheses raise novel mathematical issues of parameterization. Using the Bayesian information criterion, we select the best combination of symmetry assumptions. Compared to Affymetrix's software, our model leads to a reduction in no-calls with little sacrifice in overall calling accuracy.
View details for PubMedID 21572926
We propose a new method for haplotyping, genotype calling, and association testing based on a dictionary model for haplotypes. In this framework, a haplotype arises as a concatenation of conserved haplotype segments, drawn from a predefined dictionary according to segment specific probabilities. The observed data consist of unphased multimarker genotypes gathered on a random sample of unrelated individuals. These genotypes are subject to mutation, genotyping errors, and missing data. The true pair of haplotypes corresponding to a person's multimarker genotype is reconstructed using a Markov chain that visits haplotype pairs according to their posterior probabilities. Our implementation of the chain alternates Gibbs steps, which rearrange the phase of a single marker, and Metropolis steps, which swap maternal and paternal haplotypes from a given marker onward. Outputs of the chain include the most likely haplotype pairs, the most likely genotypes at each marker, and the expected number of occurrences of each haplotype segment. Reconstruction accuracy is comparable to that achieved by the best existing algorithms. More importantly, the dictionary model yields expected counts of conserved haplotype segments. These imputed counts can serve as genetic predictors in association studies, as we illustrate by examples on cystic fibrosis, Friedreich's ataxia, and angiotensin-I converting enzyme levels.
View details for DOI 10.1002/gepi.20232
View details for Web of Science ID 000250904800002
View details for PubMedID 17487885
Population isolates may be particularly useful for association studies of complex traits. This utility, however, largely depends on the transferability of tag SNPs chosen from reference samples, such as HapMap, to samples from such populations. Factors that characterize population isolates, such as widespread genetic drift, could impede such transferability. In this report, we show that tag SNPs chosen from HapMap perform well in several population isolates; this is true even for populations that differ substantially from the HapMap sample either in levels of linkage disequilibrium or in SNP allele frequency distributions.
View details for DOI 10.1002/gepi.20201
View details for Web of Science ID 000245128200002
View details for PubMedID 17323370
Coexistent migraine affects relevant clinical features of patients with Ménière's disease (MD). Epidemiological studies have shown an association between migraine and MD. We sought to determine whether the coexistence of migraine affects any clinical features in patients with MD. In this retrospective case-control study of University Neurotology Clinic patients, 50 patients meeting 1995 AAO-HNS criteria for definite MD were compared to 18 patients meeting the same criteria in addition to the 2004 IHS criteria for migraine (MMD). All had typical low frequency sensorineural hearing loss and episodes of rotational vertigo. Outcome measures included: sex, age of onset of episodic vertigo or fluctuating hearing loss, laterality of hearing loss, aural symptoms, caloric responses, severity of hearing loss, and family history of migraine, episodic vertigo or hearing loss. Age of onset of episodic vertigo or fluctuating hearing loss was significantly lower in patients with MMD (mean +/- 1.96*SE = 37.2 +/- 6.3 years) than in those with MD (mean +/- 1.96*SE = 49.3 +/- 4.4 years). Concurrent bilateral aural symptoms and hearing loss were seen in 56% of MMD and 4% of MD patients. A family history of episodic vertigo was seen in 39% of MMD and 2% of MD patients.
View details for DOI 10.1080/00016480701242469
View details for Web of Science ID 000251240600002
View details for PubMedID 17851970
We consider the problem of controlling false discoveries in association studies. We assume that the design of the study is adequate, so that any "false discoveries" arise only from random chance and not from confounding or other flaws. Under this premise, we review the statistical framework for hypothesis testing and correction for multiple comparisons. We consider in detail the currently accepted strategies in linkage analysis. We then examine the underlying similarities and differences between linkage and association studies and document some of the most recent methodological developments for association mapping.
View details for PubMedID 17984547
Defining measures of linkage disequilibrium (LD) that have good small sample properties and are applicable to multiallelic markers poses some challenges. The potential of volume measures in this context has been noted before, but their use has been hampered by computational challenges. We design a sequential importance sampling algorithm to evaluate volume measures on I x J tables. The algorithm is implemented in a C routine as a complement to exhaustive enumeration. We make the C code available as open source. We achieve fast and accurate evaluation of volume measures in two dimensional tables. Applying our code to simulated and real datasets reinforces the belief that volume measures are a very useful tool for LD evaluation: they are not inflated in small samples, their definition encompasses multiallelic markers, and they can be computed with appreciable speed.
View details for DOI 10.1186/1471-2156-7-54
View details for Web of Science ID 000242380800001
View details for PubMedID 17112381
Rare sequence variants may be important in understanding the biology of common diseases, but clearly establishing their association with disease is often difficult. Association studies of such variants are becoming increasingly common as large-scale sequence analysis of candidate genes has become feasible. A recent report suggested SLITRK1 (Slit and Trk-like 1) as a candidate gene for Tourette Syndrome (TS). The statistical evidence for this suggestion came from association analyses of a rare 3'-UTR variant, var321, which was observed in two patients but not observed in more than 2000 controls. We genotyped 307 Costa Rican and 515 Ashkenazi individuals (TS probands and their parents) and observed var321 in five independent Ashkenazi parents, two of whom did not transmit this variant to their affected child. Furthermore, we identified var321 in one subject from an Ashkenazi control sample. Our findings do not support the previously reported association and suggest that var321 is overrepresented among Ashkenazi Jews compared with other populations of European origin. The results further suggest that overrepresentation of rare variants in a specific ethnic group may complicate the interpretation of association analyses of such variants, highlighting the particular importance of precisely matching case and control populations for association analyses of rare variants.
View details for DOI 10.1093/hmg/ddl408
View details for Web of Science ID 000241629900006
View details for PubMedID 17035247
We performed a whole genome microsatellite marker scan in six multiplex families with bipolar (BP) mood disorder ascertained in Antioquia, a historically isolated population from North West Colombia. These families were characterized clinically using the approach employed in independent ongoing studies of BP in the closely related population of the Central Valley of Costa Rica. The most consistent linkage results from parametric and non-parametric analyses of the Colombian scan involved markers on 5q31-33, a region implicated by the previous studies of BP in Costa Rica. Because of these concordant results, a follow-up study with additional markers was undertaken in an expanded set of Colombian and Costa Rican families; this provided genome-wide significant evidence of linkage of BPI to a candidate region of approximately 10 cM in 5q31-33 (maximum non-parametric linkage score=4.395, P<0.00004). Interestingly, this region has been implicated in several previous genetic studies of schizophrenia and psychosis, including disease association with variants of the enthoprotin and gamma-aminobutyric acid receptor genes.
View details for DOI 10.1093/hmg/ddl254
View details for Web of Science ID 000241430000006
View details for PubMedID 16984960
We have ascertained in the Central Valley of Costa Rica a new kindred (CR201) segregating for severe bipolar disorder (BP-I). The family was identified by tracing genealogical connections among eight persons initially independently ascertained for a genome wide association study of BP-I. For the genome screen in CR201, we trimmed the family down to 168 persons (82 of whom are genotyped), containing 25 individuals with a best-estimate diagnosis of BP-I. A total of 4,690 SNP markers were genotyped. Analysis of the data was hampered by the size and complexity of the pedigree, which prohibited using exact multipoint methods on the entire kindred. Two-point parametric linkage analysis, using a conservative model of transmission, produced a maximum LOD score of 2.78 on chromosome 6, and a total of 39 loci with LOD scores >1.0. Multipoint parametric and non-parametric linkage analysis was performed separately on four sections of CR201, and interesting (nominal P-value from either analysis <0.01), although not statistically significant, regions were highlighted on chromosomes 1, 2, 3, 12, 16, 19, and 22, in at least one section of the pedigree, or when considering all sections together. The difficulties of analyzing genome wide SNP data for complex disorders in large, potentially informative, kindreds are discussed.
View details for DOI 10.1002/ajmg.b.30323
View details for Web of Science ID 000238054200008
View details for PubMedID 16652356
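The two-point LOD scores quoted above compare the likelihood of the data at a recombination fraction theta with the likelihood under no linkage (theta = 0.5). As a minimal sketch, assuming the simplest textbook setting of phase-known, fully informative meioses (far simpler than the large pedigree analyzed in the study), the score can be computed as follows. The function names `lod_score` and `max_lod` are illustrative and do not come from the paper.

```python
import math


def lod_score(recombinants, meioses, theta):
    """LOD score for k recombinants in n phase-known, informative meioses.

    LOD(theta) = log10( theta^k * (1 - theta)^(n - k) / 0.5^n )
    """
    k, n = recombinants, meioses
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if theta == 0.0:
        # With theta = 0 any observed recombinant has probability zero.
        return n * math.log10(2) if k == 0 else float("-inf")
    return (k * math.log10(theta)
            + (n - k) * math.log10(1 - theta)
            + n * math.log10(2))


def max_lod(recombinants, meioses, grid=501):
    """Maximize the LOD score over an evenly spaced grid of theta values."""
    thetas = [0.5 * i / (grid - 1) for i in range(grid)]
    return max((lod_score(recombinants, meioses, t), t) for t in thetas)


if __name__ == "__main__":
    best_lod, best_theta = max_lod(recombinants=1, meioses=10)
    print(f"max LOD = {best_lod:.3f} at theta = {best_theta:.3f}")
```

Real pedigree analyses have to sum over unknown phases and missing genotypes, which is where the computational difficulties discussed in the abstract arise.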
The genome-wide distribution of linkage disequilibrium (LD) determines the strategy for selecting markers for association studies, but it varies between populations. We assayed LD in large samples (200 individuals) from each of 11 well-described population isolates and an outbred European-derived sample, using SNP markers spaced across chromosome 22. Most isolates show substantially higher levels of LD than the outbred sample and many fewer regions of very low LD (termed 'holes'). Young isolates known to have had relatively few founders show particularly extensive LD with very few holes; these populations offer substantial advantages for genome-wide association mapping.
View details for DOI 10.1038/ng1770
View details for Web of Science ID 000237147500017
View details for PubMedID 16582909
We propose a dictionary model for haplotypes. According to the model, a haplotype is constructed by randomly concatenating haplotype segments from a given dictionary of segments. A haplotype block is defined as a set of haplotype segments that begin and end with the same pair of markers. In this framework, haplotype blocks can overlap, and the model provides a setting for testing the accuracy of simpler models invoking only nonoverlapping blocks. Each haplotype segment in a dictionary has an assigned probability and alternate spellings that account for genotyping errors and mutation. The model also allows for missing data, unphased genotypes, and prior distribution of parameters. Likelihood evaluations rely on forward and backward recurrences similar to the ones encountered in hidden Markov models. Parameter estimation is carried out with an EM algorithm. The search for the optimal dictionary is particularly difficult because of the variable dimension of the model space. We define a minimum description length criterion to evaluate each dictionary and use a combination of greedy search and careful initialization to select a best dictionary for a given dataset. Application of the model to simulated data gives encouraging results. In a real dataset, we are able to reconstruct a parsimonious dictionary that captures patterns of linkage disequilibrium well.
View details for Web of Science ID 000237966000011
View details for PubMedID 16706724
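The forward recurrence mentioned in the abstract can be illustrated with a heavily simplified sketch. It assumes a single phased haplotype with no genotyping error, mutation, or missing data, and a dictionary whose segment probabilities are already known. The data layout (a dict keyed by start marker and allele string) and the function name `haplotype_likelihood` are assumptions made for illustration, not the authors' implementation.

```python
def haplotype_likelihood(haplotype, dictionary):
    """Probability of a haplotype under a simplified dictionary model.

    haplotype  : string of alleles, one character per marker.
    dictionary : maps (start_marker, alleles) -> probability, where alleles
                 is a non-empty string. Probabilities of the segments sharing
                 a start marker are assumed to sum to one.

    forward[i] is the total probability of all segment concatenations that
    spell out the first i markers of the haplotype.
    """
    n = len(haplotype)
    forward = [0.0] * (n + 1)
    forward[0] = 1.0
    for end in range(1, n + 1):
        for (start, alleles), prob in dictionary.items():
            # A segment contributes if it ends at this marker and matches.
            if start + len(alleles) == end and haplotype[start:end] == alleles:
                forward[end] += forward[start] * prob
    return forward[n]


# Toy dictionary over four markers with overlapping blocks:
# one long segment covers markers 0-3, shorter ones cover 0-1 and 2-3.
toy_dictionary = {
    (0, "01"): 0.5,
    (0, "10"): 0.3,
    (0, "0111"): 0.2,
    (2, "11"): 0.6,
    (2, "00"): 0.4,
}

if __name__ == "__main__":
    print(haplotype_likelihood("0111", toy_dictionary))
```

In the toy dictionary, the haplotype "0111" can arise either as the single long segment or as "01" followed by "11", and the recurrence sums both routes.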
In systems like Escherichia coli, the abundance of sequence information, gene expression array studies and small scale experiments allows one to reconstruct the regulatory network and to quantify the effects of transcription factors on gene expression. However, this goal can only be achieved if all information sources are used in concert. Our method integrates literature information, DNA sequences and expression arrays. A set of relevant transcription factors is defined on the basis of literature. Sequence data are used to identify potential target genes and the results are used to define a prior distribution on the topology of the regulatory network. A Bayesian hidden component model for the expression array data allows us to identify which of the potential binding sites are actually used by the regulatory proteins in the studied cell conditions, the strength of their control, and their activation profile in a series of experiments. We apply our methodology to 35 expression studies in E. coli with convincing results. www.genetics.ucla.edu/labs/sabatti/software.html The supplementary material is available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btk017
View details for Web of Science ID 000236111600015
View details for PubMedID 16368767
Benign recurrent vertigo (BRV) is a common disorder affecting up to 2% of the adult population and may be etiologically related to migraine because of similarities in the clinical spectrum of the phenotypes and a high co-morbidity within families. Many families have multiple-affected genetically related individuals suggesting familial transmission of the disorder with moderate to high penetrance. While clinically similar to episodic ataxias, there are currently no genes identified that contribute to BRV and no systematic linkage studies performed. In an initial effort to genetically define BRV, we have selected from our Neurology Clinic population a subset of 20 multigenerational families with apparent autosomal dominant transmission, and performed genetic linkage mapping using both parametric and non-parametric linkage (NPL) approaches. The Affymetrix 10K SNP Mapping Assay was used for the genotyping. Heterogeneity LOD (HLOD) analysis reveals evidence of genetic heterogeneity for BRV and evidence of linkage in a subset of the families to 22q12 (HLOD = 4.02). An additional region was identified by NPL analysis at 5p15 (LOD = 2.63). As migraine is observed substantially more commonly both within the BRV-affected individuals and the related family members, it is possible that a form of migraine is allelic to the BRV locus at 22q12. However, testing linkage of the chromosome 22q12 region to a broader migraine/vertigo phenotype by defining affectation status as either migrainous headaches or BRV greatly weakened the linkage signal, and no significant other peaks were detected. Thus, BRV and migraine do not appear to be allelic disorders within these families. We conclude that BRV is a heterogeneous genetic disorder, appears genetically distinct from migraine with aura and is linked to 22q12. Additional family and population-based linkage and association studies will be needed to determine the causative alleles.
View details for DOI 10.1093/hmg/ddi441
View details for Web of Science ID 000234630400007
View details for PubMedID 16330481
Analyze the information contained in homozygous haplotypes detected with high density genotyping. We analyze the genotypes of approximately 2,500 markers on chr 22 in 12 population samples, each including 200 individuals. We develop a measure of disequilibrium based on haplotype homozygosity and an algorithm to identify genomic segments characterized by non-random homozygosity (NRH), taking into account allele frequencies, missing data, genotyping error, and linkage disequilibrium. We show how our measure of linkage disequilibrium based on homozygosity leads to results comparable to those of R(2), as well as the importance of correcting for small sample variation when evaluating D'. We observe that the regions that harbor NRH segments tend to be consistent across populations, are gene rich, and are characterized by lower recombination. It is crucial to take into account LD patterns when interpreting long stretches of homozygous markers.
View details for DOI 10.1159/000096599
View details for Web of Science ID 000242847200001
View details for PubMedID 17077642
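For reference, the two classical pairwise LD measures named in the abstract, D' and r-squared, can be computed from haplotype counts at two biallelic markers. The sketch below uses the standard textbook definitions. It is not the homozygosity-based measure developed in the paper, and the function name `ld_measures` is an assumption made for illustration.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2) from counts of the four two-marker haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n          # frequency of allele A at marker 1
    p_B = (n_AB + n_aB) / n          # frequency of allele B at marker 2
    D = n_AB / n - p_A * p_B         # raw disequilibrium coefficient

    # D' rescales D by the largest magnitude it could take given the
    # observed allele frequencies.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = D / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the two allele indicators.
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = D * D / denom if denom > 0 else 0.0
    return D, d_prime, r2


if __name__ == "__main__":
    # One of the four haplotypes is absent, so D' equals 1 although the
    # markers are far from perfectly correlated.
    print(ld_measures(30, 0, 20, 50))
```

D' equals 1 whenever one of the four haplotypes is unobserved, which happens easily in small samples. This is one reason a small-sample correction of the kind mentioned in the abstract matters when evaluating D'.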
Late endosomes and lysosomes of mammalian cells in interphase tend to concentrate in the perinuclear region that harbors the microtubule-organizing center. We have previously reported abnormal distribution of these organelles - as judged by reduced percentages of cells displaying pronounced perinuclear accumulation - in mutant fibroblasts lacking BLOC-3 (for 'biogenesis of lysosome-related organelles complex 3'). BLOC-3 is a protein complex that contains the products of the genes mutated in Hermansky-Pudlak syndrome types 1 and 4. Here, we developed a method based on image analysis to estimate the extent of organelle clustering in the perinuclear region of cultured cells. Using this method, we corroborated that the perinuclear clustering of late endocytic organelles containing Lamp1 (for 'lysosome-associated membrane protein 1') is reduced in BLOC-3-deficient murine fibroblasts, and found that it is apparently normal in fibroblasts deficient in BLOC-1 or BLOC-2, which are another two protein complexes associated with Hermansky-Pudlak syndrome. Wild-type and mutant fibroblasts were transfected to express human LAMP1 fused at its cytoplasmic tail to green fluorescent protein (GFP). At low expression levels, LAMP1-GFP was targeted correctly to late endocytic organelles in both wild-type and mutant cells. High levels of LAMP1-GFP overexpression elicited aberrant aggregation of late endocytic organelles, a phenomenon that probably involved formation of anti-parallel dimers of LAMP1-GFP as it was not observed in cells expressing comparable levels of a non-dimerizing mutant variant, LAMP1-mGFP. To test whether BLOC-3 plays a role in the movement of late endocytic organelles, time-lapse fluorescence microscopy experiments were performed using live cells expressing low levels of LAMP1-GFP or LAMP1-mGFP.
Although active movement of late endocytic organelles was observed in both wild-type and mutant fibroblasts, quantitative analyses revealed a relatively lower frequency of microtubule-dependent movement events, either towards or away from the perinuclear region, within BLOC-3-deficient cells. By contrast, neither the duration nor the speed of these microtubule-dependent events seemed to be affected by the lack of BLOC-3 function. These results suggest that BLOC-3 function is required, directly or indirectly, for optimal attachment of late endocytic organelles to microtubule-dependent motors.
View details for DOI 10.1242/jcs.02633
View details for Web of Science ID 000233883500009
View details for PubMedID 16249233
The authors recently introduced a framework, named Network Component Analysis (NCA), for the reconstruction of the dynamics of transcriptional regulators' activities from gene expression assays. The original formulation had certain shortcomings that limited NCA's application to a wide class of network dynamics reconstruction problems, either because of limitations in the sample size or because of the stringent requirements imposed by the set of identifiability conditions. In addition, the performance characteristics of the method for various levels of data noise or in the presence of model inaccuracies were never investigated. In this article, the following aspects of NCA have been addressed, resulting in a set of extensions to the original framework: 1) The sufficient conditions on the a priori connectivity information (required for successful reconstructions via NCA) are made less stringent, allowing easier verification of whether a network topology is identifiable, as well as extending the class of identifiable systems. Such a result is accomplished by introducing a set of identifiability requirements that can be directly tested on the regulatory architecture, rather than on specific instances of the system matrix. 2) The two-stage least square iterative procedure used in NCA is proven to identify stationary points of the likelihood function, under Gaussian noise assumption, thus reinforcing the statistical foundations of the method. 3) A framework for the simultaneous reconstruction of multiple regulatory subnetworks is introduced, thus overcoming one of the critical limitations of the original formulation of the decomposition, for example, occurring for poorly sampled data (typical of microarray experiments). 
A set of Monte Carlo simulations we conducted with synthetic data suggests that the approach is indeed capable of accurately reconstructing regulatory signals when these are the input of large-scale networks that satisfy the suggested identifiability criteria, even under fairly noisy conditions. The sensitivity of the reconstructed signals to inaccuracies in the hypothesized network topology is also investigated. We demonstrate the feasibility of our approach for the simultaneous reconstruction of multiple regulatory subnetworks from the same data set with a successful application of the technique to gene expression measurements of the bacterium Escherichia coli.
View details for Web of Science ID 000235704400002
View details for PubMedID 17044167
Gene expression arrays enable measurements of transcription values for a large number or all genes in the genome. In order to better interpret these results and to use them to reconstruct transcription networks, information on location of binding sites for regulatory proteins in the entire genome is needed. In particular, this represents an open problem in Escherichia coli. We describe the first implementation of dictionary-style models to the study of transcription factor binding sites in an entire genome. Vocabulon's unique feature is that it can both reconstruct binding sites characterized by unknown motifs and impute locations of known binding sites in long sequences by simultaneous search. On one hand, the dictionary model specifies a probability for the entire sequence taking simultaneously into account all the possible binding sites. This greatly reduces the number of false positives. On the other hand, the possibility of refining motif description, as an increasing number of binding sites are identified, augments the sensitivity of the method. We illustrate these properties with examples in E. coli. The results of gene expression arrays are used both to guide the search and corroborate it.
View details for DOI 10.1093/bioinformatics/bti083
View details for Web of Science ID 000227977800012
View details for PubMedID 15509602
Gene microarray technology is often used to compare the expression of thousands of genes in two different cell lines. Typically, one does not expect measurable changes in transcription amounts for a large number of genes; furthermore, the noise level of array experiments is rather high in relation to the available number of replicates. For the purpose of statistical analysis, inference on the "population" difference in expression for genes across the two cell lines is often cast in the framework of hypothesis testing, with the null hypothesis being no change in expression. Given that thousands of genes are investigated at the same time, this requires some multiple comparison correction procedure to be in place. We argue that hypothesis testing, with its emphasis on type I error and family analogues, may not address the exploratory nature of most microarray experiments. We instead propose viewing the problem as one of estimation of a vector known to have a large number of zero components. In a Bayesian framework, we describe the prior knowledge on expression changes using mixture priors that incorporate a mass at zero, and we choose a loss function that favors the selection of sparse solutions. We consider two different models applicable to the microarray problem, depending on the nature of replicates available, and show how to explore the posterior distributions of the parameters using MCMC. Simulations show an interesting connection between this Bayesian estimation framework and false discovery rate (FDR) control. Finally, two empirical examples illustrate the practical advantages of this Bayesian estimation paradigm.
View details for Web of Science ID 000238478100016
View details for PubMedID 16646840
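The idea of a mixture prior with a mass at zero can be made concrete in a deliberately simplified setting. Assume a single gene with observed expression change x ~ N(theta, sigma2), where theta is exactly zero with probability 1 - w and drawn from N(0, tau2) with probability w. Under these assumptions the posterior is available in closed form, so no MCMC is needed. The paper's models and loss function are richer than this; the parameter values and the function name `spike_slab_posterior` are illustrative assumptions.

```python
import math


def normal_pdf(x, var):
    """Density at x of a zero-mean normal distribution with variance var."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)


def spike_slab_posterior(x, w=0.1, sigma2=1.0, tau2=4.0):
    """Posterior summary for theta given one observation x.

    Returns (probability that theta is nonzero, posterior mean of theta).
    """
    # Marginal density of x under each mixture component.
    slab = w * normal_pdf(x, sigma2 + tau2)    # theta drawn from N(0, tau2)
    spike = (1 - w) * normal_pdf(x, sigma2)    # theta exactly zero
    p_nonzero = slab / (slab + spike)
    # Given theta != 0, the posterior mean shrinks x toward zero.
    posterior_mean = p_nonzero * x * tau2 / (sigma2 + tau2)
    return p_nonzero, posterior_mean


if __name__ == "__main__":
    for x in (0.0, 1.0, 2.5, 5.0):
        p, m = spike_slab_posterior(x)
        print(f"x = {x:4.1f}  P(theta != 0 | x) = {p:.3f}  E[theta | x] = {m:.3f}")
```

Small observed changes are pulled almost entirely to zero while large ones are retained with mild shrinkage, which is the sparsity-favoring behavior the abstract describes.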
We describe domain pair exclusion analysis (DPEA), a method for inferring domain interactions from databases of interacting proteins. DPEA features a log odds score, Eij, reflecting confidence that domains i and j interact. We analyzed 177,233 potential domain interactions underlying 26,032 protein interactions. In total, 3,005 high-confidence domain interactions were inferred, and were evaluated using known domain interactions in the Protein Data Bank. DPEA may prove useful in guiding experiment-based discovery of previously unrecognized domain interactions.
View details for DOI 10.1186/gb-2005-6-10-r89
View details for Web of Science ID 000232679600012
View details for PubMedID 16207360
Of the more than 40 genetically defined dominantly inherited hearing loss syndromes, only a few are associated with bilateral vestibulopathy. No genetic mutations have been identified in families with bilateral vestibulopathy and normal hearing. To perform a genome-wide scan for linkage in four families with dominantly inherited bilateral vestibulopathy. Patients in four families reported brief episodes of vertigo followed by imbalance and oscillopsia. Bilateral vestibulopathy was documented with quantitative rotational testing. Most patients with bilateral vestibulopathy also had migraine. A 10 cM genome-wide screen was conducted using 423 microsatellite markers to identify linkage with vestibulopathy. The authors identified a 24 cM region on chromosome 6q suggestive of linkage to vestibulopathy in these four families (maximum lod score of 2.9 at marker D6S1556). A small fifth family with a different phenotype was not linked to this region on chromosome 6q. This is the first report of linkage in families with dominantly inherited vestibulopathy and normal hearing. Genetic heterogeneity is likely with inherited vestibulopathy.
View details for Web of Science ID 000226010000030
View details for PubMedID 15623703
We describe a unique family in which several individuals are affected with episodes of ataxia that best fit the phenotype of episodic ataxia type 2 (EA2). All of the affected family members had episodes typically lasting for several hours, and none of them had muscle abnormalities including myokymia. Episodic ataxia type 1 (EA1) was not considered initially as a clinical diagnosis for the affected individuals in this family. However, by linkage mapping, sequencing and polymorphism analysis, all affecteds were found to have a novel mutation in KCNA1. Numerous missense mutations have been described previously in KCNA1 that cause EA1. The mutation c.1025G>T replaces a highly conserved serine with isoleucine at position 342 (p.Ser342Ile) in the highly conserved fifth transmembrane domain of KCNA1. This mutation leads to a distinct clinical phenotype without myokymia, broadening the scope of clinical characteristics of EA1 and highlighting the heterogeneity of phenotypic effects from distinct missense mutations.
View details for PubMedID 15532032
Efforts to identify gene variants associated with susceptibility to common diseases use three approaches: pedigree and affected sib-pair linkage studies and association studies of population samples. The different aims of these study designs reflect their derivation from biological versus epidemiological traditions. Similar principles regarding determination of the evidence levels required to consider the results statistically significant apply to both linkage and association studies, however. Such determination requires explicit attention to the prior probability of particular findings, as well as appropriate correction for multiple comparisons. For most common diseases, increasing the sample size in a study is a crucial step in achieving statistically significant genetic mapping results. Recent studies suggest that the technology and statistical methodology will soon be available to make well-powered studies feasible using any of these approaches.
View details for DOI 10.1038/ng1433
View details for Web of Science ID 000224156500009
View details for PubMedID 15454942
Cells adjust gene expression profiles in response to environmental and physiological changes through a series of signal transduction pathways. Upon activation or deactivation, the terminal regulators bind to or dissociate from DNA, respectively, and modulate transcriptional activities on particular promoters. Traditionally, individual reporter genes have been used to detect the activity of the transcription factors. This approach works well for simple, non-overlapping transcription pathways. For complex transcriptional networks, more sophisticated tools are required to deconvolute the contribution of each regulator. Here, we demonstrate the utility of network component analysis in determining multiple transcription factor activities based on transcriptome profiles and available information regarding network connectivity. We used Escherichia coli carbon source transition from glucose to acetate as a model system. Key results from this analysis were either consistent with physiology or verified by using independent measurements.
View details for DOI 10.1073/pnas.0305287101
View details for Web of Science ID 000188210400042
View details for PubMedID 14694202
A semiblind deconvolution method of analysis for gene expression data was proposed recently in a series of articles that appeared in PNAS. We illustrate here how similar goals can be achieved in a Bayesian framework and how necessary information on the presence of binding sites can be obtained with Vocabulon, an algorithm based on a stochastic dictionary model.
View details for PubMedID 17270892
High-dimensional data sets generated by high-throughput technologies, such as DNA microarrays, are often the outputs of complex networked systems driven by hidden regulatory signals. Traditional statistical methods for computing low-dimensional or hidden representations of these data sets, such as principal component analysis and independent component analysis, ignore the underlying network structures and provide decompositions based purely on a priori statistical constraints on the computed component signals. The resulting decomposition thus provides a phenomenological model for the observed data and does not necessarily contain physically or biologically meaningful signals. Here, we develop a method, called network component analysis, for uncovering hidden regulatory signals from outputs of networked systems, when only a partial knowledge of the underlying network topology is available. The a priori network structure information is first tested for compliance with a set of identifiability criteria. For networks that satisfy the criteria, the signals from the regulatory nodes and their strengths of influence on each output node can be faithfully reconstructed. This method is first validated experimentally by using the absorbance spectra of a network of various hemoglobin species. The method is then applied to microarray data generated from yeast Saccharomyces cerevisiae and the activities of various transcription factors during cell cycle are reconstructed by using recently discovered connectivity information for the underlying transcriptional regulatory networks.
View details for DOI 10.1073/pnas.2136632100
View details for Web of Science ID 000187554600044
View details for PubMedID 14673099
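The decomposition described in the network component analysis abstract above can be illustrated with a minimal sketch. It assumes the data matrix E is modeled as a product A P, where A (connection strengths) has a known zero pattern and P holds the hidden regulatory signals. The alternating least squares loop, the function name `nca_als`, and its parameters are illustrative assumptions, not the authors' implementation, and the identifiability checks mentioned in the abstract are not included.

```python
import numpy as np

def nca_als(E, mask, n_iter=50, seed=0):
    """Fit E ~ A @ P by alternating least squares, keeping A zero outside `mask`.

    E    : (outputs x samples) data matrix
    mask : (outputs x regulators) boolean array, True where a connection is allowed
    Returns A, P and the residual norm recorded after each iteration.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(size=mask.shape) * mask      # random start respecting the zero pattern
    residuals = []
    for _ in range(n_iter):
        # Step 1: with A fixed, solve for the regulatory signals P.
        P, *_ = np.linalg.lstsq(A, E, rcond=None)
        # Step 2: with P fixed, refit each row of A over its allowed entries only.
        for i in range(E.shape[0]):
            idx = np.flatnonzero(mask[i])
            if idx.size:
                coef, *_ = np.linalg.lstsq(P[idx].T, E[i], rcond=None)
                A[i, idx] = coef
        residuals.append(np.linalg.norm(E - A @ P))
    return A, P, residuals
```

Each step minimizes the same squared error over one factor, so the recorded residual cannot increase from one iteration to the next. As with any such factorization, A and P are recovered at best up to a scaling of each regulator.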
The genetic programs underlying neural stem cell (NSC) proliferation and pluripotentiality have only been partially elucidated. We compared the gene expression profile of proliferating neural stem cell cultures (NS) with cultures differentiated for 24 h (DC) to identify functionally coordinated alterations in gene expression associated with neural progenitor proliferation. The majority of differentially expressed genes (65%) were upregulated in NS relative to DC. Microarray analysis of this in vitro system was followed by high throughput screening in situ hybridization to identify genes enriched in the germinal neuroepithelium, so as to distinguish those expressed in neural progenitors from those expressed in more differentiated cells in vivo. NS cultures were characterized by the coordinate upregulation of genes involved in cell cycle progression, DNA synthesis, and metabolism, not simply related to general features of cell proliferation, since many of the genes identified were highly enriched in the CNS ventricular zones and not widely expressed in other proliferating tissues. Components of specific metabolic and signal transduction pathways, and several transcription factors, including Sox3, FoxM1, and PTTG1, were also enriched in neural progenitor cultures. We propose a putative network of gene expression linking cell cycle control to cell fate pathways, providing a framework for further investigations of neural stem cell proliferation and differentiation.
View details for DOI 10.1016/S0012-1606(03)00274-4
View details for Web of Science ID 000185224400011
View details for PubMedID 12941627
We explore the implications of the false discovery rate (FDR) controlling procedure in disease gene mapping. With the aid of simulations, we show how, under models commonly used, the simple step-down procedure introduced by Benjamini and Hochberg controls the FDR for the dependent tests on which linkage and association genome screens are based. This adaptive multiple comparison procedure may offer an important tool for mapping susceptibility genes for complex diseases.
View details for Web of Science ID 000183880000042
View details for PubMedID 12807801
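For readers unfamiliar with the procedure discussed in the FDR abstract above, here is a minimal sketch of the standard Benjamini-Hochberg rule: sort the m p-values, find the largest rank k with p_(k) <= k q / m, and reject the k smallest. The function name and the default level q = 0.05 are illustrative choices, not taken from the paper.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Return a boolean array marking the hypotheses rejected at FDR level q."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices of p-values, smallest first
    thresholds = q * np.arange(1, m + 1) / m     # k * q / m for ranks k = 1..m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.flatnonzero(below).max()          # largest rank whose p-value passes
        rejected[order[:k + 1]] = True           # reject it and every smaller p-value
    return rejected
```

Note that a p-value can be rejected even if it exceeds its own threshold, as long as a larger p-value further down the sorted list passes.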
Horizontal gaze palsy with progressive scoliosis (HGPS) is a rare, autosomal recessive disorder characterized by a congenital absence of conjugate horizontal eye movement, with progressive scoliosis developing in childhood or adolescence. The authors identified two unrelated consanguineous families with HGPS. Genomewide homozygosity mapping and linkage analysis mapped the disease locus to a 30-cM interval on chromosome 11q23-25 (combined maximum multipoint lod score Z = 5.46).
View details for Web of Science ID 000177335800022
View details for PubMedID 12177379
The prediction of operons, the smallest unit of transcription in prokaryotes, is the first step towards reconstruction of a regulatory network at the whole genome level. Sequence information, in particular the distance between open reading frames, has been used to predict if adjacent Escherichia coli genes are in an operon. While appreciably successful, these predictions need to be validated and refined experimentally. As a growing number of gene expression array experiments on E. coli became available, we investigated to what extent they could be used to improve and validate these predictions. To this end, we examined a large collection of published microarray data. The correlation between expression ratios of adjacent genes was used in a Bayesian classification scheme to predict whether the genes are in an operon or not. We found that for the genes whose expression levels change significantly across the experiments in the data set, the currently available gene expression data allowed a significant refinement of the sequence-based predictions. We report these co-expression correlations in an E. coli genomic map. For a significant portion of gene pairs, however, the set of array experiments considered did not contain sufficient information to determine whether they are in the same transcriptional unit. This is not due to unreliability of the array data per se, but to the design of the experiments analyzed. In general, experiments that perturb a large number of genes offer more information for operon prediction than confined perturbations. These results provide a rationale for conducting expression studies comparing conditions that cause global changes in gene expression.
View details for Web of Science ID 000176607000021
View details for PubMedID 12087173
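The Bayesian classification step in the operon abstract above can be sketched as a two-class application of Bayes' rule to the correlation r between adjacent genes. The Gaussian class-conditional densities and every numeric parameter below are hypothetical placeholders chosen for illustration; the abstract does not state the distributions used. The `prior` argument could stand in for a sequence-based prediction, which is also an assumption.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def operon_posterior(r, prior=0.5, op=(0.7, 0.2), non_op=(0.1, 0.3)):
    """Posterior probability that two adjacent genes share an operon.

    r      : correlation between the expression ratios of the two genes
    prior  : prior probability of being in the same operon (hypothetical)
    op     : (mean, sd) of r for operon pairs (hypothetical values)
    non_op : (mean, sd) of r for non-operon pairs (hypothetical values)
    """
    like_op = normal_pdf(r, *op)
    like_non = normal_pdf(r, *non_op)
    numerator = prior * like_op
    return numerator / (numerator + (1 - prior) * like_non)
```

With these placeholder parameters, a strongly correlated pair such as `operon_posterior(0.8)` receives a posterior above one half, while an uncorrelated pair such as `operon_posterior(0.0)` falls well below it.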
We illustrate how homozygosity of haplotypes can be used to measure the level of disequilibrium between two or more markers. An excess of either homozygosity or heterozygosity signals a departure from the gametic phase equilibrium: We describe the specific form of dependence that is associated with high (low) homozygosity and derive various linkage disequilibrium measures. They feature a clear biological interpretation, can be used to construct tests, and are standardized to allow comparison across loci and populations. They are particularly advantageous to measure linkage disequilibrium between highly polymorphic markers.
View details for Web of Science ID 000175237200039
View details for PubMedID 11973323
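The idea in the haplotype homozygosity abstract above can be made concrete with a small sketch for two markers. Under gametic phase equilibrium each haplotype frequency is the product of its allele frequencies, so haplotype homozygosity (the sum of squared haplotype frequencies) equals the product of the two single-marker homozygosities. The function below returns the raw difference between observed and expected values; it is an unstandardized illustration, not one of the standardized measures derived in the paper, and the function names are assumptions.

```python
from collections import Counter

def homozygosity(freqs):
    """Sum of squared frequencies: the chance two random draws match."""
    return sum(f * f for f in freqs)

def excess_haplotype_homozygosity(haplotypes):
    """Observed minus expected haplotype homozygosity for two markers.

    haplotypes : list of (allele_at_marker_A, allele_at_marker_B) tuples
    A positive value signals excess homozygosity, a negative value excess
    heterozygosity, and zero is what equilibrium predicts.
    """
    n = len(haplotypes)
    hap_freqs = [c / n for c in Counter(haplotypes).values()]
    a_freqs = [c / n for c in Counter(h[0] for h in haplotypes).values()]
    b_freqs = [c / n for c in Counter(h[1] for h in haplotypes).values()]
    observed = homozygosity(hap_freqs)
    expected = homozygosity(a_freqs) * homozygosity(b_freqs)
    return observed - expected
```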
We consider array experiments that compare expression levels of a large number of genes in two cell lines with few repetitions and with no subject effect. We develop a statistical model that illustrates under which assumptions thresholding is optimal in the analysis of such microarray data. The results of our model explain the success of the empirical rule of two-fold change. We illustrate a thresholding procedure that is adaptive to the noise level of the experiment, the number of genes analyzed, and the number of genes that truly change expression level. This procedure, in a world of perfect knowledge on noise distribution, would allow reconstruction of a sparse signal, minimizing the false discovery rate. Given the amount of information actually available, the thresholding rule described provides a reasonable estimator for the change in expression of any gene in two compared cell lines.
View details for Web of Science ID 000174539900003
View details for PubMedID 11867081
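As a rough illustration of the thresholding idea in the abstract above, the sketch below applies hard thresholding to log expression ratios. It swaps in a well-known generic rule, the universal threshold sigma * sqrt(2 log n) with a median-absolute-deviation noise estimate, which adapts to the noise level and to the number of genes but not to the number of genes that truly change. It is therefore a stand-in under stated assumptions, not the procedure developed in the paper.

```python
import numpy as np

def adaptive_hard_threshold(log_ratios):
    """Zero out log ratios that are indistinguishable from noise.

    Returns the thresholded estimates and the threshold that was used.
    """
    x = np.asarray(log_ratios, dtype=float)
    # Robust noise estimate: median absolute deviation, scaled for Gaussian noise.
    sigma = np.median(np.abs(x - np.median(x))) / 0.6745
    # Universal threshold: grows slowly with the number of genes analyzed.
    threshold = sigma * np.sqrt(2 * np.log(x.size))
    estimates = np.where(np.abs(x) > threshold, x, 0.0)
    return estimates, threshold
```

A fixed two-fold rule corresponds to a constant threshold of 1 on the log2 scale; the version above instead lets the cutoff move with the estimated noise.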
Archival formalin-fixed, paraffin-embedded and ethanol-fixed tissues represent a potentially invaluable resource for gene expression analysis, as they are the most widely available material for studies of human disease. Little data are available evaluating whether RNA obtained from fixed (archival) tissues could produce reliable and reproducible microarray expression data. Here we compare the use of RNA isolated from human archival tissues fixed in ethanol and formalin to frozen tissue in cDNA microarray experiments. Since an additional factor that can limit the utility of archival tissue is the often small quantities available, we also evaluate the use of the tyramide signal amplification method (TSA), which allows the use of small amounts of RNA. Detailed analysis indicates that TSA provides a consistent and reproducible signal amplification method for cDNA microarray analysis, across both arrays and the genes tested. Analysis of this method also highlights the importance of performing non-linear channel normalization and dye switching. Furthermore, archived, fixed specimens can perform well, but not surprisingly, produce more variable results than frozen tissues. Consistent results are more easily obtainable using ethanol-fixed tissues, whereas formalin-fixed tissue does not typically provide a useful substrate for cDNA synthesis and labeling.
View details for Web of Science ID 000173551200028
View details for PubMedID 11788730
Compared to mixed populations, population isolates such as Finland show distinct differences in the prevalence of disease mutations. However, little information exists on differences in the prevalence of disease alleles among regional populations with different histories of multiple bottlenecks. We constructed a DNA-array and monitored the prevalence of 31 rare and common disease mutations underlying 27 clinical phenotypes in a large population-based study sample. Over 64 000 genotypes were assigned in 2151 samples from four geographical areas representing early and late settlement regions of Finland. Each sample was analyzed in duplicate and a total of 142 000 array-derived genotyping calls were made. On average one in three individuals was found to be a carrier of one of the 31 monitored mutations. This should remove fears of the stigmatizing effect of a carrier-screening program monitoring multiple diseases. Regional differences were found in the prevalence of mutations, providing molecular evidence for the deviating population histories of regional subisolates. The mutations introduced early into the population revealed relatively even distribution in different subregions. More recently introduced rare mutations showed local clustering of disease alleles, indicating the persistence of population subisolates and the effect of multiple bottlenecks in molding the population gene pool. Regional differences were observed also for common disease alleles. Such precise information on carrier frequencies could form the basis for targeted genetic screens in this population. Our approach also describes a general paradigm for large-scale carrier-screening programs in other populations.
View details for Web of Science ID 000172870300001
View details for PubMedID 11751678
Haplotype analysis of disease chromosomes can help identify probable historical recombination events and localize disease mutations. Most available analyses use only marginal and pairwise allele frequency information. We have developed a Bayesian framework that utilizes full haplotype information to overcome various complications such as multiple founders, unphased chromosomes, data contamination, and incomplete marker data. A stochastic model is used to describe the dependence structure among several variables characterizing the observed haplotypes, for example, the ancestral haplotypes and their ages, mutation rate, recombination events, and the location of the disease mutation. An efficient Markov chain Monte Carlo algorithm was developed for computing the estimates of the quantities of interest. The method is shown to perform well in both real data sets (cystic fibrosis data and Friedreich ataxia data) and simulated data sets. The program that implements the proposed method, BLADE, as well as the two real datasets, can be obtained from http://www.fas.harvard.edu/~junliu/TechRept/01folder/diseq_prog.tar.gz.
View details for Web of Science ID 000171456000013
View details for PubMedID 11591648
To develop diagnostic testing guidelines for the DYT1 GAG deletion in the Ashkenazi Jewish (AJ) and non-Jewish (NJ) primary torsion dystonia (PTD) populations and to determine the range of dystonic features in affected DYT1 deletion carriers. The authors screened 267 individuals with PTD; 170 were clinically ascertained for diagnosis and treatment, 87 were affected family members ascertained for genetic studies, and 10 were clinically and genetically ascertained and included in both groups. We used published primers and PCR amplification across the critical DYT1 region to determine GAG deletion status. Features of dystonia in clinically ascertained (affected) DYT1 GAG deletion carriers and noncarriers were compared to determine a classification scheme that optimized prediction of carriers. The authors assessed the range of clinical features in the genetically ascertained (affected) DYT1 deletion carriers and tested for differences between AJ and NJ patients. The optimal algorithm for classification of clinically ascertained carriers was disease onset before age 24 years in a limb (misclassification, 16.5%; sensitivity, 95%; specificity, 80%). Although application of this classification scheme provided good separation in the AJ group (sensitivity, 96%; specificity, 88%), as well as in the group overall, it was less specific in discriminating NJ carriers from noncarriers (sensitivity, 94%; specificity, 69%). Using age 26 years as the cut-off and any site at onset gave a sensitivity of 100%, but specificity decreased to 54% (63% in AJ and 43% in NJ). Among genetically ascertained carriers, onset up to age 44 years occurred, although the great majority displayed early limb onset. There were no significant differences between AJ and NJ genetically ascertained carriers, except that a higher proportion of NJ carriers had onset in a leg, rather than an arm, and widespread disease. Diagnostic DYT1 testing in conjunction with genetic counseling is recommended for patients with PTD with onset before age 26 years, as this single criterion detected 100% of clinically ascertained carriers, with specificities of 43% to 63%. Testing patients with onset after age 26 years also may be warranted in those having an affected relative with early onset, as the only carriers we observed with onset at age 26 or later were genetically ascertained relatives of individuals whose symptoms started before age 26 years.
View details for Web of Science ID 000086908000007
View details for PubMedID 10802779
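The sensitivity and specificity figures quoted in the DYT1 abstract above come from comparing a simple onset-age rule against carrier status. The sketch below shows how such figures are computed for a rule of the form "onset before a cut-off age". The function names are illustrative, and any patient records passed to them in examples are hypothetical, not data from the study.

```python
def early_onset_rule(onset_age, cutoff=26):
    """Predict carrier status from age at onset alone (True means predicted carrier)."""
    return onset_age < cutoff

def sensitivity_specificity(predicted, actual):
    """Compute (sensitivity, specificity) from paired boolean predictions and truths."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)            # carriers correctly flagged
    fn = sum(1 for p, a in pairs if not p and a)        # carriers missed
    tn = sum(1 for p, a in pairs if not p and not a)    # noncarriers correctly cleared
    fp = sum(1 for p, a in pairs if p and not a)        # noncarriers wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)
```

Raising the cut-off age flags more patients, which tends to raise sensitivity and lower specificity, the same trade-off the abstract reports between the age 24 and age 26 criteria.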