This talk will highlight recent developments in the study of phylogenetic invariants. In particular, assuming a general model of the mutation process of orthologous sequences, 'most' polynomial relationships in expected pattern frequencies can be explicitly constructed. These constructions are tied to specific topological features (edges and nodes) of a phylogenetic tree. This new understanding of invariants leads to theoretical results on the identifiability of the tree topology for models with increased biological realism, such as the covarion model and certain mixture models.
Wright's neighborhood size can be seen as a statement about the probability of coalescence of lineages integrated over space. Moving backwards in time, neighborhood size increases and the probability of coalescence decreases. As such, Wright's neighborhood model could potentially be used for coalescent inference over structured populations parameterised by parent-offspring dispersal and population density. This is in contrast to models parameterised by the size of panmictic units and migration vectors between them. If we wish to use coalescent inference over a study system, and lack prior knowledge of the scale at which panmixis can be assumed, Wright's neighborhood model seems appropriate. Here I show how Wright's neighborhood model can be implemented on a lattice, allowing sampling of the properties of genealogies in space and time for a set of georeferenced field observations. I contrast two sampling approaches that allow Bayesian inference over these genealogies and discuss the implications for inference over recent timescales (gene flow, population structure) and deeper timescales (phylogeographic process).
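As a point of reference for the parameterisation mentioned above, the following is a minimal sketch of the standard two-dimensional definition of Wright's neighborhood size, Nb = 4 * pi * sigma^2 * D. The abstract does not state this formula; the function name and argument names are illustrative assumptions, with sigma taken as the standard deviation of parent-offspring dispersal along one axis and density as individuals per unit area.

```python
import math


def neighborhood_size(sigma: float, density: float) -> float:
    """Wright's neighborhood size in two dimensions: Nb = 4 * pi * sigma^2 * D.

    sigma   -- standard deviation of parent-offspring dispersal along one axis
               (assumed to be in the same length unit as the density)
    density -- individuals per unit area
    """
    if sigma < 0 or density < 0:
        raise ValueError("sigma and density must be non-negative")
    return 4.0 * math.pi * sigma ** 2 * density


if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate the call.
    print(neighborhood_size(sigma=2.0, density=0.5))
```

Because sigma enters as a square, doubling the dispersal distance quadruples the neighborhood size, whereas doubling the density only doubles it.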
I describe a Bayesian method that uses summary statistics measured from microsatellite loci to make inferences about demographic parameters in 2- and 3-population models. Preliminary results with an infinite sites model of sequence evolution are also described. The method can be used to infer effective sizes of current and ancestral populations, immigration rates, splitting times and tree topology (in the 3-population case). A novel method for model selection is introduced. Comparisons are made with the IM program of Hey and Nielsen, and a data set of 19 microsatellite loci from Channel Island foxes is analysed. It is concluded that the method is competitive with IM on 2-population data. There appears to be little scope for accurate inference with microsatellite data unless very large numbers of loci are used.
The results of human action, from the scale of the climate to the niche, will dominate our evolutionary future. Meaningful ways of intersecting theoretical and empirical studies with conservation and management of our natural resources are important. Using tropical tree communities as an example, the application of phylogenetic and biogeographic evidence to mitigating some of this change will be discussed. Emergent questions, with implications for the utility of these data, will be explored. These questions have inspired the development of a DNA microarray-based technique for gathering genomic samples of neutral variation in previously unstudied organisms. The approach, called Hyperdispersed Illiterate Primer Screening (HIPS), will be particularly effective for developing a database of genomic signatures that can allow phylogenetically scalable queries, virtual subtractive hybridizations, and the rapid development of simple downstream bioassays for screening large numbers of individuals.
Speciation within the western Melanoplus grasshoppers has probably been influenced by Pleistocene climate change. In order to test models of speciation in this clade, we used the isolation-with-migration model to estimate the species divergence between Melanoplus montanus and M. oregonensis. We created a genomic library, gathered sequence data from four anonymous nuclear loci and the mitochondrial cytochrome oxidase I gene, and then estimated population divergence at approximately 200,000 years before present. However, due to widespread incomplete lineage sorting in western Melanoplus, we cannot be confident that M. montanus and M. oregonensis are sister taxa. In order to explore the potential bias introduced by incomplete taxon sampling, we conducted a simulation study. It suggested that incomplete taxon sampling can cause the ancestral effective population size to be underestimated and population divergence to be overestimated, particularly as the actual population divergence increases.
The problem of inferring trees of closely related species from multilocus data sets suffers from a lack of robust implementations of existing theory and from a lack of empirical data to help set priorities for new directions. We have been accumulating multilocus DNA sequence data sets of anonymous, noncoding regions of Australian songbird genomes to examine the historical demography of speciation and population structure. Using two data sets from northern Australia, one from grassfinches (Poephila) and one from treecreepers (Climacteris), I illustrate the potential of anonymous loci to provide a higher resolving power for current and ancestral population parameters than mitochondrial DNA, and for inferring relationships among closely related species when gene trees conflict with one another. However, our studies also pinpoint several gaps in existing software packages that prevent full exploration of the data. In particular, our data reveal the need for an integrated approach to estimating the sequence of speciation events (species phylogeny) from multilocus data sets that does not require a priori assumptions. In addition, the data sets reveal a need for analyses of gene flow that can encompass more than one species even when there is no current gene flow between those species. These studies, like those in Drosophila and humans, show that even phylogeographic analyses focused on single species will in general require analysis of sequence data from multiple species, especially those that continue to share residual polymorphisms with the focal species, and will require implementations of theory that can accommodate multispecies data sets.
A unique gene tree describing the mutation history of a sample of DNA sequences can be constructed as a perfect phylogeny under an assumption of non-recurrent point mutations. An empirical distribution of the stochastic history of the gene tree, conditional on its topology, can be found by an advanced simulation technique of importance sampling on coalescent histories. The distribution of the time to the most recent common ancestor and of the ages of mutations in the gene tree, conditional on its topology, can be found from this empirical distribution. This talk will present examples of ancestral inference from gene trees and microsatellite data, and will sketch the importance sampling technique.
When geological data suggest that co-distributed taxon-pairs arose simultaneously from allopatric divergence across an emergent biogeographic barrier, one often finds elevated variability in genetic divergences across phylogeographic datasets. Assuming there are no undetected extinctions or large variation in mutation rates, such disparity in genetic divergences often leads to ecologically deterministic explanations for the variance in divergence times without accounting for mutational and coalescent variance. To test for simultaneous divergence across phylogeographic datasets while accounting for variability associated with such stochastic processes, we combine Beaumont's flexible approximate Bayesian computational (ABC) framework and a finite-sites version of Hudson's coalescent simulator. This highly parameterized framework is extensively tested across a range of conditions and is shown to be somewhat accurate with only single-locus mitochondrial data. We use this method to reject a history of simultaneous vicariance in eight taxon-pairs of tropical sea urchins thought to have arisen by the rise of the Panamanian Isthmus ~3.1 Mya, with only 3% of the posterior density signifying a history of simultaneous vicariance. By simulation, two of the taxon-pairs are suggested to be outliers, and after their removal, the posterior density suggests a history of concordance in divergence times, resulting in an estimate of the CO1 mtDNA mutation rate of 1.07% per million years.
Joint work with Eli Stahl and Harilaos Lessios.
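The general idea behind approximate Bayesian computation can be sketched with a plain rejection sampler. This is a toy illustration under stated assumptions, not the regression-based framework or the finite-sites coalescent simulator used in the talk: the simulator below is hypothetical and treats each pairwise divergence as a divergence time tau plus an exponentially distributed ancestral coalescent wait with mean 1, and all function names are invented for the example.

```python
import random


def simulate_summary(tau: float, n_pairs: int, rng: random.Random) -> float:
    """Toy simulator: mean divergence over n_pairs sequence pairs.

    Each divergence is tau plus an exponential ancestral coalescent
    wait with mean 1 (an assumption made only for this sketch).
    """
    total = sum(tau + rng.expovariate(1.0) for _ in range(n_pairs))
    return total / n_pairs


def abc_rejection(observed: float, n_pairs: int, n_draws: int,
                  tolerance: float, prior_max: float, seed: int = 0) -> list:
    """Plain rejection ABC with a Uniform(0, prior_max) prior on tau.

    A prior draw is kept when its simulated summary statistic falls
    within `tolerance` of the observed summary statistic.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        tau = rng.uniform(0.0, prior_max)
        if abs(simulate_summary(tau, n_pairs, rng) - observed) <= tolerance:
            accepted.append(tau)
    return accepted


if __name__ == "__main__":
    # Hypothetical observed summary; under the toy model it corresponds to tau near 2.
    posterior = abc_rejection(observed=3.0, n_pairs=50, n_draws=5000,
                              tolerance=0.1, prior_max=5.0)
    print(len(posterior), sum(posterior) / len(posterior))
```

The accepted draws approximate the posterior of tau given the summary statistic; a smaller tolerance gives a closer approximation at the cost of fewer accepted draws.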
Conditioning out phylogenetic information in HIV sequences, we performed multivariate studies of eventual drug-resistant mutations using multidimensional scaling and correspondence analysis methods. We propose several approaches to the problem of correlated variables in this context.
Most methods for detecting Darwinian natural selection at the molecular level rely on estimating the rates or numbers of nonsynonymous and synonymous changes in an alignment of protein-coding DNA sequences. In some of these methods, the nonsynonymous rate of substitution is allowed to vary across the sequence, permitting the identification of single amino-acid positions that are under positive natural selection. However, it is unclear which probability distribution should be used to describe how the nonsynonymous rate of substitution varies across the sequence. One widely used solution is to model variation in the nonsynonymous rate across the sequence as a mixture of several discrete or continuous probability distributions. Unfortunately, there is little population genetics theory to inform us of the appropriate probability distribution for among-site variation in the nonsynonymous rate of substitution. Here, we describe an approach to modeling variation in the nonsynonymous rate of substitution using a Dirichlet process mixture model. The Dirichlet process allows there to be a countably infinite number of nonsynonymous rate classes, and is very flexible in accommodating different potential distributions for the nonsynonymous rate of substitution. We implemented the model in a fully Bayesian approach, with all parameters of the model considered as random variables.
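To illustrate how a Dirichlet process can leave the number of rate classes open-ended, here is a minimal sketch of the Chinese restaurant process, the partition prior that a Dirichlet process induces. It only draws a prior assignment of sites to classes; it is not the authors' sampler, it attaches no rate values or likelihood to the classes, and the function names and the concentration parameter name alpha are assumptions made for the example.

```python
import random


def crp_partition(n_sites: int, alpha: float, seed: int = 0):
    """Assign sites to rate classes by the Chinese restaurant process.

    Site i (0-based) joins existing class k with probability
    counts[k] / (i + alpha) and opens a new class with probability
    alpha / (i + alpha).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    assignments = []  # class index for each site
    counts = []       # number of sites currently in each class
    for i in range(n_sites):
        u = rng.random() * (i + alpha)
        cumulative = 0.0
        chosen = len(counts)  # default: open a new class
        for k, c in enumerate(counts):
            cumulative += c
            if u < cumulative:
                chosen = k
                break
        if chosen == len(counts):
            counts.append(1)
        else:
            counts[chosen] += 1
        assignments.append(chosen)
    return assignments, counts


def expected_classes(n_sites: int, alpha: float) -> float:
    """Prior expected number of classes: sum over i of alpha / (alpha + i)."""
    return sum(alpha / (alpha + i) for i in range(n_sites))


if __name__ == "__main__":
    # Hypothetical alignment length and concentration parameter.
    assignments, counts = crp_partition(n_sites=300, alpha=1.0)
    print(len(counts), expected_classes(300, 1.0))
```

The number of occupied classes grows slowly with the number of sites and is controlled by alpha, which is why no fixed number of rate categories needs to be chosen in advance.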
The climatic cycles of the Quaternary have influenced the distribution of many species. Despite the large amounts of data available, there is still a need for theoretical studies to help understand the genetic consequences of those cycles. We have used a coalescent approach to examine some of these consequences. We have modeled demographic history as two alternating phases corresponding to glacial and interglacial periods. An island model was assumed for both phases, and the transition between them happened in one generation. The number of demes, deme size and migration rate were allowed to differ between the two phases. We have examined the effects of cyclic changes of both population size and structure. For a sample of two alleles from the same or separate demes we have obtained the distribution of coalescence times and its mean as functions of these population parameters and the duration of each type of phase. Deme size and migration rate were kept smaller in the glacial phases, but the number of demes could be larger, as when glaciations cause fragmentation. Reduced deme size during glacials produced peaks in the distribution of coalescence times and made the mean times small in some cases, whereas the effect of increased structure during these periods (e.g. from a smaller migration rate) attenuated these peaks and stretched the genealogy. These results are in accordance with inferences of long genealogies for many species, predating the last glaciation. This approach may help further our understanding of the genetic consequences of climatic cycles.
Joint work with Vera N. Solferini, Jon F. Wilkins, and John Wakeley.
It is now well known that incomplete lineage sorting can cause serious difficulties for phylogenetic and phylogeographic inference. Yet little attention has been paid to methods that attempt to overcome these difficulties by explicitly considering the processes that produce them. Here I explore approaches to historical inference designed to consider retention and sorting of ancestral polymorphism. I examine how the reconstructability of species (or population) histories is affected by (a) the number of loci used to estimate the phylogeny and (b) the number of individuals sampled per species (or population). Even in difficult cases with considerable incomplete lineage sorting (divergence times separated by less than 1 Ne generations), accurate historical reconstructions are possible, as long as a reasonable number of individuals and loci are sampled. Moreover, tradeoffs between sampling more loci versus more individuals shift depending on the depth of the species history under study. Taken together, these results demonstrate that gene sequences retain enough signal to achieve an accurate estimate of history despite widespread incomplete lineage sorting. Continued methodological improvements for inference near the species level require not only a statistical framework for evaluating the likelihood of particular gene trees, but also a shift to compound models that consider the molecular evolutionary process of nucleotide substitutions, as well as the population genetic processes of lineage sorting.
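To give a sense of how severe lineage sorting can be at the divergence depths mentioned above, the sketch below evaluates the standard three-species coalescent result that a gene tree matches the species tree with probability 1 - (2/3) * exp(-T), where T is the internal branch length in coalescent units. This formula is not stated in the abstract; the sketch assumes one sampled lineage per species, a single locus, a known gene tree, and a coalescent unit of ploidy * Ne generations, and the function name is illustrative.

```python
import math


def concordance_probability(t_generations: float, ne: float, ploidy: int = 2) -> float:
    """Probability that a three-taxon gene tree matches the species tree.

    t_generations -- generations separating the two speciation events
    ne            -- effective size of the ancestral population on that branch
    ploidy        -- 2 for diploid nuclear loci (coalescent unit = 2 * Ne generations)
    """
    if ne <= 0 or t_generations < 0:
        raise ValueError("ne must be positive and t_generations non-negative")
    t_coalescent = t_generations / (ploidy * ne)
    return 1.0 - (2.0 / 3.0) * math.exp(-t_coalescent)


if __name__ == "__main__":
    # Divergences separated by 1 Ne generations, with a hypothetical Ne of 10,000.
    print(concordance_probability(t_generations=10_000, ne=10_000))
```

Under these assumptions, speciation events separated by 1 Ne generations give a concordance probability of roughly 0.6 for a single diploid locus, which is why sampling several loci and individuals matters at this depth.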
It has long been known that gene trees need not agree with species trees, because of deep coalescence, gene duplication and loss, and horizontal transfer. Reconciling these differences in order to estimate the species tree is an active area of research in molecular evolutionary biology. Here, we propose a Bayesian method to estimate the phylogenetic tree of a group of species using multiple estimated gene tree distributions, such as those that arise in a Bayesian analysis of DNA sequence data. It is assumed that the DNA sequences are conditionally independent of the species tree given the gene tree. The whole process can be represented as a 2-step Markov chain, from species tree to gene tree and from gene tree to DNA sequences. The first step can be explained by coalescent theory. The second step follows an evolutionary model at the molecular level. For each group of DNA sequences, MrBayes is used together with an importance sampling technique to sample gene trees from the posterior distribution P(gene tree | DNA sequences). Given those gene trees, species trees are sampled from their posterior distribution P(species tree | gene tree). The species trees generated follow the posterior distribution P(species tree | DNA sequences). Posterior distributions of parameters such as the population sizes of ancestral species are estimated as well. Multiple chains are used to monitor convergence. The computing time is linear in the number of genes. The method is applied to both simulated and experimental DNA sequence data.
Evolutionary biogeography seeks to determine the effects of fluctuations in habitat distribution and occupancy on properties of biological diversity (community, species and genetic), as mediated by spatio-temporal patterns of speciation, local extinction and dispersal/colonization. Estimation of population responses to past climate and/or geological change is of central importance and is most powerful when combined with independent evidence (or hypotheses) about past habitat distributions. We present a rich empirical case study concerning the fauna of rainforests from north-east Australia. Previously, we have developed comparative mtDNA phylogeographies and qualitatively compared the results to models of species or rainforest distributions under paleoclimates inferred from the cool-dry conditions of the Last Glacial Maximum, through cool-wet (8-6 Kya) and warm-wet (5-3 Kya) phases of the Holocene, to the present. These models and molecular data suggest considerable variation in the temporal and spatial scale of species responses to rainforest fluctuation, but are limited in that only a single locus has been examined (albeit for many species) and by the use of naive distribution models. We are now generating multi-locus data to improve the precision of parameter estimates, and initial studies reveal both the promise of such data and significant limitations of current analytical methods. The greatest challenge is to incorporate spatially explicit hypotheses about habitat fluctuation directly into parameter estimation (see also talk by S. Baird; poster by M. Hickerson).
Joint work with C. Graham, S. Williams, J. MacKenzie, M. Hickerson, G. Dolman, A. Moussalli, C. Hoskin, A. Hugall, K. Bell, M. Tonione, and A. Carnaval.
Recent studies on the conserved amino acid regions of phosphoglycerate kinase (PGK) and other examples with known crystallographic structure suggest that the rate of their evolution may depend on their location in the molecule. PGK is a highly conserved enzyme central to the process of fermentation that is found in all Kingdoms in nature. While the most conserved amino acids tend to be near the functional core of the molecule, the least conserved amino acids are mainly found on the periphery. In this study, we examine the relationship between the rate variation of each amino acid site and its distance from its associated metal ion ligand (Mg2+). Based on these results, we propose a refined model of molecular evolution in which the rate variation at each amino acid site depends on the Euclidean distance to the metal ligand.
Joint work with Dennis K. Pearl and J. Dennis Pollack.
The phylogenetic relationships among most metazoan phyla remain uncertain. Here, we obtained large numbers of gene sequences from metazoans, including key understudied taxa. Despite the amount of data and breadth of taxa analyzed, relationships among most metazoan phyla remained unresolved. In contrast, the same genes robustly resolved phylogenetic relationships within a major clade of Fungi of approximately the same age as the Metazoa. The differences in resolution within the two Kingdoms suggest that the early history of metazoans was a radiation compressed in time, in agreement with paleontological inferences. Furthermore, simulation analyses as well as studies of other radiations in deep time indicate that, given adequate sequence data, the lack of resolution in phylogenetic trees is a signature of closely spaced series of cladogenetic events.
New methods for applying genetic data to questions of historical biogeography have revolutionized our understanding of how organisms have moved around the planet to occupy their present distributions. Increasingly sophisticated phylogenetic methods, especially in combination with divergence time estimation, can reveal biogeographic centers of origin, differentiate between hypotheses of vicariance and dispersal, and in the latter case, reveal the directionality of dispersal events. Despite their power, however, phylogenetic methods often yield patterns that are compatible with multiple equally well-supported biogeographic hypotheses. We describe a multi-disciplinary approach to this problem, using a combination of coalescent, population genetic, and ecological analyses to discriminate among multiple phylogenetically well-supported hypotheses. This approach is used to discern which of several dispersal hypotheses is most probable for a genus of Old-World leaf-nosed bats, given the available data. From these synthetic analyses of the data, we are able to conclude that the best-supported hypothesis involves two independent dispersal events from Africa to Madagascar. Furthermore, divergence dates estimated with coalescent methods suggest that the two dispersal events occurred quite recently in geological time.
Joint work with Steven M. Goodman and Anne D. Yoder.
Recent extensive analyses of human DNA polymorphism reveal that the time to the most recent common ancestor (TMRCA) at neutral loci seldom exceeds 2 myr. However, we recently found that the TMRCA at the CMP-N-acetylneuraminic acid hydroxylase (CMAH) locus is ca. 3 myr. The phylogenetic analysis of CMAH haplotypes shows two distinct lineages which diverged ca. 3 myr ago: one is represented by a single descendant haplotype in the sub-Saharan Biaka Pygmy population and the other by the common ancestral lineage of all other haplotypes, which began to diversify extensively 1 myr ago. For these two distinct lineages to be maintained this long, one may assume an operation of balancing selection at this locus. However, because CMAH has lost its function through Alu-mediated deletion of an exon, such selection is unlikely. In accord with this, neither Tajima's D nor the HKA test shows any signature of selection. Here we examine the possibility that African populations have been relatively large and partially isolated from each other throughout the Plio-Pleistocene, thereby contributing to greater genetic diversity than in non-African populations of young origins. Phylogenetic analyses of a dozen loci for which reliable haplotype data are available reveal that their ancestral haplotypes occur almost exclusively in African samples. Computer simulation confirms that this bias in the occurrence of ancestral haplotypes in African samples can be observed only under population structure in Africa and reduction of the effective size of the entire population since 2 myr ago. We also argue that there must have been some African populations which were not directly involved in the Out-of-Africa expansion in the late Pleistocene.
Joint work with Naoyuki Takahata.
Random models for species formation and loss have played an important role in evolutionary biology since Yule's pioneering work in the 1920s. More recently these models have been investigated for the light they shed on both the topological properties (shape, balance, clade distribution, discrete tree reconstruction, tree rooting) and metric properties (branch length distribution, phylogenetic diversity) of phylogenies. In this talk I describe how these models are relevant for tree reconstruction and rooting, and the distribution of clade sizes, as well as the loss (and optimization) of phylogenetic diversity as taxa go extinct. The talk will include some historical survey, as well as some recent (and new) results.
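A simple way to explore the topological properties mentioned above is to simulate tree shapes under the Yule (pure-birth) model, in which each new speciation splits a uniformly chosen extant lineage. The sketch below is an illustration only: it generates the topology without branch lengths, and it reports clade sizes and the Colless imbalance index as example summaries. The function names and the node-numbering scheme are assumptions made for the example.

```python
import random


def yule_topology(n_leaves: int, seed: int = 0) -> dict:
    """Simulate a Yule tree shape; returns {internal node id: (child id, child id)}.

    Node 0 is the root. Children always receive larger ids than their parent.
    """
    if n_leaves < 1:
        raise ValueError("n_leaves must be at least 1")
    rng = random.Random(seed)
    children = {}
    leaves = [0]
    next_id = 1
    while len(leaves) < n_leaves:
        idx = rng.randrange(len(leaves))  # every extant lineage is equally likely to split
        node = leaves[idx]
        left, right = next_id, next_id + 1
        next_id += 2
        children[node] = (left, right)
        leaves[idx] = left
        leaves.append(right)
    return children


def clade_sizes(children: dict, n_leaves: int) -> list:
    """Number of leaves below every node, computed from the highest id downwards."""
    n_nodes = 2 * n_leaves - 1
    size = [1] * n_nodes
    for node in range(n_nodes - 1, -1, -1):
        if node in children:
            left, right = children[node]
            size[node] = size[left] + size[right]
    return size


def colless_index(children: dict, n_leaves: int) -> int:
    """Sum over internal nodes of the absolute difference in child clade sizes."""
    size = clade_sizes(children, n_leaves)
    return sum(abs(size[a] - size[b]) for a, b in children.values())


if __name__ == "__main__":
    tree = yule_topology(20, seed=1)
    print(colless_index(tree, 20))
```

Repeating the simulation over many seeds gives an empirical null distribution of balance or clade sizes against which an observed phylogeny could be compared.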
Genomics research is generating vast molecular sequence data ranging from single genes to whole genomes across an increasing number of species. However, a fundamental difficulty in evolutionary studies emerges as the availability of sequences expands. Phylogenetic methods to reconstruct the evolutionary tree relating the sequences traditionally condition on a single, sometimes poorly estimated sequence alignment, where an alignment specifies which residues in the sequences derive from a common origin. This conditioning can cause bias and inappropriate inference in genomic studies, particularly when the sequences are highly diverse. For example, the early branching order of Bacteria, Archaea and Eukaryotes, the three major domains of life, is troublesome to determine.
As a solution, I describe a novel Bayesian model for simultaneously estimating alignments and the phylogenetic trees that relate the sequences. This sidesteps the bias issue inherent in sequential estimation. Joint estimation also allows one to model rate variation between sites when estimating the alignment and to use the evidence in shared insertion/deletions (indels) in the sequences to group sister species in the tree. I base this indel process on a Hidden Markov Model that makes use of affine gap penalties and considers indels of multiple residues.
I develop a Markov chain Monte Carlo (MCMC) method to sample from the posterior of the joint model, estimating the most probable alignment and tree and their support simultaneously. I describe a new MCMC transition kernel based on the Forward-Backward algorithm and a careful choice of parameter marginalization that improves our algorithm's mixing efficiency, allowing the MCMC chains to converge even when started from arbitrary alignments. Finally, my software implementation can estimate alignment uncertainty and I describe a method for summarizing this uncertainty in a single plot.
Phylogenetic trees, also known as evolutionary trees, model the evolution of biological species or genes from a common ancestor. Most computational problems associated with phylogenetic tree reconstruction are very hard (specifically, they are NP-hard, and are hard in practice, as real datasets can take years of analysis without provably optimal solutions being found). Finding ways of speeding up the solutions to these problems is of major importance to systematic biologists. Other approaches take only polynomial time and have provable performance guarantees under Markov models of evolution; however, our recent work shows that the sequence lengths that suffice for these methods to be accurate with high probability grow exponentially in the diameter of the underlying tree.
In this talk, we will describe new dataset decomposition techniques, called the Disk-Covering Methods, for phylogenetic tree reconstruction. This basic algorithmic technique uses interesting graph theory, and can be used to reduce the sequence length requirement of polynomial time methods, so that polynomial length sequences suffice for accuracy with high probability (instead of exponential). We also use this technique to speed up the solution of NP-hard optimization problems, such as maximum likelihood and maximum parsimony.