Reliable identification of post-translational modifications is key to understanding various cellular processes. We describe a tool, InsPecT, to identify post-translational modifications using tandem mass spectrometry data. The tool is based upon novel algorithms for the following: (a) constructing tag-based filters using a novel de novo interpretation algorithm that works in the presence of modifications, where the sequence tags help eliminate much of the database while retaining the true peptide; (b) a fast trie-based search for scanning the database with sequence tags; (c) a dynamic programming technique to identify candidate peptides with modifications without explicit enumeration of the modifications; (d) a scoring algorithm that is rapidly reconfigured for differing fragmentation propensities and is independent of the length of the peptide; and (e) a novel quality score computation based on an optimization of complementary features for evaluating quality. The tool was tested on a number of real and simulated data sets. InsPecT can search for modified and unmodified peptides faster than other database search tools. We identified a large number of modified peptides, including several novel phosphopeptides in data sets provided by the Alliance for Cellular Signaling.
Joint work with Stephen Tanner, Hongjun Shu, Ari Frank, Marc Mumby, and Pavel Pevzner.
The analysis of mass spectrometry data is still largely based on identification of single MS/MS spectra and does not attempt to make use of the extra information available in multiple MS/MS spectra from partially or completely overlapping peptides. Analysis of MS/MS spectra from multiple overlapping peptides opens up the possibility of assembling MS/MS spectra into entire proteins, similarly to the assembly of overlapping DNA reads into entire genomes. This presentation will focus on new methods to detect, score and interpret overlaps between uninterpreted MS/MS spectra in an attempt to sequence entire proteins rather than individual peptides. This approach not only extends the length of reconstructed amino acid sequences but also dramatically improves the quality of de-novo peptide sequencing. Results will be presented using data from an ESI/IonTrap mass spectrometer.
Mass Spectrometry (MS) is a technology very well suited for high-throughput data acquisition, due to its speed and accuracy. Simplified, a mass spectrometer's input is a molecular mixture, and its output a list of masses of the sample molecules. The most well-known application in biotechnology is protein identification with database lookup, but MS is also increasingly used to analyze DNA and other biomolecules.
The sample biomolecules of an MS experiment can often be represented by strings over a weighted alphabet. Clearly, the order of characters cannot be determined from the weight. Thus, the problem leads to the study of weighted strings and compomers: A string's compomer is an integer vector specifying the number of occurrences of each character. We are interested in efficient algorithms for determining all or some compomers with a given mass, the number of such compomers, and related questions.
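As a minimal sketch of the counting question raised above, assume that character masses are positive integers (the toy alphabet below is purely illustrative and not taken from the talk). Under that assumption, the number of compomers with a given mass can be computed with a standard coin-change style dynamic program; this is not necessarily the algorithm presented by the authors.

```python
def count_compomers(masses, target):
    """Count non-negative integer vectors c with sum(c[i] * masses[i]) == target.

    masses -- distinct positive integer masses, one per alphabet character
    target -- non-negative integer mass to decompose
    """
    # ways[m] = number of compomers of total mass m over the characters seen so far
    ways = [0] * (target + 1)
    ways[0] = 1  # the empty compomer has mass 0
    # Processing one character at a time counts each compomer once,
    # regardless of the order of characters in the underlying string.
    for mass in masses:
        for total in range(mass, target + 1):
            ways[total] += ways[total - mass]
    return ways[target]


# Hypothetical three-letter alphabet with integer masses 2, 3 and 5
toy_masses = [2, 3, 5]
print(count_compomers(toy_masses, 10))
```

The table has `target + 1` entries and is updated once per character, so the running time grows with the product of alphabet size and target mass. Real molecular masses would first have to be scaled and rounded to integers, which is a further assumption of this sketch.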
One of the pressing problems in the context of mass spectrometry is to transform the MS raw data into a list of peaks with masses and possibly other attributes. To do this, the noise in the raw spectrum has to be filtered out and in order to get the real mass of a peak, we have to deconvolute the spectrum with the isotopic distribution of each peak. We are working on robust methods for doing so, based on the stochastic concept of regression analysis.
Last, we want to identify biological molecules measured by MS. Given a list of candidates (e.g. a database), the measured spectrum has to be compared to the theoretically predicted spectrum of each sequence and a score has to be computed as a measure of quality of match. Here, it is important to combine the freedom to adjust matching scores to an application's peculiarities, with a rigid statistical analysis of score distributions. We present an approach that allows easy and fast estimation of p-values of such scores for two important applications, Peptide Mass Fingerprints and Tandem Mass Spectrometry.
We present algorithms and software developed in our research group for several of these questions, along with experimental data.
Joint work with Michael Kaltenbach and Zsuzsanna Lipták.
Peptide identification from tandem mass spectra is an important enabling technology for high-throughput proteomics pipelines. The search engines that analyze these spectra, such as Mascot or SEQUEST, use amino-acid sequence databases, such as UniProt, to generate putative peptides to compare against each spectrum. We have developed a method by which the entire peptide content of such an amino-acid sequence database can be represented by a new, smaller, amino-acid sequence database. Existing search software can be sped up, without modification, by using this compressed sequence database. Further, since fewer peptides are scored against each spectrum, the statistical significance of the same peptide scores is improved, making the search more sensitive. With peptide identifications in hand, an exact sequence search in the original sequence database restores the protein context for each peptide. The effectiveness of this approach is demonstrated using Mascot and the UniProt family of amino-acid sequence databases.
We present a novel scoring method for de novo interpretation of peptides from tandem mass spectrometry data. Our scoring method uses a probabilistic network whose structure reflects the chemical and physical rules that govern the peptide fragmentation. We use a likelihood ratio hypothesis test to determine if the peaks observed in the mass spectrum are more likely to have been produced under our fragmentation model, than under a model that treats peaks as random events. We tested our de novo algorithm PepNovo on Ion-Trap data, and achieved results that are superior to popular de novo peptide sequencing algorithms. PepNovo can be accessed via the URL http://peptide.ucsd.edu/.
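The likelihood ratio idea can be illustrated with a deliberately simplified sketch. It assumes that every expected fragment peak is observed independently with one fixed probability, which is a much cruder model than the probabilistic network described above; the function name and the probabilities are hypothetical and are not PepNovo's.

```python
import math


def log_likelihood_ratio(observed, p_fragment, p_random):
    """Log-odds that a peak pattern came from fragmentation rather than chance.

    observed   -- list of booleans, one per expected fragment mass (peak seen or not)
    p_fragment -- assumed probability of a peak under the fragmentation model
    p_random   -- assumed probability of a peak at that mass under the random model
    """
    score = 0.0
    for seen in observed:
        if seen:
            # An observed peak supports the fragmentation model when p_fragment > p_random
            score += math.log(p_fragment / p_random)
        else:
            # A missing peak counts against the fragmentation model
            score += math.log((1.0 - p_fragment) / (1.0 - p_random))
    return score


# Two of three expected fragment peaks observed (illustrative probabilities)
print(log_likelihood_ratio([True, True, False], p_fragment=0.8, p_random=0.1))
```

A positive score means the observed peaks are better explained by the fragmentation model than by the random model, and a negative score means the opposite.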
Joint work with Pavel Pevzner.
Mass spectrometry is a burgeoning method for proteomic studies because it is a high-throughput method that offers low detection limits and high selectivity. Our work has focused on matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS). The signals obtained from mass spectrometry are intricate and can be influenced by the experimental design. Classification methods are useful for detecting biomarkers and making predictions based on mass spectral signals. MS of proteins from noninvasive samples has potential as a medical tool for early diagnosis of disease. Spectra from studies of amniotic fluid from women who had normal deliveries, normal deliveries with inflamed uteri, and premature deliveries were used for building classification models.
Fuzzy classification systems are considered soft methods that are based on the variance of the data. Soft methods are advantageous because they avoid overfitting the data and the curse of dimensionality. Fuzzy Rule-Building Expert Systems (FuRES) (1) are useful because an inductive classification tree is obtained as a model that may be interpreted. Using principal component compression, FuRES is simple, fast, reliable, and applicable to MS data. Coupled with Latin-partition methods, precision bounds may be obtained for evaluation of predictability.
(1) Harrington, P. B. Journal of Chemometrics 1991, 5, 467-486.
Joint work with Nancy E. Vieira and Alfred L. Yergey.
In this talk we describe methods to identify the differential expression of peptides and propose strategies to avoid MS/MS identification of peptides of interest. The algorithms are embedded in the freely available software library OpenMS, which is currently under development at the Freie Universität Berlin and the Eberhard Karls Universität Tübingen. We give an overview of the capabilities and design principles of OpenMS and demonstrate its ease of use. Finally, we describe projects in which OpenMS has been or will be deployed and thereby demonstrate its versatility.
Growth hormone (GH) regulates cell growth and differentiation primarily by modulating gene expression and metabolism in target tissues. Targeted disruption of the gene encoding the growth hormone receptor and binding protein (GHR/BP-/-) functionally inactivates GH and generates long-lived, dwarf mice with elevated circulating GH and markedly reduced insulin-like growth factor-1 (IGF-1) levels (1, 2). Indeed, insulin/IGF-1 signaling has been shown to be a critical determinant of lifespan in several species. GHR/BP-/- mice also have decreased fasting insulin and glucose levels (3) and appear to resist complications due to streptozotocin (STZ)-induced diabetes (4). To determine the consequences of the GHR/BP-/- mutation on gene expression in the central nervous system (CNS), brain tissue was harvested from normal and gene-disrupted mice at different developmental stages (young, adult and aged) and proteins were isolated from distinct subcellular fractions (nucleus, cytoplasm, polysomes) using differential gradient ultracentrifugation. The proteins in each fraction were resolved by two-dimensional gel electrophoresis and stained with the fluorescent dye SYPRO Orange. The images were captured with a high-resolution CCD camera (Bio-Rad Versa-Doc 3000) or a laser-scanning device (Fuji FLA-3000G) and quantitatively analyzed with PDQuest or Image Gauge software packages. Differentially expressed proteins were manually excised from the gels and identified by mass spectrometry. Of the hundreds of proteins resolved, several were differentially expressed in the brains of GHR/BP-/- mice relative to controls. The goal is to identify those proteins whose expression patterns are spatially and temporally correlated and establish functional protein networks that may delay or attenuate age-related tissue dysfunction or diabetic complications. This work was supported in part by the State of Ohio's Eminent Scholar Program which includes a gift from Milton and Lawrence Goll.
For the identification of novel proteins using MS/MS, the best approach is to search the sequence tags obtained by de novo sequencing against a protein sequence database. However, de novo sequencing very often gives only partially correct sequence tags. The most common type of error found in the sequence tags is same-mass segment replacement, i.e. a segment of amino acids is replaced with another one of the same mass. Current database search software such as MS-BLAST cannot handle these errors in the sequence tags. We developed a new efficient algorithm to align sequence tags from de novo sequencing with database sequences to identify proteins. This talk introduces the algorithms and implementation details of the SPIDER software.
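To make the same-mass replacement error concrete, the following sketch lists short residue combinations whose total mass matches that of a given segment. It only illustrates the error type and is not the SPIDER alignment algorithm; the function names, the restriction to eight residues, and the 0.01 Da tolerance are assumptions made for this example.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses in Da for a small subset of amino acids
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "N": 114.04293, "D": 115.02694,
    "Q": 128.05858, "K": 128.09496, "E": 129.04259, "W": 186.07931,
}


def segment_mass(segment):
    """Sum of residue masses; the order of residues does not matter."""
    return sum(RESIDUE_MASS[aa] for aa in segment)


def same_mass(seg_a, seg_b, tol=0.01):
    """True if the two segments are indistinguishable by mass within tol Da."""
    return abs(segment_mass(seg_a) - segment_mass(seg_b)) <= tol


def same_mass_replacements(segment, max_len=2, tol=0.01):
    """Residue combinations (sorted strings) of up to max_len residues that
    match the mass of segment but differ from it in composition."""
    target = segment_mass(segment)
    hits = []
    for length in range(1, max_len + 1):
        for combo in combinations_with_replacement(sorted(RESIDUE_MASS), length):
            candidate = "".join(combo)
            if sorted(candidate) == sorted(segment):
                continue  # same composition, so not a replacement
            if abs(segment_mass(candidate) - target) <= tol:
                hits.append(candidate)
    return hits


print(same_mass_replacements("N"))  # candidates that could stand in for asparagine
print(same_mass("AD", "EG"))        # two dipeptide segments with identical composition mass
```

With a looser tolerance, as on lower-resolution instruments, more pairs become confusable; for example, Q and K differ by only about 0.036 Da.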
The shotgun proteomics approach has been used increasingly for high-throughput analysis of complex protein samples. A major challenge lies in the consistent, objective and transparent analysis of the large amounts of data generated by such experiments and in their dissemination and publication. The first part of this presentation will focus on various statistical measures and approaches for estimating the confidence level of peptide identifications made by MS/MS database searching, including p-values, expectation values, reverse database searching, and Bayesian classification. A comparison will be made with methods developed for the analysis of other types of data such as microarray gene expression.
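Of the approaches listed above, reverse database searching is the simplest to sketch. The example below assumes that spectra are searched separately against the real database and against a reversed copy, and that the number of reversed hits above a score threshold approximates the number of false hits among the real ones. The function names and scores are hypothetical, and this simple ratio is only one of several estimators in use.

```python
def reverse_protein(sequence):
    """Build a decoy entry by reversing a protein sequence."""
    return sequence[::-1]


def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate the false discovery rate among target hits scoring >= threshold.

    target_scores -- best match score per spectrum against the real database
    decoy_scores  -- best match score per spectrum against the reversed database
    """
    targets = sum(1 for score in target_scores if score >= threshold)
    decoys = sum(1 for score in decoy_scores if score >= threshold)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing can be false
    # Decoy hits above the threshold stand in for false target hits
    return min(1.0, decoys / targets)


# Illustrative scores only
target = [50, 40, 30, 20, 10]
decoy = [25, 12, 8]
print(reverse_protein("PEPTIDE"))
print(estimate_fdr(target, decoy, threshold=20))
```

Raising the threshold lowers the estimated error rate at the cost of accepting fewer identifications, which is the trade-off the confidence measures above are meant to quantify.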
Identification of peptides from MS/MS spectra represents the first step in the computational analysis of shotgun proteomics data. Most often, the goal of the experiment is to infer what proteins are present in the original sample. A statistical model for assembling peptides into proteins and computing protein probabilities will be presented. Special attention will be paid to the problem of non-random grouping of peptides according to their corresponding proteins (the 'single hit' identification problem). Furthermore, limitations of shotgun proteomics with respect to the accurate characterization of protein isoforms and mature protein forms will be discussed. As in the shotgun DNA fragment sequence assembly problem, the presence of 'degenerate' peptides (peptides whose sequence is present in multiple proteins) makes it difficult to infer what proteins are present in the sample. An informatics approach for dealing with degenerate peptides and presenting protein identification results to the biologists analyzing the data will be described.
Parallel protein measurements, aka proteomics, have the potential to provide information on biological systems in isolation as cell culture systems, tissues or in an organism. Whereas parallel measures of transcript (mRNA) abundance can be multiplexed more easily through microarray analysis of even small quantities of sample following amplification using PCR, parallel measures of protein abundance are more difficult due to the heterogeneity of protein properties compared with nucleic acids, and the inability to amplify the signal. However, despite these drawbacks much useful data can be generated, but the interpretation of such data sets is challenging. This presentation will focus on the kinds of datasets that are generated from a range of proteomics approaches including mass spectrometry (both LC-MS and MS/MS data) for unbiased analyses and multiplexed protein assays for targeted analyses. How and where such assays are employed and the pros and cons of such approaches will also be discussed.
In this talk we describe methods to reduce the amount of data obtained by (multi)-dimensional HPLC/MS experiments. The algorithms are embedded in the freely available software library OpenMS, which is currently under development at the Freie Universität Berlin and the Eberhard Karls Universität Tübingen. We give an overview of the goals and problems in differential proteomics with HPLC and then describe in detail the implemented approaches for signal processing, peak detection and data reduction currently employed in OpenMS.
The presentation focuses on a two-dimensional probability model for peptide identification using tandem mass spectra and amino acid sequence databases. Probability models are developed for the two parameters that most affect the quality of peptide identification - the number of product ion matches and the sum of the product ion abundances. Both models are derived from the direct comparison of an experimental tandem mass spectrum to amino acid sequences from the protein database. The probabilities obtained from each model are correlated and normalized to derive a single score - the significance of peptide identification.
The talk will discuss the comparison of the approach to other database search algorithms.
The analysis of complex protein mixtures by LC-MS is one of the key technologies for systematic large-scale observation and modeling of cellular processes. Mass spectrometry itself is exquisitely sensitive, and reproducibility at the signal level is high. It is known, however, that a significant number of peptides - especially those with modifications - are missed in current computational analyses. We present a new approach to the computational interpretation of the experimental data that globally integrates all data of one or more experiments instead of interpreting spectra one-by-one. Instead of attempting to detect the presence of a protein or its fragments from individual signals (peaks) in a single mass spectrum, all data acquired across a whole experiment are first aligned into an (n+1)-dimensional space, where n is the number of dimensions used for the LC separation. This condenses all peaks generated by the same protein fragment throughout the experiment into a single dense signal, which allows a much better separation of signal and noise. The main computational challenge is to compensate for fluctuations in the separation process. We will present algorithms for the implementation of this approach, and demonstrate its usefulness using case studies in one and two dimensions. This is joint work with Amol Prakash, and the research groups of Ruedi Aebersold and Amanda Paulovich in Seattle.
Database-searching programs generally identify only a fraction of the spectra acquired in a standard LC/MS/MS study of digested proteins. By using a mass-based alignment algorithm of de novo sequencing results, OpenSea can sometimes perform better than this because it can also identify modified peptides. However, OpenSea is dependent on de novo sequencing algorithms that usually cannot derive accurate sequences from low-quality MS/MS spectra. Conveniently, many database-searching programs are well suited for matching peptide sequences to low-quality data. To leverage this dichotomy, we have developed an algorithm to probabilistically combine the results of multiple search engines, including SEQUEST, Mascot, X!Tandem, and OpenSea. We have found that we normally gain 5% to 20% more MS/MS spectrum identifications with each additional search engine we use, primarily due to increased confidence in low-scoring matches. In addition, we use rank-based clustering to mine information from the remaining spectra. First, we remove redundant results by clustering unmatched spectra to other spectra identified by the database-searching programs. Then we identify potentially interesting unmatched spectra by looking for spectral duplication and using high-quality filters. These results are singled out for further modification discovery analysis or manual interpretation.
With the easy availability of ultraprecise mass spectrometric data, the accurate mass of biomolecules is becoming a physical quantity of high interest in bioanalytical methodology. MALDI/ESI FT-ICR mass spectrometry, especially when combined with convenient and versatile ion-manipulating devices such as quadrupolar ion traps, now makes it easy to determine the amino acid composition of medium-size unknown peptides by combinatorial calculations on parent and fragment ion masses. This new method, which in a second step allows completely unknown peptides to be sequenced reliably ("Composition-Based Sequencing (CBS)"), appears to open a wide new field of bioanalytical investigation in proteomics.
CBS appears to have some fundamental advantages over common de novo sequencing strategies, since it does not require prior knowledge of underlying fragmentation mechanisms or of peptide-specific ionization and fragmentation behavior. While classical strategies usually try to verify the presence of expected fragment ion signals, CBS instead interprets the observed accurate mass values of precursor and fragment ions with respect to possible amino acid combinations by means of combinatorial logic. The potential and limitations of the method will be discussed in the light of the expected evolution of high-accuracy mass spectrometry in the coming years.
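The combinatorial step can be sketched as an exhaustive search for amino acid compositions whose summed residue masses, plus one water, match an accurate neutral peptide mass within a ppm tolerance. This sketch covers only the precursor mass and a five-residue alphabet, whereas the method described above also uses fragment ion masses; the function name, the alphabet and the 5 ppm default are assumptions for illustration.

```python
WATER = 18.01056  # monoisotopic mass of H2O in Da

# Monoisotopic residue masses in Da for a small illustrative alphabet
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
}


def compositions(neutral_mass, ppm=5.0, masses=RESIDUE_MASS):
    """Return all residue compositions (dicts of residue -> count) whose
    peptide mass lies within ppm of neutral_mass."""
    tol = neutral_mass * ppm * 1e-6
    target = neutral_mass - WATER  # mass to be explained by residues alone
    letters = sorted(masses)
    results = []

    def extend(index, remaining, counts):
        if index == len(letters):
            if abs(remaining) <= tol:
                results.append(dict(counts))
            return
        aa = letters[index]
        count = 0
        # Try 0, 1, 2, ... copies of this residue until the mass is overshot
        while remaining - count * masses[aa] >= -tol:
            chosen = counts + [(aa, count)] if count else counts
            extend(index + 1, remaining - count * masses[aa], chosen)
            count += 1

    extend(0, target, [])
    return results


# Neutral mass of a peptide with composition G2 A1 S1 (computed from the table above)
mass = 2 * 57.02146 + 71.03711 + 87.03203 + WATER
print(compositions(mass))
```

The tighter the mass tolerance, the fewer compositions survive, which is why the approach depends on high-accuracy instruments; with the full set of twenty residues the search space grows quickly and would need pruning.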
 B. Spengler, De Novo Sequencing, Peptide Composition Analysis and Composition-based Sequencing: A new Strategy Employing Accurate Mass Determination by Fourier Transform Ion Cyclotron Resonance Mass Spectrometry. J Am Soc Mass Spectrom, 15 (2004) 704-715.
An enormous amount of biological data has been accumulated over the years, such as sequences, gene expression data, protein physical interactions, genetic interactions, protein complexes, protein localizations, etc. For a given biological problem of interest, most data sources contribute some, but not all, of the information needed. By combining different data sources intelligently, we are able to obtain a more complete picture of the problem of interest.
We present two examples of data integration. One is the estimation of reliability of observed protein interaction data sets using gene expressions and protein localizations. The integration of the two data sources can give a more accurate estimation of the reliability. The other example is protein function prediction combining protein interactions, complexes, and features of individual proteins based on a Markov Random Field model. We further study the relationship between gene lethality, protein interaction networks, and protein function annotation.
Joint work with Ting Chen.
Many researchers have reported biomarkers from mass spectrometry (MS) experiments. However, in many cases no experimental design was reported and the biomarkers did not correspond to any reliable signals in the data. In many cases artificial intelligence is used after the data are collected rather than using natural intelligence in designing the experimental measurements. This paper proposes and demonstrates the power of a rational experimental design applied to preliminary experiments directed towards discovery of biomarkers in amniotic fluid.
Combining analysis of variance with principal component analysis (ANOVA/PCA) provides a powerful tool for the discovery of biomarkers in chemical measurements of biological systems. This approach encourages the use of experimental design to separate the variation of the experimental hypothesis from other potentially confounding sources of variation. When the factors of the experiment are greater than the residual error, the variable loadings of the principal components can be interpreted without requiring a supervised rotation and thus avoid the curse of dimensionality that occurs with underdetermined data. A series of spectral score plots is obtained for each experimental factor, which allows easy interpretation by scientists who may not be proficient in advanced mathematical calculations. A conservative statistical test is presented to evaluate the significance of the experimental factors. Potential biomarker peaks can be validated through a univariate resolution measure.
Joint work with Nancy E. Vieira, Roberto Romero, and Peter de B. Harrington.