ENG: Bioinformatics: Scholarly Papers
Recent Submissions
Fungal microbiota profile in newly-diagnosed treatment-naïve children with Crohn's disease (Oxford University Press, 2016-11-03)
Korolev, Kirill; El Mouzan, Mohammad; Wang, Feng; Al Mofarreh, Mohammad; Menon, Rajita; Al Barrag, Ahmad; Al Sarkhy, Ahmad; Al Asmi, Mona; Hamed, Yassin; Saeed, Anjum; Dowd, Scot; Assiri, Asaad A.; Winter, Harland
BACKGROUND & AIMS: Although increasing evidence suggests a role for fungi in inflammatory bowel disease (IBD), data are scarce and mostly from adults. Our aim was to define the characteristics of fungal microbiota in newly-diagnosed treatment-naïve children with Crohn's disease (CD). METHODS: The children referred for colonoscopy were prospectively enrolled in the study at King Khalid University Hospital, King Saud University and Al Mofarreh Polyclinics in Riyadh. Tissue and stool samples were collected and frozen until sequencing analysis. The children with confirmed CD diagnosis were designated as cases and the others as non-IBD controls. 78 samples were collected from 35 children (15 CD and 20 controls). Statistical analysis was performed to investigate CD associations and diversity. RESULTS: CD-associated fungi varied with the level of the phylogenetic tree. There was no significant difference in abundance between normal and inflamed mucosa. Significantly abundant CD-associated taxa included Psathyrellaceae (p=0.01), Cortinariaceae (p=0.04), Psathyrella (p=0.004), and Gymnopilus (p=0.03). Monilinia was significantly depleted (p=0.03), whereas other depleted taxa, although not statistically significant, included Leotiomycetes (p=0.06), Helotiales (p=0.08), and Sclerotiniaceae (p=0.07). There was no significant difference in fungal diversity between CD and controls. CONCLUSIONS: We report highly significant fungal dysbiosis in newly diagnosed treatment-naïve CD children. Depleted and more abundant taxa suggest anti-inflammatory and proinflammatory potentials respectively. 
Further studies with larger sample size including functional analysis are needed to clarify the significance of the fungal community in the pathogenesis of CD.

Recurrent, Robust and Scalable Patterns Underlie Human Approach and Avoidance (Public Library of Science, 2010-5-26)
Kim, Byoung Woo; Kennedy, David N.; Lehár, Joseph; Lee, Myung Joo; Blood, Anne J.; Lee, Sang; Perlis, Roy H.; Smoller, Jordan W.; Morris, Robert; Fava, Maurizio; Breiter, Hans C.
BACKGROUND. Approach and avoidance behavior provide a means for assessing the rewarding or aversive value of stimuli, and can be quantified by a keypress procedure whereby subjects work to increase (approach), decrease (avoid), or do nothing about time of exposure to a rewarding/aversive stimulus. To investigate whether approach/avoidance behavior might be governed by quantitative principles that meet engineering criteria for lawfulness and that encode known features of reward/aversion function, we evaluated whether keypress responses toward pictures with potential motivational value produced any regular patterns, such as a trade-off between approach and avoidance, or recurrent lawful patterns as observed with prospect theory. METHODOLOGY/PRINCIPAL FINDINGS. Three sets of experiments employed this task with beautiful face images, a standardized set of affective photographs, and pictures of food during controlled states of hunger and satiety. An iterative modeling approach to data identified multiple law-like patterns, based on variables grounded in the individual. These patterns were consistent across stimulus types, robust to noise, describable by a simple power law, and scalable between individuals and groups. Patterns included: (i) a preference trade-off counterbalancing approach and avoidance, (ii) a value function linking preference intensity to uncertainty about preference, and (iii) a saturation function linking preference intensity to its standard deviation, thereby setting limits to both. 
CONCLUSIONS/SIGNIFICANCE. These law-like patterns were compatible with critical features of prospect theory, the matching law, and alliesthesia. Furthermore, they appeared consistent with both mean-variance and expected utility approaches to the assessment of risk. Ordering of responses across categories of stimuli demonstrated three properties thought to be relevant for preference-based choice, suggesting these patterns might be grouped together as a relative preference theory. Since variables in these patterns have been associated with reward circuitry structure and function, they may provide a method for quantitative phenotyping of normative and pathological function (e.g., psychiatric illness).

Comparison of Proteomic and Transcriptomic Profiles in the Bronchial Airway Epithelium of Current and Never Smokers (Public Library of Science, 2009-4-9)
Steiling, Katrina; Kadar, Aran Y.; Bergerat, Agnes; Flanigon, James; Sridhar, Sriram; Shah, Vishal; Ahmad, Q. Rushdy; Brody, Jerome S.; Lenburg, Marc E.; Steffen, Martin; Spira, Avrum
BACKGROUND. Although prior studies have demonstrated a smoking-induced field of molecular injury throughout the lung and airway, the impact of smoking on the airway epithelial proteome and its relationship to smoking-related changes in the airway transcriptome are unclear. METHODOLOGY/PRINCIPAL FINDINGS. Airway epithelial cells were obtained from never (n=5) and current (n=5) smokers by brushing the mainstem bronchus. Proteins were separated by one-dimensional polyacrylamide gel electrophoresis (1D-PAGE). After in-gel digestion, tryptic peptides were processed via liquid chromatography/tandem mass spectrometry (LC-MS/MS) and proteins identified. RNA from the same samples was hybridized to HG-U133A microarrays. Protein detection was compared to RNA expression in the current study and a previously published airway dataset. 
The functional properties of many of the 197 proteins detected in a majority of never smokers were similar to those observed in the never smoker airway transcriptome. LC-MS/MS identified 23 proteins that differed between never and current smokers. Western blotting confirmed the smoking-related changes of PLUNC, P4HB1, and uteroglobin protein levels. Many of the proteins differentially detected between never and current smokers were also altered at the level of gene expression in this cohort and the prior airway transcriptome study. There was a strong association between protein detection and expression of its corresponding transcript within the same sample, with 86% of the proteins detected by LC-MS/MS having a detectable corresponding probeset by microarray in the same sample. Forty-one proteins identified by LC-MS/MS lacked detectable expression of a corresponding transcript and were detected in =5% of airway samples from a previously published dataset. CONCLUSIONS/SIGNIFICANCE. 1D-PAGE coupled with LC-MS/MS effectively profiled the airway epithelium proteome and identified proteins expressed at different levels as a result of cigarette smoke exposure. While there was a strong correlation between protein and transcript detection within the same sample, we also identified proteins whose corresponding transcripts were not detected by microarray. This noninvasive approach to proteomic profiling of airway epithelium may provide additional insights into the field of injury induced by tobacco exposure.

Improving the Precision of the Structure–Function Relationship by Considering Phylogenetic Context (Public Library of Science, 2005-6-24)
Shakhnovich, Boris E
Understanding the relationship between protein structure and function is one of the foremost challenges in post-genomic biology. Higher conservation of structure could, in principle, allow researchers to extend current limitations of annotation. 
However, despite significant research in the area, a precise and quantitative relationship between biochemical function and protein structure has been elusive. Attempts to draw an unambiguous link have often been complicated by pleiotropy, variable transcriptional control, and adaptations to genomic context, all of which adversely affect simple definitions of function. In this paper, I report that integrating genomic information can be used to clarify the link between protein structure and function. First, I present a novel measure of functional proximity between protein structures (F-score). Then, using F-score and other entirely automatic methods measuring structure and phylogenetic similarity, I present a three-dimensional landscape describing their inter-relationship. The result is a "well-shaped" landscape that demonstrates the added value of considering genomic context in inferring function from structural homology. A generalization of methodology presented in this paper can be used to improve the precision of annotation of genes in current and newly sequenced genomes. Synopsis. The author provides a novel perspective on a key problem of structural biology: the structure–function relationship in proteins. While relatedness in protein structure correlates with general description of function, attempts to use this relationship predictively are often complicated by its ambiguous nature. A structure encoded by a family of sequences may be implicated in a set of diverse functions across a variety of organisms. The author outlines an innovative approach that underlines the importance of considering genomic context when using structure-comparison methods for functional prediction. First, the author defines two distance measures: in genomic space and in function space. Then, the author describes a landscape of functional distance based on both structural and phylogenetic relatedness. 
It turns out that this landscape forms a "functional well" where proximity occurs when the structures are similar and occur in the same set of genomes. This result may have implications in future research into functional prediction. With the increasing pace of sequence deposition into databanks, this result suggests a simple way to improve functional prediction via structure homology by complementing existing methods with emerging techniques from comparative genomics.

Integrated Assessment of Genomic Correlates of Protein Evolutionary Rate (Public Library of Science, 2009-6-12)
Xia, Yu; Franzosa, Eric A.; Gerstein, Mark B.
Rates of evolution differ widely among proteins, but the causes and consequences of such differences remain under debate. With the advent of high-throughput functional genomics, it is now possible to rigorously assess the genomic correlates of protein evolutionary rate. However, dissecting the correlations among evolutionary rate and these genomic features remains a major challenge. Here, we use an integrated probabilistic modeling approach to study genomic correlates of protein evolutionary rate in Saccharomyces cerevisiae. We measure and rank degrees of association between (i) an approximate measure of protein evolutionary rate with high genome coverage, and (ii) a diverse list of protein properties (sequence, structural, functional, network, and phenotypic). We observe, among many statistically significant correlations, that slowly evolving proteins tend to be regulated by more transcription factors, deficient in predicted structural disorder, involved in characteristic biological functions (such as translation), biased in amino acid composition, and are generally more abundant, more essential, and enriched for interaction partners. Many of these results are in agreement with recent studies. In addition, we assess information contribution of different subsets of these protein properties in the task of predicting slowly evolving proteins. 
We employ a logistic regression model on binned data that is able to account for intercorrelation, non-linearity, and heterogeneity within features. Our model considers features both individually and in natural ensembles ("meta-features") in order to assess joint information contribution and degree of contribution independence. Meta-features based on protein abundance and amino acid composition make strong, partially independent contributions to the task of predicting slowly evolving proteins; other meta-features make additional minor contributions. The combination of all meta-features yields predictions comparable to those based on paired species comparisons, and approaching the predictive limit of optimal lineage-insensitive features. Our integrated assessment framework can be readily extended to other correlational analyses at the genome scale. Author Summary Proteins encoded within a given genome are known to evolve at drastically different rates. Through recent large-scale studies, researchers have measured a wide variety of properties for all proteins in yeast. We are interested to know how these properties relate to one another and to what extent they explain evolutionary rate variation. Protein properties are a heterogeneous mix, a factor which complicates research in this area. For example, some properties (e.g., protein abundance) are numerical, while others (e.g., protein function) are descriptive; protein properties may also suffer from noise and hidden redundancies. We have addressed these issues within a flexible and robust statistical framework. We first ranked a large list of protein properties by the strength of their relationships with evolutionary rate; this confirms many known evolutionary relationships and also highlights several new ones. Similar protein properties were then grouped and applied to predict slowly evolving proteins. 
Some of these groups were as effective as paired species comparison in making correct predictions, although in both cases a great deal of evolutionary rate variation remained to be explained. Our work has helped to refine the set of protein properties that researchers should consider as they investigate the mechanisms underlying protein evolution.

Bioinformatics Analysis of Macrophages Exposed to Porphyromonas gingivalis: Implications in Acute vs. Chronic Infections (Public Library of Science, 2010-12-23)
Yu, Wen-Han; Hu, Han; Zhou, Qingde; Xia, Yu; Amar, Salomon
BACKGROUND. Periodontitis is the most common human infection affecting tooth-supporting structures. It was shown to play a role in aggravating atherosclerosis. To deepen our understanding of the pathogenesis of this disease, we exposed human macrophages to an oral bacterium, Porphyromonas gingivalis (P. gingivalis), either as live bacteria or its LPS or fimbria. Microarray data from treated macrophages or control cells were analyzed to define molecular signatures. Changes in genes identified in relevant pathways were validated by RT-PCR. METHODOLOGY/PRINCIPAL FINDINGS. We focused our analysis on three important groups of genes: Group PG (genes differentially expressed by live bacteria only); Group LFG (genes differentially expressed in response to exposure to LPS and/or FimA); Group CG (core gene set jointly activated by all 3 stimulants). A total of 842 macrophage genes were differentially expressed in at least one of the three conditions compared to naïve cells. Using pathway analysis, we found that group CG activates the initial phagocytosis process and induces genes relevant to immune response, whereas group PG can de-activate the phagocytosis process associated with phagosome-lysosome fusion. LFG mostly affected the RIG-I-like receptor signaling pathway. CONCLUSION/SIGNIFICANCE. 
In light of the fact that acute infections involve live bacteria while chronic infections involve a combination of live bacteria and their byproducts, group PG could represent acute P. gingivalis infection while group LFG could represent chronic P. gingivalis infection. Group CG may be associated with core immune pathways, triggered irrespective of the specific stimulants and indispensable to mount an appropriate immune response. Implications in acute vs. chronic infection are discussed.

Many Sequence-Specific Chromatin Modifying Protein-Binding Motifs Show Strong Positional Preferences for Potential Regulatory Regions in the Saccharomyces Cerevisiae Genome (Oxford University Press, 2010-01-04)
Hansen, Loren; Mariño-Ramírez, Leonardo; Landsman, David
Initiation and regulation of gene expression is critically dependent on the binding of transcriptional regulators, which is often temporal and position specific. Many transcriptional regulators recognize and bind specific DNA motifs. The length and degeneracy of these motifs result in their frequent occurrence within the genome, with only a small subset serving as actual binding sites. By occupying potential binding sites, nucleosome placement can specify which sequence motif is available for DNA-binding regulatory factors. Therefore, the specification of nucleosome placement to allow access to transcriptional regulators whenever and wherever required is critical. We show that many DNA-binding motifs in Saccharomyces cerevisiae show a strong positional preference to occur only in potential regulatory regions. Furthermore, using gene ontology enrichment tools, we demonstrate that proteins with binding motifs that show the strongest positional preference also have a tendency to have chromatin-modifying properties and functions. This suggests that some DNA-binding proteins may depend on the distribution of their binding motifs across the genome to assist in the determination of specificity. 
Since many of these DNA-binding proteins have chromatin remodeling properties, they can alter the local nucleosome structure to a more permissive and/or restrictive state, thereby assisting in determining DNA-binding protein specificity.

UniPROBE, Update 2011: Expanded Content and Search Tools in the Online Database of Protein-Binding Microarray Data on Protein–DNA Interactions (Oxford University Press, 2010-10-30)
Robasky, Kimberly; Bulyk, Martha L.
The Universal PBM Resource for Oligonucleotide-Binding Evaluation (UniPROBE) database is a centralized repository of information on the DNA-binding preferences of proteins as determined by universal protein-binding microarray (PBM) technology. Each entry for a protein (or protein complex) in UniPROBE provides the quantitative preferences for all possible nucleotide sequence variants ('words') of length k ('k-mers'), as well as position weight matrix (PWM) and graphical sequence logo representations of the k-mer data. In this update, we describe <130% expansion of the database content, incorporation of a protein BLAST (blastp) tool for finding protein sequence matches in UniPROBE, the introduction of UniPROBE accession numbers and additional database enhancements. The UniPROBE database is available at http://uniprobe.org.

Nucleic Acids Research Annual Web Server Issue in 2009 (2009-7-1)
Benson, Gary

Editorial (Oxford University Press, 2008-7-1)
Benson, Gary

Systematic Variation in mRNA 3′-Processing Signals during Mouse Spermatogenesis (Oxford University Press, 2006-12-08)
Liu, Donglin; Brockman, J. Michael; Dass, Brinda; Hutchins, Lucie N.; Singh, Priyam; McCarrey, John R.; MacDonald, Clinton C.; Graber, Joel H.
Gene expression and processing during mouse male germ cell maturation (spermatogenesis) is highly specialized. Previous reports have suggested that there is a high incidence of alternative 3′-processing in male germ cell mRNAs, including reduced usage of the canonical polyadenylation signal, AAUAAA. 
We used EST libraries generated from mouse testicular cells to identify 3′-processing sites used at various stages of spermatogenesis (spermatogonia, spermatocytes and round spermatids) and testicular somatic Sertoli cells. We assessed differences in 3′-processing characteristics in the testicular samples, compared to control sets of widely used 3′-processing sites. Using a new method for comparison of degenerate regulatory elements between sequence samples, we identified significant changes in the use of putative 3′-processing regulatory sequence elements in all spermatogenic cell types. In addition, we observed a trend towards truncated 3′-untranslated regions (3′-UTRs), with the most significant differences apparent in round spermatids. In contrast, Sertoli cells displayed a much smaller trend towards 3′-UTR truncation and no significant difference in 3′-processing regulatory sequences. Finally, we identified a number of genes encoding mRNAs that were specifically subject to alternative 3′-processing during meiosis and postmeiotic development. Our results highlight developmental differences in polyadenylation site choice and in the elements that likely control them during spermatogenesis.

Predicting Eukaryotic Transcriptional Cooperativity by Bayesian Network Integration of Genome-Wide Data (2009-10)
Wang, Yong; Zhang, Xiang-Sun; Xia, Yu
Transcriptional cooperativity among several transcription factors (TFs) is believed to be the main mechanism of complexity and precision in transcriptional regulatory programs. Here, we present a Bayesian network framework to reconstruct a high-confidence whole-genome map of transcriptional cooperativity in Saccharomyces cerevisiae by integrating a comprehensive list of 15 genomic features. We design a Bayesian network structure to capture the dominant correlations among features and TF cooperativity, and introduce a supervised learning framework with a well-constructed gold-standard dataset. 
This framework allows us to assess the predictive power of each genomic feature, validate the superior performance of our Bayesian network compared to alternative methods, and integrate genomic features for optimal TF cooperativity prediction. Data integration reveals 159 high-confidence predicted cooperative relationships among 105 TFs, most of which are subsequently validated by literature search. The existing and predicted transcriptional cooperativities can be grouped into three categories based on the combination patterns of the genomic features, providing further biological insights into the different types of TF cooperativity. Our methodology is the first supervised learning approach for predicting transcriptional cooperativity, compares favorably to alternative unsupervised methodologies, and can be applied to other genomic data integration tasks where high-quality gold-standard positive data are scarce.

Large-Scale Identification of Genetic Design Strategies Using Local Search (Nature Publishing Group, 2009-08-18)
Lun, Desmond S.; Rockwell, Graham; Guido, Nicholas J.; Baym, Michael; Kelner, Jonathan A.; Berger, Bonnie; Galagan, James E.; Church, George M.
In the past decade, computational methods have been shown to be well suited to unraveling the complex web of metabolic reactions in biological systems. Methods based on flux–balance analysis (FBA) and bi-level optimization have been used to great effect in aiding metabolic engineering. These methods predict the result of genetic manipulations and allow for the best set of manipulations to be found computationally. Bi-level FBA is, however, limited in applicability because the required computational time and resources scale poorly as the size of the metabolic system and the number of genetic manipulations increase. 
To overcome these limitations, we have developed Genetic Design through Local Search (GDLS), a scalable, heuristic, algorithmic method that employs an approach based on local search with multiple search paths, which results in effective, low-complexity search of the space of genetic manipulations. Thus, GDLS is able to find genetic designs with greater in silico production of desired metabolites than can feasibly be found using a globally optimal search and performs favorably in comparison with heuristic searches based on evolutionary algorithms and simulated annealing.

The Interaction Map of Yeast: Terra Incognita? (BioMed Central, 2006-6-8)
Mellor, Joe; DeLisi, Charles
A systematic curation of the literature on Saccharomyces cerevisiae has yielded a comprehensive collection of experimentally observed interactions. This new resource augments current views of the topological structure of yeast's physical and genetic networks, but also reveals that existing studies cover only a fraction of the cell.

Position-Dependent Motif Characterization Using Non-Negative Matrix Factorization (Oxford University Press, 2008-10-13)
Hutchins, Lucie N.; Murphy, Sean M.; Singh, Priyam; Graber, Joel H.
Motivation: Cis-acting regulatory elements are frequently constrained by both sequence content and positioning relative to a functional site, such as a splice or polyadenylation site. We describe an approach to regulatory motif analysis based on non-negative matrix factorization (NMF). Whereas existing pattern recognition algorithms commonly focus primarily on sequence content, our method simultaneously characterizes both positioning and sequence content of putative motifs. Results: Tests on artificially generated sequences show that NMF can faithfully reproduce both positioning and content of test motifs. We show how the variation of the residual sum of squares can be used to give a robust estimate of the number of motifs or patterns in a sequence set. 
Our analysis distinguishes multiple motifs with significant overlap in sequence content and/or positioning. Finally, we demonstrate the use of the NMF approach through characterization of biologically interesting datasets. Specifically, an analysis of mRNA 3′-processing (cleavage and polyadenylation) sites from a broad range of higher eukaryotes reveals a conserved core pattern of three elements. Contact: joel.graber@jax.org Supplementary information: Supplementary data are available at Bioinformatics online.

Small but Versatile: The Extraordinary Functional and Structural Diversity of the β-grasp Fold (BioMed Central, 2007-7-2)
Burroughs, A. Maxwell; Balaji, S.; Iyer, Lakshminarayan M.; Aravind, L.
BACKGROUND. The β-grasp fold (β-GF), prototyped by ubiquitin (UB), has been recruited for a strikingly diverse range of biochemical functions. These functions include providing a scaffold for different enzymatic active sites (e.g. NUDIX phosphohydrolases) and iron-sulfur clusters, RNA-soluble-ligand and co-factor-binding, sulfur transfer, adaptor functions in signaling, assembly of macromolecular complexes and post-translational protein modification. To understand the basis for the functional versatility of this small fold we undertook a comprehensive sequence-structure analysis of the fold and developed a natural classification for its members. RESULTS. As a result we were able to define the core distinguishing features of the fold and numerous elaborations, including several previously unrecognized variants. Systematic analysis of all known interactions of the fold showed that its manifold functional abilities arise primarily from the prominent β-sheet, which provides an exposed surface for diverse interactions or additionally, by forming open barrel-like structures. We show that in the β-GF both enzymatic activities and the binding of diverse co-factors (e.g. 
molybdopterin) have independently evolved on at least three occasions each, and iron-sulfur-cluster-binding on at least two independent occasions. Our analysis identified multiple previously unknown large monophyletic assemblages within the β-GF, including one which unifies versions found in the fasciclin-1 superfamily, the ribosomal protein L25, the phosphoribosyl AMP cyclohydrolase (HisI) and glutamine synthetase. We also uncovered several new groups of β-GF domains including a domain found in bacterial flagellar and fimbrial assembly components, and 5 new UB-like domains in the eukaryotes. CONCLUSION. Evolutionary reconstruction indicates that the β-GF had differentiated into at least 7 distinct lineages by the time of the last universal common ancestor of all extant organisms, encompassing much of the structural diversity observed in extant versions of the fold. The earliest β-GF members were probably involved in RNA metabolism and subsequently radiated into various functional niches. Most of the structural diversification occurred in the prokaryotes, whereas the eukaryotic phase was mainly marked by a specific expansion of the ubiquitin-like β-GF members. The eukaryotic UB superfamily diversified into at least 67 distinct families, of which at least 19–20 families were already present in the eukaryotic common ancestor, including several protein and one lipid conjugated forms. Another key aspect of the eukaryotic phase of evolution of the β-GF was the dramatic increase in domain architectural complexity of proteins related to the expansion of UB-like domains in numerous adaptor roles. REVIEWERS. 
This article was reviewed by Igor Zhulin, Arcady Mushegian and Frank Eisenhaber.

Towards a Holistic, Yet Gene-Centered Analysis of Gene Expression Profiles: A Case Study of Human Lung Cancers (Hindawi Publishing Corporation, 2006-11-2)
Guo, Yuchun; Eichler, Gabriel S.; Feng, Ying; Ingber, Donald E.; Huang, Sui
Genome-wide gene expression profile studies encompass increasingly large numbers of samples, posing a challenge to their presentation and interpretation without losing the notion that each transcriptome constitutes a complex biological entity. Much like pathologists who visually analyze information-rich histological sections as a whole, we propose here an integrative approach. We use a self-organizing-map-based software, the gene expression dynamics inspector (GEDI), to analyze gene expression profiles of various lung tumors. GEDI allows the comparison of tumor profiles based on direct visual detection of transcriptome patterns. Such intuitive "gestalt" perception promotes the discovery of interesting relationships in the absence of an existing hypothesis. We uncovered qualitative relationships between squamous cell tumors, small-cell tumors, and carcinoid tumors that would have escaped existing algorithmic classifications. These results suggest that GEDI may be a valuable explorative tool that combines global and gene-centered analyses of molecular profiles from large-scale microarray experiments.

A Novel Superfamily Containing the β-Grasp Fold Involved in Binding Diverse Soluble Ligands (BioMed Central, 2007-1-24)
Burroughs, A. Maxwell; Balaji, S.; Iyer, Lakshminarayan M.; Aravind, L.
BACKGROUND. Domains containing the β-grasp fold are utilized in a great diversity of physiological functions but their role, if any, in soluble or small molecule ligand recognition is poorly studied. RESULTS. Using sensitive sequence and structure similarity searches we identify a novel superfamily containing the β-grasp fold. 
They are found in a diverse set of proteins that include the animal vitamin B12 uptake proteins transcobalamin and intrinsic factor, the bacterial polysaccharide export proteins, the competence DNA receptor ComEA, the cob(I)alamin generating enzyme PduS and the Nqo1 subunit of the respiratory electron transport chain. We present evidence that members of this superfamily are likely to bind a range of soluble ligands, including B12. There are two major clades within this superfamily, namely the transcobalamin-like clade and the Nqo1-like clade. The former clade is typified by an insert of a β-hairpin after the helix of the β-grasp fold, whereas the latter clade is characterized by an insert between strands 4 and 5 of the core fold. CONCLUSION. Members of both clades within this superfamily are predicted to interact with ligands in a similar spatial location, with their specific inserts playing a role in the process. Both clades are widely represented in bacteria suggesting that this superfamily was derived early in bacterial evolution. The animal lineage appears to have acquired the transcobalamin-like proteins from low GC Gram-positive bacteria, and this might be correlated with the emergence of the ability to utilize B12 produced by gut bacteria. REVIEWERS. This article was reviewed by Andrei Osterman, Igor Zhulin, and Arcady Mushegian.

High-Precision High-Coverage Functional Inference from Integrated Data Sources (BioMed Central, 2008-2-25)
Linghu, Bolan; Snitkin, Evan S.; Holloway, Dustin T.; Gustafson, Adam M.; Xia, Yu; DeLisi, Charles
BACKGROUND. Information obtained from diverse data sources can be combined in a principled manner using various machine learning methods to increase the reliability and range of knowledge about protein function. The result is a weighted functional linkage network (FLN) in which linked neighbors share at least one function with high probability. Precision is, however, low. 
Aiming to provide precise functional annotation for as many proteins as possible, we explore and propose a two-step framework for functional annotation: (1) construction of a high-coverage and reliable FLN via machine learning techniques, and (2) development of a decision rule for the constructed FLN to optimize functional annotation. RESULTS. We first apply this framework to Saccharomyces cerevisiae. In the first step, we demonstrate that four commonly used machine learning methods, Linear SVM, Linear Discriminant Analysis, Naïve Bayes, and Neural Network, all combine heterogeneous data to produce reliable and high-coverage FLNs, in which the linkage weight estimates the functional coupling of linked proteins more accurately than individual data sources used alone. In the second step, empirical tuning of an adjustable decision rule on the constructed FLN reveals that basing annotation on maximum edge weight results in the most precise annotation at high coverage. In particular, at low coverage all rules evaluated perform comparably. At coverage above approximately 50%, however, they diverge rapidly. At full coverage, the maximum weight decision rule still has a precision of approximately 70%, whereas for other methods, precision ranges from a high of slightly more than 30% down to 3%. In addition, a scoring scheme to estimate the precision of individual predictions is also provided. Finally, tests of the robustness of the framework indicate that it can be successfully applied to less studied organisms. CONCLUSION.
We provide a general two-step function-annotation framework, and show that high-coverage, high-precision annotations can be achieved by constructing a high-coverage and reliable FLN via data integration, followed by applying a maximum weight decision rule.
Item EGenBio: A Data Management System for Evolutionary Genomics and Biodiversity (BioMed Central, 2006-9-26) Nahum, Laila A.; Reynolds, Matthew T.; Wang, Zhengyuan O.; Faith, Jeremiah J.; Jonna, Rahul; Jiang, Zhi J.; Meyer, Thomas J.; Pollock, David D.
BACKGROUND. Evolutionary genomics requires management and filtering of large numbers of diverse genomic sequences for accurate analysis and inference on evolutionary processes of genomic and functional change. We developed Evolutionary Genomics and Biodiversity (EGenBio; ) to begin to address this. DESCRIPTION. EGenBio is a system for manipulation and filtering of large numbers of sequences, integrating curated sequence alignments and phylogenetic trees, managing evolutionary analyses, and visualizing their output. EGenBio is organized into three conceptual divisions: Evolution, Genomics, and Biodiversity. The Genomics division includes tools for selecting pre-aligned sequences from different genes and species, and for modifying and filtering these alignments for further analysis. Species searches are handled through queries that can be modified based on a tree-based navigation system and saved. The Biodiversity division contains tools for analyzing individual sequences or sequence alignments, whereas the Evolution division contains tools involving phylogenetic trees. Alignments are annotated with analytical results and modification history using our PRAED format. A miscellaneous Tools section and Help framework are also available. EGenBio was developed around our comparative genomic research and a prototype database of mtDNA genomes. It utilizes MySQL relational databases and dynamic page generation, and calls numerous custom programs. CONCLUSION.
EGenBio was designed to serve as a platform for tools and resources to ease combined analysis in evolution, genomics, and biodiversity.