Identifying the Unknowns: Towards structural elucidation of small molecules using mass spectrometry (IDUN)
Rapid identification of small compounds from small amounts of substance is of interest in many areas of biology and medicine such as metabolomics, bio-prospecting, biomarker discovery, and diagnostics. Today, mass spectrometry (MS) is a key technology for the identification of small molecules. Due to immense technological advances in mass spectrometry over the last years, the amount and complexity of MS data has been growing rapidly. Tandem mass spectrometry offers the potential of identifying small molecules solely from MS fingerprints, but computational analysis of such data is in its infancy, and this is presumed to be one of the major technological hurdles in metabolomics today.
The key objective of this project is to develop computational methods to “identify the unknowns”, that is, compounds that cannot be found in any database. For this, we are developing new computational techniques, models, and algorithms for the interpretation of MS fragmentation data from small molecules. We have developed a BLAST-like search tool for searching fragmentation patterns, called fragmentation tree BLAST (FT-BLAST). The figure shows an excerpt from an FT-BLAST result lists in a leave-one-out evaluation.
Funded by DFG project BO 1910/10.
This project is realized in cooperation with the AG Genome Informatics of Prof. Dr. Jens Stoye at Bielefeld University, Germany.
The order of genes in genomes can be used to determine the function of unknown genes, as well as the phylogenetic history of the organisms. In view of the ever-increasing speed of genome sequencing, there exists a huge amount of data for such studies. On the algorithmic side, though, methods are often based on overly simplified genome models, use heuristics to solve optimization problems, or suffer from long running times.
Gene clusters are sets of genes that occur as single contiguous blocks in several genomes. Unfortunately, the requirement of exact occurrences of gene clusters turns out to be too strict for the biological application. In this project, we are developing models and algorithms for the computation of approximate gene clusters, that combine a formal strictness with applicability to biological data. At the same time, our algorithms must be swift to allow application to the increasing amount of genome data. We are combining methods from combinatorial optimization and algorithmic graph theory with a statistically sound evaluation.
Funded by DFG project BO 1910/8.
Supertree approaches combine input trees with overlapping taxa sets into one large and more comprehensive tree. Here, the problem is dealing with incompatible data in a reasonable way. Supertree methods can be used as part of a divide-and-conquer approach for phylogenetic tree inference: (1) we compute, say, maximum likelihood (ML) trees for several subsets of taxa; (2) we merge the resulting trees into one supertree; (3) ﬁnally, we use this supertree as a seed for an ML analysis on the complete set of taxa. Current supertree methods suffer from being too slow or too inaccurate when analyzing large datasets. We have recently developed a novel supertree called FLIPCUT supertrees, which is very swift and computes supertrees of good quality. In this project, we are further advancing this methods so that supertree quality is on a par with the current supertree “gold standard” Matrix Representation with Parsimony (MRP). At the same time, we want to preserve the extraordinary swiftness of the method. We will then integrate the resulting method into a divide-and-conquer approach for phylogenetic reconstruction.
The figure shows the workflow of the FlipCut supertree algorithm.
Funded by DFG project BO 1910/12.
This project aims at solving NP-hard bioinformatics problems using fixed-parameter algorithmics. Using a careful analysis of the problem structure, exact solutions to several seemingly intractable problems come into reach. The main idea of fixed-parameter algorithms is to confine the exponential part of the running time to a preferably small parameter individually chosen for the problem. Despite longer running times, computing exact solutions to NP-hard problems in bioinformatics can pay off, as it may enable a better analysis of data from expensive and laborious experiments. Biological problems often feature characteristics that can be exploited towards a fixed-parameter algorithm.
The figure shows a parameter-independent data reduction for the Cluster Editing problem on the well-studied Leukemia dataset (Golub et al., Science, 1999).
Funded by DFG project BO 1910/9.