Data

Passatutto

This is the data used to evaluate the decoy database approach and FDR calculation.

GNPS: contains the raw GNPS files as well as decoy databases with ConditionalPeaks and RandomPeaks based on the raw data.

GNPS noise filtered: contains the noise-filtered GNPS files and trees as well as decoy databases with RandomPeaks, ConditionalPeaks and Reroot based on the noise-filtered data.

Massbank query: contains the Massbank Orbitrap query spectra.

Search results: contains an example of a search for Massbank spectra vs. noise-filtered GNPS (MassbankOrbi-Gnps.txt), a search for Massbank spectra vs. a noise-filtered decoy database (MassbankOrbi-GnpsDecoyConditionalPeaks.txt), as well as the q-values for the respective TDA (MassbankOrbi-Gnps_qValues_TDA_RandomPeaks.txt) and EBA (MassbankOrbi-GnpsDecoyConditionalPeaks.txt).
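For orientation, q-values in a target-decoy approach (TDA) are commonly estimated by ranking target and decoy hits together by score. The following is a minimal, generic sketch, assuming that higher scores are better and that target and decoy databases have the same size. The function name and the input layout are hypothetical; they do not reflect the format of the files listed above or the exact procedure used to produce them.

```python
def tda_q_values(hits):
    """Estimate q-values for target hits from a combined target/decoy ranking.

    hits: list of (score, is_decoy) tuples; a higher score means a better match.
    Returns a list of (score, q_value) for the target hits, best score first.
    Assumption: ties between scores are not treated specially.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)

    targets = 0
    decoys = 0
    fdrs = []  # (score, estimated FDR when accepting all hits down to this score)
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            fdrs.append((score, decoys / targets))

    # The q-value of a hit is the smallest FDR at which it is still accepted,
    # i.e. the minimum FDR over this and all lower-scoring thresholds.
    q_values = []
    running_min = 1.0  # also caps the estimate at 1
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append((score, running_min))
    q_values.reverse()
    return q_values
```

Taking the running minimum from the bottom of the list makes the q-values monotone in the score, which the raw FDR estimates are not guaranteed to be.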

Uncommon element prediction

This is the data used in [1] to evaluate the classifiers for predicting the presence of uncommon elements in unknown biomolecules.

The myxo data set consists of 88 isotope patterns measured on a Bruker MaXis 2G qTOF spectrometer (Bremen, Germany). You can download the original spectra as well as the filtered isotope patterns which served as input for our method. The filtered spectra contain mass, intensity, resolution and FWHM information.

Further, we provide the simulated isotope patterns which were used for evaluation. Patterns were generated twice: with a standard noise level and with a high noise level. Files are named according to <classifier><pattern length><pos/neg>.txt, where pos/neg indicates whether the file contains positive or negative examples for the respective classifier (not an ionization mode).
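The naming scheme can be split into its three parts with a regular expression. This is only a sketch: it assumes that the classifier name contains no digits and that the pattern length is a plain integer, and the file names used in the usage note are hypothetical, not taken from the actual archive.

```python
import re

# <classifier><pattern length><pos/neg>.txt
# Assumption: classifier name has no digits, pattern length is an integer.
FILENAME_PATTERN = re.compile(
    r"^(?P<classifier>\D+)(?P<length>\d+)(?P<label>pos|neg)\.txt$"
)


def parse_pattern_filename(name):
    """Return (classifier, pattern_length, is_positive) for a file name."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError("file name does not follow the naming scheme: " + name)
    return (
        match.group("classifier"),
        int(match.group("length")),
        match.group("label") == "pos",
    )
```

For a hypothetical file named S5pos.txt, the function would return the classifier "S", the pattern length 5, and True for positive examples.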

References

[1] M. Meusel, F. Hufsky, F. Panter, D. Krug, R. Müller, and S. Böcker
Predicting the presence of uncommon elements in unknown biomolecules from isotope patterns
Anal Chem, published ahead of print, 2016, DOI: 10.1021/acs.analchem.6b01015

CSI:FingerId evaluation data

This is the data used in [1] to evaluate the performance of tools for structure identification using structure databases and tandem MS. We used spectra from MassBank and GNPS for evaluation and training, and PubChem for searching. Please note that these databases are updated frequently. Therefore, we provide the IDs of the spectra and compounds used at the time of our evaluation.

PubChem candidate lists, Evaluation data IDs, crossvalidation IDs
Result lists for GNPS and MassBank known/novel

For MassBank and GNPS we also provide the spectra files used in our evaluation. We preprocessed these spectra as described in [1]. You can always refer to the GNPS and MassBank databases for the original spectra. Please mention and cite MassBank and GNPS when using these data.

CSI:FingerId evaluation data (spectra)

Here we provide the CASMI 2016 challenges in the SIRIUS-readable ms format:

CASMI 2016

References

[1] Kai Dührkop, Huibin Shen, Marvin Meusel, Juho Rousu, and Sebastian Böcker.
Searching molecular structure databases with tandem mass spectra using CSI:FingerID
PNAS 2015; published ahead of print September 21, 2015, doi:10.1073/pnas.1509788112

Bad Clade Deletion Supertrees evaluation data

This is the data that we have used to evaluate BCD Supertrees.

Simulated datasets (alignments, input trees and model trees):
2017-BCD-simulated-datasets
Results (supertrees, combined analysis trees) for simulated datasets:
2017-BCD-simulated-results
Results and input data for biological datasets:
2017-BCDSupertrees-biological-datasets+results

References

[1] Markus Fleischauer and Sebastian Böcker.
Bad Clade Deletion Supertrees
under review

Greedy Strict Consensus Merger evaluation data

This is the data that we have used to evaluate the different GSCM scorings and implementations.

SMIDGenOG and SMIDGen datasets including input and model trees:
2016-GSCM-peerJ-datasets
GSCM results (supertrees) for SMIDGenOG and SMIDGen datasets:
2016-GSCM-peerJ-supertrees
All scores, rates and stats as CSV:
2016-GSCM-peerJ-scores

References

[1] Markus Fleischauer and Sebastian Böcker.
Collecting reliable clades using the Greedy Strict Consensus Merger.
PeerJ (2016) 4:e2172 https://doi.org/10.7717/peerj.2172

GC-MS EI Fragmentation Spectra

Rapid identification of small compounds from small amounts of substance is of interest in many areas of biology and medicine. Mass spectrometry (MS) coupled with gas chromatography (GC–MS) is a key technology for the identification of small molecules. We have presented a novel computational method for the de novo interpretation of high-resolution EI fragmentation data of small molecules that cannot be found in any database, not even a structural one. The resulting fragmentation trees explain relevant fragmentation reactions and assign molecular formulas to fragments. The method enables the identification of the molecular ion and the molecular formula of a metabolite. We evaluated the method on a selection of 50 derivatized and underivatized metabolites that can be downloaded here. For this dataset, please cite:

References

Franziska Hufsky, Martin Rempt, Florian Rasche, Georg Pohnert and Sebastian Böcker
De Novo Analysis of Electron Impact Mass Spectra Using Fragmentation Trees.
Anal Chim Acta, 739:67-76, 2012.

LC Tandem MS Fragmentation Spectra

In principle, tandem mass spectrometry allows us to identify “unknown” small molecules that are not in any database. Fragmentation trees have recently been introduced for the automated analysis of the fragmentation patterns of small molecules. We have presented a method for the automated comparison of such fragmentation patterns, based on aligning the compounds’ fragmentation trees. We clustered compounds based solely on their fragmentation patterns. We then presented a tool for searching a database for compounds whose fragmentation patterns are similar to that of an unknown sample compound.

The method has been evaluated on different datasets, many of which can be downloaded from MassBank: The Orbitrap dataset has accession numbers CE000001 to CE000694. Some of the spectra have previously been used for evaluation in [2]. If you use this dataset, please cite [2] (accession numbers CE000001 to CE000193) and [1] (accession numbers CE000194 to CE000694).

The QStar dataset has been uploaded to MassBank but, unfortunately, this was done as part of a larger batch upload. As a result, the corresponding spectra can only be found among accession numbers PB000001 to PB000999. If you use this dataset, please cite [2].

The MassBank dataset used in [1] has accession numbers PR100001 to PR101056.
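The accession ranges above can be turned into a small lookup that returns the associated reference for a MassBank accession number. This is a sketch only: the function and table names are hypothetical, the accession format (two letters followed by six digits) is inferred from the numbers quoted above, and for the PR range the text only states that the dataset was used in [1]. Note also that the PB range covers a larger batch upload, so not every PB accession belongs to the QStar dataset.

```python
import re

# (prefix, first number, last number, reference), taken from the text above.
REFERENCE_RANGES = [
    ("CE", 1, 193, "[2]"),           # Orbitrap spectra previously used in [2]
    ("CE", 194, 694, "[1]"),         # remaining Orbitrap spectra
    ("PB", 1, 999, "[2]"),           # batch upload that contains the QStar spectra
    ("PR", 100001, 101056, "[1]"),   # MassBank dataset used in [1]
]


def reference_for(accession):
    """Return the reference label for an accession, or None if out of range."""
    match = re.fullmatch(r"([A-Z]{2})(\d{6})", accession)
    if match is None:
        return None
    prefix = match.group(1)
    number = int(match.group(2))
    for range_prefix, first, last, reference in REFERENCE_RANGES:
        if prefix == range_prefix and first <= number <= last:
            return reference
    return None
```

For example, CE000193 falls in the first range and maps to [2], while CE000194 maps to [1].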

References

[1] Florian Rasche, Kerstin Scheubert, Franziska Hufsky, Thomas Zichner, Marco Kai, Aleš Svatoš and Sebastian Böcker.
Identifying the unknowns by aligning fragmentation trees.
Anal Chem, 84(7):3417-3426, 2012.

[2] Florian Rasche, Aleš Svatoš, Ravi Kumar Maddula, Christoph Böttcher and Sebastian Böcker
Computing fragmentation trees from tandem mass spectrometry data.
Anal Chem, 83(4):1243-1251, 2011.

Gene Cluster

Genes occurring co-localized in multiple genomes can be strong indicators for either functional constraints on the genome organization or remnant ancestral gene order. These conserved patterns are usually referred to as gene clusters.
We have presented a method to efficiently compute gene clusters in hundreds of genomes and estimate the significance of the computed gene cluster predictions. To evaluate our software, we generated a dataset of 678 genomes from RefSeq. The contained genes were grouped into gene families based on clusters of orthologous groups (COG) and non-supervised orthologous groups (NOG) taken from the STRING database. The combined data used in [1] in the Gecko3 input format can be downloaded here. Computed gene clusters from [1] are available for default and relaxed parameters.

References

[1] Sascha Winter, Katharina Jahn, Stefanie Wehner, Leon Kuchenbecker, Jens Stoye, Manja Marz, Sebastian Böcker
Finding approximate gene clusters with Gecko3.
To be published.

[2] Katharina Jahn, Sascha Winter, Jens Stoye and Sebastian Böcker
Statistics for approximate gene clusters.
BMC Bioinformatics, 14(Suppl 15):S14, 2013. Proc. of RECOMB Satellite Workshop on Comparative Genomics (RECOMB-CG 2013).

Center Strings

Finding the center string is a classical computer science problem with important applications in computational biology. We focus on exact methods that are also swift in application and present an advanced preprocessing and a new iterative search strategy.
The data set used for evaluation can be downloaded here. For this dataset, please cite:

References

Franziska Hufsky, Léon Kuchenbecker, Katharina Jahn, Jens Stoye, and Sebastian Böcker.
Swiftly computing center strings.
BMC Bioinformatics, 12:106, 2011.

Cluster Editing evaluation data

This is the data that we have used to evaluate our Cluster Editing software.

Weighted graphs of size up to 100 derived from COG protein similarity data:
biological_bielefeld.zip (26 MB). Please cite [1, 2].

A set of randomly generated weighted graphs:
artificial_bielefeld_v_10to100.zip (980 kB). Please cite [1-4].

A set of randomly generated unweighted graphs:
samples_v_100to2000_k_025vto25v.zip (196 MB) and samples_with_k_2000_to_4000.zip (186 MB). Please cite [5-6].

Input graphs after parameter-independent data reduction:
reduced_samples_v_100to1000_k_05to10v.zip (24 MB). Please cite [5-6].

References

[1] Sven Rahmann, Tobias Wittkop, Jan Baumbach, Marcel Martin, Anke Truss, and Sebastian Böcker.
Exact and Heuristic Algorithms for Weighted Cluster Editing.
Proc. of Computational Systems Bioinformatics (CSB 2007), volume 6, pages 391-401, 2007.

[2] Sebastian Böcker, Sebastian Briesemeister, Quang Bao Anh Bui, and Anke Truss.
A fixed-parameter approach for Weighted Cluster Editing.
Proc. of Asia-Pacific Bioinformatics Conference (APBC 2008), volume 5, pages 211-220 of Series on Advances in Bioinformatics and Computational Biology, Imperial College Press, 2008.

[3] Sebastian Böcker, Sebastian Briesemeister, Quang Bao Anh Bui, and Anke Truss
Going Weighted: Parameterized Algorithms for Cluster Editing.
Proc. of Conference on Combinatorial Optimization and Applications (COCOA 2008), volume 5165, pages 1-12 of Lect. Notes Comput. Sc., Springer, 2008.

[4] Sebastian Böcker, Sebastian Briesemeister, Quang Bao Anh Bui and Anke Truss
Going Weighted: Parameterized Algorithms for Cluster Editing.
Theor Comput Sci, 410(52):5467-5480, 2009.

[5] Sebastian Böcker, Sebastian Briesemeister, and Gunnar W. Klau
Exact Algorithms for Cluster Editing: Evaluation and Experiments.
Proc. of Workshop on Experimental Algorithms (WEA 2008), volume 5038, pages 289-302 of Lect. Notes Comput. Sc., Springer, 2008.

[6] Sebastian Böcker, Sebastian Briesemeister and Gunnar W. Klau
Exact Algorithms for Cluster Editing: Evaluation and Experiments.
Algorithmica, 60(2):316-334, 2011.

Faster mass decomposition evaluation data

This is the data that we have used to evaluate our mass decomposition algorithm. (Update 29.01.2015: Made separate files for each parameter configuration).

Orbitrap, Eawag, Hill, MassBank (2 MB). Please cite [1].

References

[1] Kai Dührkop, Marcus Ludwig, Marvin Meusel and Sebastian Böcker.
Faster mass decomposition.
In Proc. of Workshop on Algorithms in Bioinformatics (WABI 2013), volume 8126 of Lect Notes Comput Sci, pages 45-58. Springer, Berlin, 2013.