CANOPUS evaluation data

We evaluated CANOPUS on two MS/MS reference datasets: the SVM training dataset, which was also used for training CSI:FingerID (in 10-fold cross-validation), and the Agilent MassHunter library, used as an independent dataset.
The SVM training dataset contains spectra from GNPS, MassBank, and NIST17. As NIST17 is a commercial library, we can only provide the spectra from GNPS and MassBank. The public part of the SVM training dataset can be downloaded here: svm_training_data

For training the deep neural network, we used a subset of PubChem with 1,106,938 structures for which we obtained ClassyFire annotations. The dataset, together with ClassyFire annotations for the evaluation data, can be downloaded here: structures.csv

With CANOPUS, we analyzed data from two biological studies; the mzML and mzXML files are available at MassIVE with the accession numbers MSV000079949 (mice data, Quinn et al 2020) and MSV000081082 (Euphorbia plant data, Ernst et al 2019).

The network visualization of the mice data was done using Cytoscape (Shannon et al 2003). The Cytoscape file can be downloaded here: mice_cytoscape

The source code of CANOPUS is part of the SIRIUS GitHub repository. The scripts we used for analyzing and visualizing the data are also available on GitHub.

Studying the charge migration fragmentation of sodiated molecules in collision-induced fragmentation at the library scale.

We analyzed fragmentation spectra from the NIST 17 MS/MS library to assess the extent of protonated fragments in [M+Na]+ fragmentation spectra. We provide a zip archive containing summary statistics of the data and Python 3 scripts used for plotting and evaluation. Spectra and molecular structures are not provided since NIST 17 is not a public library.

ZODIAC evaluation data

We use five datasets to evaluate ZODIAC performance. The input mzML/mzXML files for the five datasets are available at MassIVE, with the following accession numbers: dendroides (MSV000080502), NIST1950 (MSV000081364), tomato (MSV000081463), diatoms (MSV000081731) and mice stool dataset (MSV000079949).
The code of the OpenMS version used for processing the input mzML files can be found on GitHub. The SIRIUS and ZODIAC Java code can be found on GitHub as well: SIRIUS/ZODIAC library and frontend.
We provide the processed spectrum files, SIRIUS and ZODIAC results here (3.5 GB). This virtual machine (4.5 GB), containing input mzML files, OpenMS, SIRIUS and ZODIAC binaries and processing scripts, can be used to reconstruct (interim) results. To run the virtual machine, use VirtualBox or VMware. The virtual Debian system credentials are: user ‘zodiac’ with password ‘zodiac’. For more details see the ‘README.txt’.
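
As a minimal sketch, the MassIVE accession numbers listed above for the five ZODIAC datasets can be kept in a small lookup table and checked for typos before downloading anything. The format check (the prefix MSV followed by nine digits) is an assumption derived from the accession numbers shown on this page, and the names ZODIAC_DATASETS and is_well_formed are hypothetical helpers, not part of SIRIUS or ZODIAC.

```python
import re

# Accession numbers as listed above for the ZODIAC evaluation datasets.
ZODIAC_DATASETS = {
    "dendroides": "MSV000080502",
    "NIST1950": "MSV000081364",
    "tomato": "MSV000081463",
    "diatoms": "MSV000081731",
    "mice stool": "MSV000079949",
}

# Assumed format: "MSV" followed by exactly nine digits.
ACCESSION_PATTERN = re.compile(r"MSV\d{9}")


def is_well_formed(accession):
    """Return True if the string looks like a complete MassIVE accession."""
    return ACCESSION_PATTERN.fullmatch(accession) is not None


if __name__ == "__main__":
    for name, accession in ZODIAC_DATASETS.items():
        status = "ok" if is_well_formed(accession) else "check for typos"
        print(f"{name}: {accession} ({status})")
```

Such a check only catches malformed identifiers; it does not confirm that a dataset exists at MassIVE.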

SIRIUS 4 evaluation data

To evaluate SIRIUS 4, we replicate the exact evaluation setup from the Dührkop et al. (PNAS 2015) paper, using the freely-available GNPS compounds, see also below. This zip file contains all mass spectra, as well as the cross-validation fold information, to allow an exact replication of the setup when evaluating other tools (data license CC0 1.0).
We also evaluate SIRIUS 4 using data from the CASMI 2016 challenge. We provide the data in SIRIUS readable file format: This zip file contains all mass spectra (MS1 and MS/MS, positive and negative ion mode) used in the evaluation (data license CC BY; authors Martin Krauss, Emma L. Schymanski, Cindy Weidauer, and Hubert Schupke; organizations UFZ and EAWAG). This data is also available from MassBank.

In two case studies, SIRIUS 4 is able to identify compounds that remained unknown when searching spectral libraries. From the American Gut project, N-3-OH-palmitoyl (m/z of 387.32) was identified from this LC/MS/MS run. From the clothing with antibacterial properties study, 14,15-Leukotriene E4 (m/z of 440.246) was identified from this LC/MS/MS run. All raw data from these studies is available from MassIVE. The American Gut project datasets are MSV000080186 and MSV00008018; the clothing study dataset is MSV000081379.

Maximum Colorful Subtree instances

Unfortunately, the instances require too much storage to make them downloadable (24 GByte compressed). It is possible to generate the instances “on the fly” from the mass spectrometry data, which requires much less memory, but this is currently not supported by SIRIUS. If you want access to the instances to try your own algorithms, please contact Kai or Sebastian!


Kai Dührkop, Marie Anne Lataretu, W. Timothy J. White and Sebastian Böcker
Heuristic algorithms for the Maximum Colorful Subtree problem.
Technical report, arXiv, 2018.


This is the data used to evaluate the decoy database approach and FDR calculation.

GNPS: contains the raw GNPS files as well as decoy databases with ConditionalPeaks and RandomPeaks based on the raw data.

GNPS noise filtered: contains the noise-filtered GNPS files and trees as well as decoy databases with RandomPeaks, ConditionalPeaks and Reroot based on the noise-filtered data.

MassBank query: contains the MassBank Orbitrap query spectra.

Search results: contains an example search of MassBank spectra against noise-filtered GNPS (MassbankOrbi-Gnps.txt), a search of MassBank spectra against a noise-filtered decoy database (MassbankOrbi-GnpsDecoyConditionalPeaks.txt), as well as the q-values for the respective TDA (MassbankOrbi-Gnps_qValues_TDA_RandomPeaks.txt) and EBA (MassbankOrbi-GnpsDecoyConditionalPeaks.txt).
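
To illustrate what the q-value files contain, here is a minimal sketch of how q-values can be derived with a generic target-decoy approach (TDA) from two lists of search scores. It assumes that higher scores are better and that the decoy database has the same size as the target database. The function name tda_q_values is hypothetical, and the sketch is not the code that produced the files above, whose format is not described here.

```python
def tda_q_values(target_scores, decoy_scores):
    """Return one q-value per target score, in the order of the input."""
    # Merge target and decoy hits; remember the input position of each target.
    hits = [(score, False, i) for i, score in enumerate(target_scores)]
    hits += [(score, True, -1) for score in decoy_scores]
    # Sort by decreasing score; on ties decoys come first (conservative).
    hits.sort(key=lambda hit: (-hit[0], not hit[1]))

    # Estimated FDR when accepting every hit down to the current one:
    # number of decoys divided by number of targets seen so far.
    fdrs = []
    n_targets = n_decoys = 0
    for _, is_decoy, _ in hits:
        if is_decoy:
            n_decoys += 1
        else:
            n_targets += 1
        fdrs.append(min(1.0, n_decoys / max(n_targets, 1)))

    # The q-value of a hit is the smallest FDR at which it is still accepted,
    # i.e. the running minimum of the FDR taken from the worst hit upwards.
    q_values = [0.0] * len(target_scores)
    running_min = 1.0
    for (_, is_decoy, index), fdr in zip(reversed(hits), reversed(fdrs)):
        running_min = min(running_min, fdr)
        if not is_decoy:
            q_values[index] = running_min
    return q_values


if __name__ == "__main__":
    print(tda_q_values([0.9, 0.8, 0.4], [0.5, 0.1]))
```

The running minimum makes the q-values monotone in the score, so a better-scoring hit never receives a larger q-value than a worse one.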

Uncommon element prediction

This is the data used in [1] to evaluate the classifiers for predicting the presence of uncommon elements in unknown biomolecules.

The myxo data set consists of 88 isotope patterns measured on a Bruker MaXis 2G qTOF spectrometer (Bremen, Germany). You can download the original spectra as well as the filtered isotope patterns which served as input for our method. The filtered spectra contain mass, intensity, resolution and FWHM information.

Further, we provide the simulated isotope patterns which were used for evaluation. Patterns were generated twice: with a standard noise level and with a high noise level. Files are named according to <classifier><pattern length><pos/neg>.txt, where pos/neg indicates whether the file contains positive or negative examples for the respective classifier (not an ionization mode).


[1] M. Meusel, F. Hufsky, F. Panter, D. Krug, R. Müller, and S. Böcker
Predicting the presence of uncommon elements in unknown biomolecules from isotope patterns
Anal Chem, published ahead of print, 2016, DOI: 10.1021/acs.analchem.6b01015

CSI:FingerID evaluation data

This is the data used in [1] to evaluate the performance of tools for structure identification using structure databases and tandem MS. We used spectra from MassBank and GNPS for evaluation and training and PubChem for searching. Please keep in mind that these databases are updated frequently. Therefore, we provide the IDs of the spectra and compounds used at the time of our evaluation.

PubChem candidate lists, Evaluation data IDs, crossvalidation IDs
Result lists for GNPS and MassBank known/novel

For MassBank and GNPS we also provide the spectra files used in our evaluation. We preprocessed these spectra as described in [1]. You can always refer to the GNPS and MassBank databases for the original spectra. Please mention and cite MassBank and GNPS when using these data.

CSI:FingerID evaluation data

Here we provide the CASMI 2016 challenges in SIRIUS readable ms format:

CASMI 2016 data


[1]  Kai Dührkop, Huibin Shen, Marvin Meusel, Juho Rousu, and Sebastian Böcker.
Searching molecular structure databases with tandem mass spectra using CSI:FingerID
PNAS, 2015; published ahead of print September 21, 2015, doi:10.1073/pnas.1509788112

BCD Beam Search evaluation data

The data we have used to evaluate the BCD Beam Search algorithm is available at


[1] Markus Fleischauer and Sebastian Böcker.
BCD Beam Search: Considering suboptimal partial solutions in Bad Clade Deletion supertrees.
in review

Bad Clade Deletion Supertrees evaluation data

This is the data that we have used to evaluate BCD Supertrees.

Simulated datasets (alignments, input trees and model trees):
Results (supertrees, combined analysis trees) for simulated datasets:
Results and input data for biological datasets


[1] Markus Fleischauer and Sebastian Böcker.
Bad Clade Deletion Supertrees: A Fast and Accurate Supertree Algorithm.
Mol Biol Evol, 34:2408-2421, 2017

Greedy Strict Consensus Merger evaluation data

This is the data that we have used to evaluate the different GSCM scorings and implementations.

SMIDGenOG and SMIDGen dataset including input and model trees:
GSCM results (supertrees) for SMIDGenOG and SMIDGen dataset:
All scores, rates and stats as CSV:


[1] Markus Fleischauer and Sebastian Böcker.
Collecting reliable clades using the Greedy Strict Consensus Merger.
PeerJ (2016) 4:e2172

GC-MS EI Fragmentation Spectra

Rapid identification of small compounds from small amounts of substance is of interest in many areas of biology and medicine. Mass spectrometry (MS) coupled with gas chromatography (GC–MS) is a key technology for the identification of small molecules. We have presented a novel computational method for the de novo interpretation of high-resolution EI fragmentation data of small molecules that cannot be found in any database, not even a structural one. The resulting fragmentation trees explain relevant fragmentation reactions and assign molecular formulas to fragments. The method enables the identification of the molecular ion and the molecular formula of a metabolite. We evaluated the method on a selection of 50 derivatized and underivatized metabolites that can be downloaded here. For this dataset, please cite:


Franziska Hufsky, Martin Rempt, Florian Rasche, Georg Pohnert and Sebastian Böcker
De Novo Analysis of Electron Impact Mass Spectra Using Fragmentation Trees.
Anal Chim Acta, 739:67-76, 2012.

LC Tandem MS Fragmentation Spectra

In principle, tandem mass spectrometry allows us to identify “unknown” small molecules not in any database. Fragmentation trees have recently been introduced for the automated analysis of the fragmentation patterns of small molecules. We have presented a method for the automated comparison of such fragmentation patterns, based on aligning the compounds’ fragmentation trees. We clustered compounds based solely on their fragmentation patterns. We then presented a tool for searching a database for compounds whose fragmentation patterns are similar to that of an unknown sample compound.

The method has been evaluated on different datasets, many of which can be downloaded from MassBank: The Orbitrap dataset has accession numbers CE000001 to CE000694. Some of the spectra have previously been used for evaluation in [2]. If you use this dataset, please cite [2] (accession numbers CE000001 to CE000193) and [1] (accession numbers CE000194 to CE000694).

The QStar dataset has been uploaded to MassBank but, unfortunately, this was done as part of a larger batch upload. As a result, the corresponding spectra can be found among accession numbers PB000001 to PB000999. If you use this dataset, please cite [2].

The MassBank dataset used in [1] has accession numbers PR100001 to PR101056.
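
The accession ranges above can be turned into a small lookup that returns the reference to cite for a given MassBank accession number. This is a minimal sketch: the function name reference_for and the table CITATION_RANGES are hypothetical, and the PB range is only an upper bound, since the QStar spectra are merely among those accession numbers.

```python
import re

# (prefix, first number, last number, reference), taken from the text above.
CITATION_RANGES = [
    ("CE", 1, 193, "[2]"),          # Orbitrap spectra previously used in [2]
    ("CE", 194, 694, "[1]"),        # remaining Orbitrap spectra
    ("PB", 1, 999, "[2]"),          # QStar spectra are among these accessions
    ("PR", 100001, 101056, "[1]"),  # MassBank dataset used in [1]
]


def reference_for(accession):
    """Return "[1]", "[2]", or None if the accession is outside all ranges."""
    match = re.fullmatch(r"([A-Z]{2})(\d{6})", accession)
    if match is None:
        raise ValueError(f"not a MassBank-style accession: {accession!r}")
    prefix, number = match.group(1), int(match.group(2))
    for range_prefix, first, last, reference in CITATION_RANGES:
        if prefix == range_prefix and first <= number <= last:
            return reference
    return None


if __name__ == "__main__":
    for accession in ("CE000193", "CE000194", "PR100500"):
        print(accession, reference_for(accession))
```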


[1] Florian Rasche, Kerstin Scheubert, Franziska Hufsky, Thomas Zichner, Marco Kai, Aleš Svatoš and Sebastian Böcker.
Identifying the unknowns by aligning fragmentation trees.
Anal Chem, 84(7):3417-3426, 2012.

[2] Florian Rasche, Aleš Svatoš, Ravi Kumar Maddula, Christoph Böttcher and Sebastian Böcker
Computing fragmentation trees from tandem mass spectrometry data.
Anal Chem, 83(4):1243-1251, 2011.

Gene Cluster

Genes occurring co-localized in multiple genomes can be strong indicators for either functional constraints on the genome organization or remnant ancestral gene order. These conserved patterns are usually referred to as gene clusters.
We have presented a method to efficiently compute gene clusters in hundreds of genomes and estimate the significance of the computed gene cluster predictions. To evaluate our software, we generated a dataset of 678 genomes from RefSeq. The contained genes were grouped into gene families based on clusters of orthologous groups (COG) and non-supervised orthologous groups (NOG) taken from the STRING database. The combined data used in [1] in the Gecko3 input format can be downloaded here. Computed gene clusters from [1] are available for default and relaxed parameters.


[1] Sascha Winter, Katharina Jahn, Stefanie Wehner, Leon Kuchenbecker, Jens Stoye, Manja Marz, Sebastian Böcker
Finding approximate gene clusters with Gecko3.
To be published.

[2] Katharina Jahn, Sascha Winter, Jens Stoye and Sebastian Böcker
Statistics for approximate gene clusters.
BMC Bioinformatics, 14(Suppl 15):S14, 2013. Proc. of RECOMB Satellite Workshop on Comparative Genomics (RECOMB-CG 2013).

Center Strings

Finding the center string is a classical computer science problem with important applications in computational biology. We focus on exact methods that are also swift in application and present an advanced preprocessing and a new iterative search strategy.
The data set used for evaluation can be downloaded here. For this dataset, please cite:


Franziska Hufsky, Léon Kuchenbecker, Katharina Jahn, Jens Stoye, and Sebastian Böcker.
Swiftly computing center strings.
BMC Bioinformatics, 12:106, 2011.

Cluster Editing evaluation data

This is the data that we have used to evaluate our Cluster Editing software.

Weighted graphs of size up to 100 derived from COG protein similarity data: (26 MB). Please cite [1, 2].

A set of randomly generated weighted graphs: (980 kB). Please cite [1-4].

A set of randomly generated unweighted graphs: (196 MB) and (186 MB). Please cite [5-6].

Input graphs after parameter-independent data reduction: (24 MB). Please cite [5-6].


[1] Sven Rahmann, Tobias Wittkop, Jan Baumbach, Marcel Martin, Anke Truss, and Sebastian Böcker.
Exact and Heuristic Algorithms for Weighted Cluster Editing.
Proc. of Computational Systems Bioinformatics (CSB 2007), volume 6, pages 391-401, 2007.

[2] Sebastian Böcker, Sebastian Briesemeister, Quang Bao Anh Bui, and Anke Truss.
A fixed-parameter approach for Weighted Cluster Editing.
Proc. of Asia-Pacific Bioinformatics Conference (APBC 2008), volume 5, pages 211-220 of Series on Advances in Bioinformatics and Computational Biology, Imperial College Press, 2008.

[3] Sebastian Böcker, Sebastian Briesemeister, Quang Bao Anh Bui, and Anke Truss
Going Weighted: Parameterized Algorithms for Cluster Editing.
Proc. of Conference on Combinatorial Optimization and Applications (COCOA 2008), volume 5165, pages 1-12 of Lect. Notes Comput. Sc., Springer, 2008.

[4] Sebastian Böcker, Sebastian Briesemeister, Quang Bao Anh Bui and Anke Truss
Going Weighted: Parameterized Algorithms for Cluster Editing.
Theor Comput Sci, 410(52):5467-5480, 2009.

[5] Sebastian Böcker, Sebastian Briesemeister, and Gunnar W. Klau
Exact Algorithms for Cluster Editing: Evaluation and Experiments.
Proc. of Workshop on Experimental Algorithms (WEA 2008), volume 5038, pages 289-302 of Lect. Notes Comput. Sc., Springer, 2008.

[6] Sebastian Böcker, Sebastian Briesemeister and Gunnar W. Klau
Exact Algorithms for Cluster Editing: Evaluation and Experiments.
Algorithmica, 60(2):316-334, 2011.

Faster mass decomposition evaluation data

This is the data that we have used to evaluate our mass decomposition algorithm. (Update 29.01.2015: Made separate files for each parameter configuration).

Orbitrap, Eawag, Hill, MassBank (2 MB). Please cite [1].


[1] Kai Dührkop, Marcus Ludwig, Marvin Meusel and Sebastian Böcker.
Faster mass decomposition.
In Proc. of Workshop on Algorithms in Bioinformatics (WABI 2013), volume 8126 of Lect Notes Comput Sci, pages 45-58. Springer, Berlin, 2013.