Classes for the masses: CANOPUS has appeared in Nature Biotechnology

Our article “Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra” has just appeared in Nature Biotechnology. Congrats to Kai and all co-authors!

In short: CANOPUS is a computational tool for systematic compound class annotation. It uses a deep neural network to predict 2,497 compound classes from fragmentation spectra, including all biologically relevant classes. From the machine learning perspective, the interesting part is that different levels of the neural network are trained using different data (heterogeneous training). CANOPUS explicitly targets compounds for which neither spectral nor structural reference data are available, and even predicts classes completely lacking tandem mass spectrometry training data. In evaluation using reference data, CANOPUS reached very high prediction performance (average accuracy of 99.7% in cross-validation) and outperformed four (rather advanced) baseline methods. We used CANOPUS to investigate the effect of microbial colonization in the mouse digestive system, to analyze the chemodiversity of different Euphorbia plants, and to elucidate the structure of a novel marine natural product.

CANOPUS is already available to users through SIRIUS 4.5, which was released last Thursday. See also the designated CANOPUS page. A view-only version of the article is available here.

Full citation: K. Dührkop, L.-F. Nothias, M. Fleischauer, R. Reher, M. Ludwig, M. A. Hoffmann, D. Petras, W. H. Gerwick, J. Rousu, P. C. Dorrestein, and S. Böcker. Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nat Biotechnol, 2020. https://doi.org/10.1038/s41587-020-0740-8

 

SIRIUS 4.5 released

We are happy to announce that a new version of SIRIUS is available. With that, CANOPUS now supports negative ion mode data. Additionally, we included more structure databases CSI:FingerID can search in, such as COCONUT (Sorokina & Steinbeck, 2020) and NORMAN (Brack et al., 2012). And in case an important database is missing: With the new version, you can import custom databases using the GUI.

Even more features:

  • All molecular structures have been standardized using the PubChem standardization service, to make structures more consistent. This update was already reported for version 4.4 but kept bugging us; it should now be solved for good. The standardization has a (small but measurable) positive impact on CSI:FingerID’s performance. More importantly, you will find fewer cases where CSI:FingerID is doing “something really strange”; this strange behavior was often due to un-standardized structures.
  • Breaking news: We renamed a few columns in the SIRIUS project space (see Changelog), to make column names more descriptive. Sorry about that; please make sure your downstream analysis is reading in the right columns.
  • CSI:FingerID now uses the molecular formula-specific Bayesian network scoring from our ISMB 2018 publication. Integrating this new score was a huge effort, but again has a positive impact on CSI:FingerID’s performance.
  • To allow for a smooth transition, you can continue to use SIRIUS 4.4 and the corresponding CSI:FingerID web service until November 30.
  • Please help us to make SIRIUS great again: Report bugs using the SIRIUS GitHub repository, or send us an email.

 

Qemistree has appeared in Nature Chemical Biology

Congratulations to Anupriya Tripathi from the group of Pieter Dorrestein: The article “Chemically informed analyses of metabolomics mass spectrometry data with Qemistree” has appeared in Nature Chemical Biology, and we are happy to be part of this research.

In short, Qemistree is a data exploration strategy based on the hierarchical organization of molecular fingerprints predicted from fragmentation spectra. Qemistree allows mass spectrometry data to be represented in the context of sample metadata and chemical ontologies.

Full citation: A. Tripathi, Y. Vázquez-Baeza, J. M. Gauglitz, M. Wang, K. Dührkop, M. Nothias-Esposito, D. D. Acharya, M. Ernst, J. J. J. van der Hooft, Q. Zhu, D. McDonald, A. D. Brejnrod, A. Gonzalez, J. Handelsman, M. Fleischauer, M. Ludwig, S. Böcker, L.-F. Nothias, R. Knight, and P. C. Dorrestein. Chemically informed analyses of metabolomics mass spectrometry data with Qemistree. Nat Chem Biol, 2020.

 

Teaching in the winter semester 2020/21

Corona still has a grip on us in the winter semester, so most courses will be held online. Here are a few details on what to expect (note that this news item is a sticky note; we will update it as more information becomes available):

  • Einführung in die Bioinformatik I/1: The lecture is given by Peter Dittrich; the video files will be made available for download, and you can listen to or watch them whenever it suits you. The tutorial with Sebastian Böcker takes place weekly on Tuesdays from 10:15 to 11:45; it is also offered online, but live (Zoom session). The two exercise classes run in parallel, weekly on Wednesdays from 14:15 to 15:45. These are in-person sessions, held in Hörsaal 1 of the Abbeanum and in Seminarraum 104 at August-Bebel-Str. 4. (Edit: It currently looks like we will be able to hold the exercise classes in person.) The exercise instructors are Marcus Ludwig and Emanuel Barth. We can assign students to the two groups in the first tutorial. The course starts with the tutorial on Tuesday, November 3.
  • Algorithmische Massenspektrometrie: Lecture and tutorial with Sebastian Böcker. The lecture will be made available for download as video files; the tutorial is online but live (Zoom session), weekly on Mondays from 12:30 to 14:00. The exercise class is weekly on Thursdays from 12:00 to 14:00, online but live (Zoom session); the exercise instructor is Kai Dührkop. The course starts with the tutorial on Monday, November 2.
  • Currents in Bioinformatics: The seminar takes place weekly on Tuesdays from 16:15 to 17:45 (online but live, Zoom session). The contact person is Fleming Kretschmer. Regular attendance is mandatory. We read and discuss current research papers. The course starts on Tuesday, November 3.
  • Datamining und Sequenzanalyse: The course is taught by Markus Fleischauer and takes place on Mondays 10:15-11:45 and Fridays 12:30-14:00. We will hold this course online as well, provided that all students have a suitable computer available for the programming assignments. A computer is suitable for the practical course if it meets these requirements. All software needed for the practical course is available for Linux, Mac and Windows and will be installed together during the course. If you cannot arrange a suitable computer, please contact Markus Fleischauer immediately. We will use Zoom. The link will be announced in good time. All information about the course will be available here. The course starts on Monday, November 2, at 10:15.

For individual problems and questions, please email Peter Dittrich (head of the degree program) or Sebastian Böcker. You can also arrange individual meetings via Zoom.

FSU disclaimer regarding lecture material, in particular the lecture videos:

  • Our courses and their recordings may use copyrighted material. Any use of this material, such as distribution or publication, is prohibited and may result in claims for injunctive relief and damages.

 

ZODIAC has appeared in Nature Machine Intelligence

Our article “Database-independent molecular formula annotation using Gibbs sampling through ZODIAC” has just appeared in Nature Machine Intelligence. Congrats to Marcus and all co-authors!

In short: Annotating the molecular formula of a small molecule is the first step towards its structural elucidation but remains highly challenging, particularly for “large compounds” above 500 Daltons. ZODIAC is a network-based algorithm for the de novo annotation (no database needed) of molecular formulas, and processes complete experimental LC-MS/MS runs. (No metabolite is an island.) In comparison to SIRIUS, previously best-of-class for this task, ZODIAC roughly halves the rate of false annotations. And sometimes, much more…

If you have problems accessing the paper: Here is a read-only version

ZODIAC is already available to users through SIRIUS 4.4. See also the designated ZODIAC page.

Full citation: M. Ludwig, L.-F. Nothias, K. Dührkop, I. Koester, M. Fleischauer, M.A. Hoffmann, D. Petras, F. Vargas, M. Morsy, L. Aluwihare, P.C. Dorrestein, and S. Böcker. Database-independent molecular formula annotation using Gibbs sampling through ZODIAC. Nat Mach Intell 2:629–641, 2020.

Feature-Based Molecular Networking appeared in Nature Methods

Congratulations to Louis-Félix Nothias from the group of Pieter Dorrestein: The article “Feature-based molecular networking in the GNPS analysis environment” has appeared in Nature Methods, and we are happy to be part of this research. (We are lagging behind a little bit with our news.)

In short, FBMN introduces chromatography separation into the molecular networking workflow of GNPS; features with similar mass (and potentially similar MS/MS) but different retention time are now treated separately. See the article for details.

Full citation: L.-F. Nothias, D. Petras, R. Schmid, K. Dührkop, J. Rainer, A. Sarvepalli, I. Protsyuk, M. Ernst, H. Tsugawa, M. Fleischauer, F. Aicheler, A.A. Aksenov, O. Alka, P.-M. Allard, A. Barsch, X. Cachet, A.M. Caraballo-Rodriguez, R.R. Da Silva, T. Dang, N. Garg, J.M. Gauglitz, A. Gurevich, G. Isaac, A.K. Jarmusch, Z. Kameník, K.B. Kang, N. Kessler, I. Koester, A. Korf, A. Le Gouellec, M. Ludwig, C. Martin H., L.-I. McCall, J. McSayles, S.W. Meyer, H. Mohimani, M. Morsy, O. Moyne, S. Neumann, H. Neuweger, N.H. Nguyen, M. Nothias-Esposito, J. Paolini, V.V. Phelan, T. Pluskal, R.A. Quinn, S. Rogers, B. Shrestha, A. Tripathi, J.J.J. van der Hooft, F. Vargas, K.C. Weldon, M. Witting, H. Yang, Z. Zhang, F. Zubeil, O. Kohlbacher, S. Böcker, T. Alexandrov, N. Bandeira, M. Wang, and P.C. Dorrestein. Feature-based molecular networking in the GNPS analysis environment. Nat Methods 17(9):905–908, 2020.

Introducing CANOPUS for comprehensive compound class annotation

We are happy to introduce CANOPUS, a tool for the comprehensive annotation of compound classes from MS/MS data (certain restrictions apply, see below). In principle, CANOPUS does something similar to CSI:FingerID: Whereas CSI:FingerID can tell you what substructures are part of the query compound, CANOPUS does so for compound classes. The differences between both tasks are subtle but have massive consequences. See this preprint for the details of this difference, how CANOPUS works, how well it works etc.

At present, CANOPUS predicts 1270 compound classes. In more detail, CANOPUS predicts ClassyFire compound classes. ClassyFire is not the first but, to the best of our knowledge, by far the most comprehensive approach to assign classes solely from structure. (This last point is key, as this allows us to assign thousands of classes for millions of molecular structures.) Please have a look there if you use CANOPUS: Certain compound class definitions may not be what you expect. For example, we found that many phytosteroids are classified as bile acids in ClassyFire. While the biochemical origin of both classes is very different, they are structurally very similar and, therefore, represented by the same class in the ClassyFire ontology.

You can download, install and use CANOPUS through SIRIUS 4.4. You will notice a new tab where you can access, for each compound, all compound classes it does or does not belong to (and, how sure we are about that). Fancier visualizations (see the preprint) will be made available with upcoming releases.

ps. Clearly, CANOPUS is comprehensive only within the limits of the LC-MS/MS technology: If a compound does not ionize, if no fragmentation spectrum is recorded in Data Dependent Acquisition, if a compound does not show any fragmentation, if multiple compounds are fragmented in a single spectrum etc, then CANOPUS cannot help you. We don’t do magic. Also, CANOPUS is limited by the available (structure and MS/MS) training data; but several years of thinking have been invested to get the most out of it.

SIRIUS 4.4 released

We are happy to announce that SIRIUS 4.4 is finally released. (Unfortunately, the MacOS version will have to wait a few more days.) There have been numerous changes and improvements, only a few of which can be mentioned here.

Probably the biggest change is that SIRIUS 4.4 now reads mzML files (“centroided” data) and processes complete LC-MS/MS datasets. You can use ProteoWizard to transform your dataset to mzML. This does not only make things easier for you; it also allows SIRIUS to extract isotope patterns and adduct information more thoroughly from the MS1 data. SIRIUS 4.4 also supports multi-run datasets and aligns runs.

If you are using the graphical user interface (GUI) you no longer have to care about installing (the correct version of) Java. It is part of the installed SIRIUS software.

SIRIUS 4.4 uses the same project space for the command-line (CLI) and the GUI version, allowing you to use the SIRIUS GUI to browse through results computed with the CLI. The GUI also allows you to save your project and reload it later, including all previously computed results. Finally, you can export summary CSV and mzTab-M files for downstream analysis.

CSI:FingerID also had some updates:

  • Additional large molecular substructures: Have a look at the Fingerprint tab in the SIRIUS GUI, filter for large substructures.
  • Standardization of molecular structures (mesomerism, charge etc) through PubChem. This does not only improve identification statistics by a few percentage points, but also gets rid of certain cases where CSI:FingerID was doing “strange things”. Unfortunately, PubChem keeps changing the standardization without giving big notice, so some issues remain; but the current situation is definitely better than no standardization.

More stuff:

  • There is currently no version for MacOS; we are sorry. Somehow, MacOS does not like our multithreading. At present, we do not have access to a Mac for debugging, thanks to Corona.
  • Please report bugs using the SIRIUS GitHub repository or by email. There will be numerous such bugs, as SIRIUS 4.4 again carries major improvements and transformations under the hood. Help us to make SIRIUS better.
  • To allow for a smooth transition, you can continue to use SIRIUS 4.0.1 and the corresponding CSI:FingerID web service for a couple of weeks.
  • SIRIUS 4.4 integrates ZODIAC and CANOPUS, see the separate news.
  • passatutto is integrated into SIRIUS 4.4, allowing you to generate your own spectral library decoy database for FDR estimation.
  • We have included a beautiful interactive fragmentation tree viewer.
  • There may be a few more releases of SIRIUS 4.4.x that ship those things which are done in principle.
  • Finally, we have not reported the number of CSI:FingerID queries for some time, so here we go: There have been 47 million CSI:FingerID queries. (Plus a few million we lost through a little scripting bug. Our bad.) That is roughly one query every 1.5 seconds since we reported one million queries in Feb 2018.
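
For the curious, the "one query every 1.5 seconds" figure can be checked with a few lines of arithmetic. This is only a rough sketch: the post says "Feb 2018" without a day, and the end date below is our assumption for illustration, so both endpoints are guesses.

```python
from datetime import date

# Assumed endpoints: the news item only says "Feb 2018" and carries no
# explicit date of its own, so both dates are illustrative guesses.
start = date(2018, 2, 1)
end = date(2020, 5, 1)

queries_then = 1_000_000    # reported in Feb 2018
queries_now = 47_000_000    # reported in this news item

# Seconds elapsed between the two reports.
elapsed_seconds = (end - start).days * 24 * 60 * 60

# Average time between two consecutive queries in that interval.
seconds_per_query = elapsed_seconds / (queries_now - queries_then)

print(f"{seconds_per_query:.2f} seconds per query")
```

Shifting either endpoint by a few weeks changes the result only in the second decimal place, so the "roughly 1.5 seconds" statement does not depend on the exact dates.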

 

SIRIUS 4.4 beta released

Some of you may have noticed that the SIRIUS 4.4 beta was released yesterday, April 17. This update is huge, so we are being particularly careful not to break too many things. (We will definitely break some things, so please report bugs using the SIRIUS GitHub repository or by email.) Here is what you can expect:

  • The official SIRIUS 4.4 release will happen in a few days.
  • Even after SIRIUS 4.4 has been officially deployed, you can continue to use SIRIUS 4.0.1 and the corresponding CSI:FingerID web service. We hope that this allows for a smooth transition.
  • SIRIUS 4.4 integrates ZODIAC for better molecular formulas.
  • SIRIUS 4.4 integrates CANOPUS for compound class assignments.
  • SIRIUS 4.4 now reads mzML files (“centroided” data) and processes complete LC-MS/MS datasets.
  • CSI:FingerID had some massive updates, including more and larger molecular properties and standardization of molecular structures.
  • SIRIUS 4.4 also supports multi-run datasets and aligns runs.
  • SIRIUS 4.4 uses the same project space for the command-line and the user interface version, allowing you to use the SIRIUS GUI to browse through results computed with the CLI.
  • passatutto is integrated into SIRIUS 4.4, allowing you to generate your own spectral library decoy database for FDR estimation.
  • If you wonder why we jump from version 4.0.1 to 4.4: There have been several internal releases in between.
  • A word of warning: Many features and changes have accumulated and there will be a few more releases (4.4.x) until the quiver is empty. For example, the structure database will change again as we have massive issues with the way PubChem handles structure standardization.

IMPRS application call for PhD students

The International Max Planck Research School at the Max Planck Institute for Chemical Ecology in Jena is looking for PhD students. One of the projects is from our group on “making SIRIUS and CSI:FingerID GCMS-ready”. Deadline is May 08, 2020.

SIRIUS and CSI:FingerID are the best-of-class tools for MS-based compound identification in metabolomics, natural products and related fields. More than one million compound queries have been submitted to our web service, from over 3000 users and 47 countries. See our recent publication in Nature Methods (Dührkop et al., 2019).

Currently, our tools can only process tandem mass spectrometry data; extending them to Gas Chromatography Electron Ionization appears natural, but comes with numerous challenging problems from algorithmics and machine learning. This will be done in cooperation with the group of Georg Pohnert, see his recent publication in Nature (Thume et al., 2018).

We are searching for motivated candidates from bioinformatics, machine learning, cheminformatics and/or computer science who want to work in this exciting, quickly evolving interdisciplinary field. Please contact Sebastian Böcker in case of questions.

Half a position is being paid by the IMPRS; this will be supplemented by funding from our chair to 2/3 TV-L E13. (Note that the cost of living in East Germany is still considerably lower than in West Germany.) Jena is a beautiful city and wine is grown in the region: https://www.youtube.com/watch?v=DQPafhqkabc.

IMPRS: http://imprs.ice.mpg.de/
MPI-CE: http://www.ice.mpg.de/
SIRIUS & CSI:FingerID: https://bio.informatik.uni-jena.de/software/sirius/
Literature: https://bio.informatik.uni-jena.de/publications/ and https://bio.informatik.uni-jena.de/textbook-algoms/

Jena: https://www.google.de/search?q=jena&tbm=isch&
https://www.study-in.de/en/discover-germany/german-cities/jena_26976.php
https://www.google.com/search?q=jena&tbm=isch

SIRIUS 4.4 is coming soon!

It’s been a while since SIRIUS 4 received its last update. We are excited to announce that SIRIUS 4.4 is coming soon.
It comes with many new features, e.g.:

  • Project-Space: A standardized persistence layer shared by CLI and GUI that makes both fully compatible.
  • Redesigned Command Line Interface: SIRIUS is now a toolbox that contains different sub-tools that can be combined to “tool-chains”.
  • New (and newly integrated) tools:
    • ZODIAC: Improve Molecular Formula Identifications by re-ranking SIRIUS molecular formula annotations using Bayesian statistics. ZODIAC optimizes annotations on a whole dataset taking advantage of the fact that compounds usually co-occur in a network of derivatives.
    • PASSATUTTO: Now part of SIRIUS; allows you to generate dataset-specific decoy spectral libraries from computed fragmentation trees.
    • lcms-align: SIRIUS supports mzML/mzXML format to process whole LC-MS/MS runs. The lcms-align preprocessing tool performs feature detection and feature alignment based on the available MS/MS spectra.
    • Other handy standalone tools, e.g. compound similarity calculation.

To provide user-friendly but also flexible and customizable access to the different tools, we completely redesigned the command line interface (CLI).
We know that this might break your workflows, so we provide an early access version of the CLI that you can use for testing and adapting your workflows:
https://bio.informatik.uni-jena.de/repository/list/dist-snapshot-local/de/unijena/bioinf/ms/sirius/4.4.0-SNAPSHOT/
You will also find an updated version of the manual, which is still a work in progress but already contains an updated section on the new CLI.

No worries: even after SIRIUS 4.4 is released (as soon as the GUI is ready), version 4.0.1 will still be available for some time.

If you find bugs or have any feedback, feel free to open an issue on the SIRIUS GitHub repository or contact us via email.

Preprint of ZODIAC now on bioRxiv

A preprint of our paper “ZODIAC: database-independent molecular formula annotation using Gibbs sampling reveals unknown small molecules.” is now available: https://doi.org/10.1101/842740

ZODIAC takes advantage of the fact that an organism produces related metabolites. ZODIAC builds upon SIRIUS and reranks molecular formula candidates, optimizing annotations on whole datasets. By applying ZODIAC to multiple datasets, we greatly increased the number of correct annotations and identified novel molecular formulas which are not even present in PubChem.
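
To give a feeling for what "reranking candidates on a whole dataset with Gibbs sampling" can look like, here is a deliberately tiny toy sketch. It is not the ZODIAC implementation: the compounds, candidate formulas, prior scores, the `bonus` table and the function names are all made up for illustration, and the real scoring is described in the preprint.

```python
import random
from collections import Counter

# Toy data, entirely made up for illustration.
# Each compound has candidate molecular formulas with a prior score.
candidates = {
    "A": {"C6H12O6": 0.4, "C7H16O3S": 0.6},
    "B": {"C6H10O5": 1.0},
}

# Compounds A and B are assumed to be related (one edge in the network).
edges = [("A", "B")]

# Hypothetical compatibility bonus for formula pairs on connected compounds,
# here for two formulas that differ by H2O.
bonus = {("C6H12O6", "C6H10O5"): 5.0}


def pair_weight(f, g):
    """Multiplicative weight for two formulas on neighboring compounds."""
    return bonus.get((f, g), bonus.get((g, f), 1.0))


def gibbs(candidates, edges, sweeps=2000, burn_in=200, seed=0):
    """Estimate per-compound marginals over candidates by Gibbs sampling."""
    rng = random.Random(seed)
    neighbors = {c: [] for c in candidates}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    # Start from the highest-prior candidate of each compound.
    state = {c: max(f, key=f.get) for c, f in candidates.items()}
    counts = {c: Counter() for c in candidates}
    for sweep in range(sweeps):
        for c, formulas in candidates.items():
            names = list(formulas)
            weights = []
            for f in names:
                w = formulas[f]
                for n in neighbors[c]:
                    w *= pair_weight(f, state[n])
                weights.append(w)
            # Resample this compound given its neighbors' current state.
            state[c] = rng.choices(names, weights=weights)[0]
            if sweep >= burn_in:
                counts[c][state[c]] += 1
    kept = sweeps - burn_in
    return {c: {f: k / kept for f, k in cnt.items()} for c, cnt in counts.items()}


marginals = gibbs(candidates, edges)
best_for_a = max(marginals["A"], key=marginals["A"].get)
print(best_for_a, marginals["A"])
```

In this toy example, the candidate with the lower prior score for compound A ends up ranked first, because it fits the formula of the related compound B. That is the intuition behind optimizing annotations on the whole dataset instead of one compound at a time.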

ZODIAC will be made available in an upcoming release of the SIRIUS software.

 

 

Meet us at GCB 2019

Sebastian, Kai, Martin and Marcus are attending the German Conference on Bioinformatics in Heidelberg. We look forward to a great conference.

New version of Lecture Notes on Algorithmic MS

I have just uploaded a new version (0.8.3) of the Lecture Notes on Algorithmic Mass Spectrometry. As expected, I did not have much time to work on it (them?) during lecture time, which is luckily over now. The update consists of many small improvements. Also, Magnus Palmblad was kind enough to give the isotope pattern sections an expert look. Unfortunately, the stuff that was missing from the previous version is still missing now…

Meet us at ISMB 2019

Meet Markus at the ISMB/ECCB 2019 in Basel.

On Tuesday, Markus will give a talk about “SIRIUS 4: turning tandem mass spectra into metabolite structure information”.
There is also a corresponding poster in Session A (J-06) which will be presented on Tuesday 6:00pm-8:00pm.

Meet us at Metabolomics 2019

Meet Marcus and Sebastian at the conference of the Metabolomics Society 2019.

On Monday and Tuesday, Marcus will present a poster (539) about SIRIUS 4 and turning tandem mass spectra into metabolite structure information.

DFG project on retention time/order prediction granted

The Deutsche Forschungsgemeinschaft has granted a project on retention time and order prediction for liquid chromatography. This is a joint project with Michael Witting, Helmholtz Zentrum München.

The idea of the project is to integrate retention times from liquid chromatography into the SIRIUS/CSI:FingerID identification pipeline. Literally hundreds of papers have been published on the topic of retention time prediction, but all of them fail to provide predictions that are transferable across chromatography conditions and compound classes; see Héberger’s review (Journal of Chromatography A, 2007) where he speaks rather frankly about the malpractices of publishing such RT-prediction methods. On the other hand, retention times can indeed be used to further boost CSI:FingerID’s identification performance. Also, transferable retention prediction is not impossible, as we have shown here. The trick is not to try to predict retention time (which is extremely dependent on instrument parameters etc) but rather retention order.
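
A small sketch may help to see why retention order is a more transferable target than retention time. The retention times and the transformation between the two "setups" below are invented for illustration, and the function names are ours; real changes in chromatographic conditions are not guaranteed to be this well-behaved.

```python
from itertools import combinations


def retention_order(times):
    """Return compound names sorted by retention time (elution order)."""
    return sorted(times, key=times.get)


def pairwise_order_agreement(times_a, times_b):
    """Fraction of compound pairs that elute in the same order in both runs."""
    pairs = list(combinations(sorted(times_a), 2))
    same = sum(
        (times_a[x] < times_a[y]) == (times_b[x] < times_b[y])
        for x, y in pairs
    )
    return same / len(pairs)


# Made-up retention times (minutes) for four hypothetical compounds.
run_1 = {"c1": 1.2, "c2": 3.5, "c3": 4.1, "c4": 7.8}

# The same compounds on a second, hypothetical setup: all times are
# stretched and shifted by a monotone (order-preserving) transformation.
run_2 = {name: 2.0 * t ** 1.3 + 0.5 for name, t in run_1.items()}

# Retention times differ a lot between the two runs ...
max_shift = max(abs(run_1[n] - run_2[n]) for n in run_1)

# ... but the elution order is unchanged.
agreement = pairwise_order_agreement(run_1, run_2)

print(f"largest retention time shift: {max_shift:.1f} min")
print(f"pairwise order agreement: {agreement:.2f}")
print(retention_order(run_1), retention_order(run_2))
```

A model trained to predict the times of `run_1` would be badly off on `run_2`, whereas a model that predicts which of two compounds elutes first would still be correct for every pair in this example.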

We are searching for a qualified and motivated PhD student who wants to accept this challenge. (S)he should be knowledgeable in machine learning and preferably also bioinformatics in general; biochemistry knowledge is clearly also a plus. We believe that this can be the next big thing to further push CSI:FingerID’s performance. Please contact Sebastian or Kathrin in case you are interested and qualified.