SIRIUS 4 End Of Life

After more than 4 successful years and over 300 million queries to our web services, SIRIUS 4 will reach its end of life on the 31st of December 2022.

What does this mean for you? We will shut down the web services behind SIRIUS 4, so fingerprint prediction, structure database search and compound class prediction will no longer be possible with SIRIUS 4. Switching to SIRIUS 5 will solve the problem for you. As always, if you have a running long-term project that extends beyond the end-of-life date and cannot switch to SIRIUS 5, please contact us.

We are part of the EU HUMAN doctoral network

We are happy to announce that we are part of the EU HUMAN doctoral network, funded by EU Horizon. The topic of the network is “Harmonising and Unifying Blood Metabolomic Analysis Networks”. In the very near future, we will officially advertise the doctoral positions that are part of this network on our website. My group is looking for a PhD student from bioinformatics, computer science or cheminformatics for the computational analysis of mass spectrometry data. The student is expected to have experience with and interest in the development and evaluation of computational methods and machine learning models. If you are interested, please contact Sebastian.

HUMAN is a multidisciplinary consortium that focuses on topics such as cross-laboratory comparisons, integration, co-evaluation, and proofing of liquid chromatography mass spectrometry for the analysis of human blood.

 

SIRIUS email account verification failed, what now?

Dear SIRIUS users,
when creating a new SIRIUS account, the link checkers of some email tools seem to open the verification link before the user can click it manually. In such cases, the link has already been used (and is therefore invalid) when the user clicks it, and the server returns an error message. However, the account has already been verified successfully, regardless of the error message. Just ignore the error and try logging in to SIRIUS.
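
The behavior described above can be illustrated with a toy model of a one-time verification link. This is only a sketch under assumptions: the class `VerificationService`, its methods and the example token are hypothetical and are not the actual SIRIUS server code. It merely shows why the second request fails although the account is already verified.

```python
class VerificationService:
    """Toy model of a one-time email verification link (hypothetical names)."""

    def __init__(self):
        self._pending = {}     # token -> account name, not yet verified
        self.verified = set()  # accounts that have been verified

    def issue_token(self, account, token):
        """Register a verification token for a newly created account."""
        self._pending[token] = account

    def open_link(self, token):
        """Simulate any client (link checker or human) opening the link."""
        account = self._pending.pop(token, None)
        if account is None:
            # The token was consumed by an earlier request or never existed.
            return "error: link already used or invalid"
        self.verified.add(account)
        return "ok: account verified"


service = VerificationService()
service.issue_token("alice@example.org", "abc123")

first = service.open_link("abc123")   # the email tool's link checker
second = service.open_link("abc123")  # the user clicking manually
```

Here `first` reports success and `second` reports an error, yet the account remains in `service.verified`, which matches the advice to ignore the error and simply log in.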

We are currently working on a solution to ensure that the verification can only be performed by the users themselves.

MSNovelist has appeared in Nature Methods

Congratulations to Michael “Michele” Stravs from the group of Nicola Zamboni at ETH Zürich: The article “MSNovelist: De novo structure generation from mass spectra” has appeared in Nature Methods, and we are thrilled to be part of this research.

MSNovelist workflow
MSNovelist transforms an MS/MS spectrum into a molecular fingerprint via the CSI:FingerID fingerprint prediction, then uses this fingerprint to generate molecular structures. Fig. by Michele Stravs, adapted from the MSNovelist Nature Methods paper.

In short, MSNovelist is a computational method that transforms the tandem mass spectrum of a small molecule into its molecular structure. Full stop. That sounds shocking and surprising, and in fact, I view this as the “final frontier” in small molecule mass spectrometry. It is understood that “certain restrictions apply”, as they say. But Michele has undertaken a huge amount of work to clearly show what the method can do and what it cannot; for example, an in-depth evaluation against a method that basically ignores MS/MS data when generating structures. That method works disturbingly well; but if you think about it for some time, it becomes clear why: I just say, “blockbuster metabolites”. Michele has written a very nice blog post where he explains in much detail what MSNovelist is all about. If I had to recap the method in one sentence, then this is it: MSNovelist gives you a head start for the de novo elucidation of a novel structure.

We will make MSNovelist available through SIRIUS in an upcoming release — hopefully, soon.

ps. Also read the article “Some assembly required” by Corey D. Broeckling.

Full citation: M. A. Stravs, K. Dührkop, S. Böcker, and Nicola Zamboni. MSNovelist: De novo structure generation from mass spectra. Nature Methods, 2022. https://doi.org/10.1038/s41592-022-01486-3

SIRIUS 5 is released!

We are happy to announce that a major version upgrade of SIRIUS is available! Scroll to the bottom to get a visual impression of the changes.

SIRIUS 5 now includes the following new features and improvements:

  • Lipid class annotation with El Gordo: Lipid structures that have the same molecular formula (usually belonging to the same lipid class) can be extremely similar to each other, often only differing in the position of the double bonds. These extremely similar structures may not even be differentiable by mass spectrometry at all. SIRIUS 5 now predicts the lipid class from the spectrum; when performing CSI:FingerID database search, all structure candidates that belong to this lipid class are tagged.
  • Sub-structure annotation with Epimetheus: For experimentalists, CSI:FingerID search may seem like a black box. If you want to perform manual validation of CSI:FingerID structure candidates, Epimetheus now provides a direct connection between structure candidates and your input MS/MS spectra. Sub-structures of the structure candidate of your choice are generated by a combinatorial fragmenter and assigned to peaks in the MS/MS spectrum. The new Epimetheus view allows you to directly visualize and inspect these sub-structure annotations.
  • Feature-rich spectrum viewer: Improved functionality of the SIRIUS spectrum viewer, including mirror plots of measured vs predicted isotope pattern.
  • New LC-MS view: A new view in SIRIUS 5 that shows the extracted-ion chromatogram of a compound, including its detected adducts and isotopes. A traffic light allows for quick spectral quality assessment.
  • CANOPUS now fully supports Natural Product Classes (NPC).
  • Advanced filtering options on the compound list, including the new lipid class annotations.
  • Support for additional spectrum file formats (.msp, massbank, .mat)

If you were previously using SIRIUS 4, please be aware of the following breaking changes:

  • User authentication: A user account and license are now needed to use the online features of SIRIUS. The license is free and automatically available for non-commercial use. If your account is not automatically verified because your non-commercial research institution is not whitelisted yet, please contact us.
  • New project space compression: Method level directories are now compressed archives to reduce the number of files and save storage.
  • Changed summary writing: Summary writing has been made a separate sub-tool (write-summaries). The format of the summary files has changed slightly.
  • Prediction / DB search split: The fingerid/structure sub-tool has been split into a fingerprint (fingerprint prediction) and a structure (structure db search) sub-tool. This allows you to recompute the database search without having to recompute the fingerprint and compound class predictions. It also allows you to compute CANOPUS compound class predictions without having to perform structure db search.
  • Updated fingerprints: Updated fingerprint vector. Fingerprint related results of SIRIUS 4 projects may have to be recomputed to perform certain analysis steps (e.g. recompute db-search). Reading the projects is still possible and formula results are not affected.
  • Custom database format change: Custom database format has changed. Custom databases need to be re-imported.
  • GUI column rename: Some views in the GUI have been renamed to better reflect their position and role in the workflow.

For a quick overview on these new features and changes, visit our YouTube channel. Please refer to our online documentation for a more comprehensive overview and help us squash all remaining bugs by contributing at our GitHub!

 

 

Chair is awarded Thuringian Research Prize 2022 for applied research

Today (06 April), Prof. Sebastian Böcker, Dr. Kai Dührkop, Dr. Markus Fleischauer, Dr. Marcus Ludwig and Martin Hoffmann were awarded the Thuringian Research Prize 2022 for applied research. This was announced by Thuringia’s Science Minister Wolfgang Tiefensee in a video presentation. The prize recognizes the development of machine learning methods for identifying small molecules, including CSI:FingerID, COSMIC and CANOPUS. These methods can be used via our software SIRIUS.

We feel very honored.

You can find the official announcement here.
The press release from Friedrich Schiller University Jena is here.

Thuringian Research Award. Image: Jens Meyer (University of Jena)
Group photo. Prof. Sebastian Böcker holds the prize in his hands. Image: Jens Meyer (University of Jena)

Why we do not use metascores

…and why you should also be very careful when doing so

Hi all, I (Sebastian) have recorded a talk about metascores which is now available from our YouTube channel at https://www.youtube.com/watch?v=mkfG6-ZqD0s. With “metascores”, I mean scores that are not based on the actual data (or metadata!) but rather on side information such as citation counts or production volumes of metabolites. See below for the distinction between metascores and metadata.

I have been thinking about recording such a talk for several years now. I never did, partly because I hoped that this topic would “go away” without me doing such a video. I was wrong: metascores are still in much use today. The other reason for not recording the talk was that the more I thought about metascores, the more problems came into my mind. So, I added more slides to the talk, and then I had to re-record the talk, and so on ad infinitum. I now present six problems in the video; I decided I better record it before a seventh problem pops up.

I want to make clear that there is nothing bad with metascores as long as you are using them for a confined application: That is, you want to identify one particular feature in your LC-MS run, and for that you need some candidate compounds to get things started. If this is what you are after, and the actual identification is performed by an independent method (say, buying a commercial standard and doing a spike-in experiment) then you can generate the sorted list of candidates by any method that suits you; that clearly includes metascores. But as soon as you are doing “untargeted metabolomics” or anything similar to that, and as soon as you are using annotations of an in silico method to derive downstream information, you are in trouble — as explained in the video.

I discuss six problems of metascores in the talk, and I thought I would also briefly discuss them here. But first, let us discuss metascores vs. metadata.

Metascores vs. metadata

I previously had some discussions about metascores, and I have come to believe that some people think highly of metascores because of the connection to metadata. Well, point is, this is merely a misunderstanding. Metascores and metadata have nothing in common but the prefix “meta”. Metadata is data about your data; it is already used by in silico methods, be it the mass accuracy of the measurement or the ion mode. Metascores — at least the ones I am aware of — use side information, information which has nothing to do with the actual experiment you are conducting. See here for details. Side note: Using such side information (priors) has been discussed repeatedly in other fields such as transcriptomics or proteomics, but has been abandoned everywhere else many years ago.

1st problem: Blockbuster metabolites

This is potentially the biggest single issue of metascores: You will annotate the same metabolites again and again. They are simply “so much cooler” than everything else that a method can basically ignore the data. Who will not love to watch another blockbuster movie? And who will not love to annotate another blockbuster metabolite? See here for details.
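
To make the effect above concrete, here is a toy sketch. Everything in it is an assumption for illustration: the candidate names, the scores, the citation counts and the formula of the hypothetical `metascore` function are invented and do not reproduce any published metascore. It only shows how a citation bonus can override a much better data-based score.

```python
import math

# Invented toy numbers: (name, data-based score, citation count).
candidates = [
    ("rare_but_correct", 0.90, 3),
    ("blockbuster", 0.40, 50000),
]


def metascore(data_score, citations, weight=0.2):
    """Hypothetical metascore: data-based score plus a citation-count bonus."""
    return data_score + weight * math.log10(1 + citations)


# Ranking by the data alone versus ranking by the metascore.
by_data = max(candidates, key=lambda c: c[1])[0]
by_meta = max(candidates, key=lambda c: metascore(c[1], c[2]))[0]
```

With these made-up values, `by_data` picks the rare candidate, whereas `by_meta` picks the blockbuster, even though its data-based score is less than half as good.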

2nd problem: Evaluation results are misleading

This is not so much a problem of metascores, but one that is caused by the interplay of metascores and the data we use for evaluations. In short, do not trust evaluations of metascores; the data used for evaluating them are basically from blockbuster metabolites. Which metascores will then correctly annotate, because they love to annotate blockbuster metabolites, and only blockbuster metabolites. See here for details.

3rd problem: Obfuscating good search results

When I say that metascore methods can basically ignore the MS/MS data, this is not as good as it may sound. These methods will obfuscate high-quality search results of an in silico method, and make it impossible for you to decide whether or not a particular search result is worth following up on. This issue gets dramatic if you use annotations to generate, say, statistics about the sample. In short: Never do any further analysis on annotations when a metascore was in play. See here for details.

4th problem: Why are you using MS/MS anyways?

It turns out that using a metascore, you can actually forget about MS/MS data; in evaluations, these data are no longer needed to reach good annotation rates. Isn’t that great news: We can do untargeted metabolomics and get away with LC-MS data, saving ourselves the trouble of recording MS/MS data at all! A classical win-win situation: Faster measurements and untargeted metabolomics. Citing Leonard Hofstadter: “Our babies will be smart and beautiful.” See here for details.

5th problem: You are not searching where you think you are

This problem makes me nervous, personally. We are basically saying we are searching throughout the whole planet Earth when in fact, we are searching only in our apartment. I doubt that I can get across the implications of doing so; but this is a horror for reproducibility, method disclosure etc. See here for details.

6th problem: Overfitting

But citations are a reasonable feature for compound annotation, right? And, metascores using citation numbers improve search results, right? Doesn’t that mean something? Short answer: No. We can also reach excellent search results with a metascore that is using moonstruck features such as “number of consonants in the PubChem synonyms”. See here for details.

I also have a few suggestions on how I would proceed, instead of using a metascore. I am convinced that these suggestions are not the final word; rather, they are meant as a starting point.

Hope this talk helps to clear the perception of this particular computational method. Best regards, Sebastian.

COSMIC has appeared in Nature Biotechnology

Our article “High-confidence structural annotation of metabolites absent from spectral libraries” has just appeared in Nature Biotechnology. Congrats to Martin and all co-authors!

In short, COSMIC allows you to assign confidence to structure annotations. For every structure annotated by CSI:FingerID, COSMIC provides a confidence score (a number between 0 and 1) that tells you how likely it is that this annotation is correct. This is similar in spirit to what is done in spectral library search: Not only is the cosine score used to decide which candidate best fits to the query spectrum; in addition, we use the cosine of the top-scoring candidate (the hit) to decide whether it is likely correct (say, above 0.8), incorrect (say, below 0.6) or in the “twilight” in-between. If you have been using CSI:FingerID for some time, you might have noticed that finding such thresholds is not possible for the CSI:FingerID score. COSMIC closes this gap and tells you if an annotation is likely correct or incorrect.
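
The spectral library analogy above can be sketched in a few lines. This is a simplified illustration, not the scoring used by any particular library search tool: peaks are matched by naive m/z binning, the function names and the bin width are assumptions, and the thresholds 0.8 and 0.6 are simply the example values from the text.

```python
import math


def _binned(spectrum, bin_width):
    """Sum intensities of a peak list [(mz, intensity), ...] per m/z bin."""
    bins = {}
    for mz, intensity in spectrum:
        key = int(round(mz / bin_width))
        bins[key] = bins.get(key, 0.0) + intensity
    return bins


def cosine_score(spec_a, spec_b, bin_width=0.01):
    """Cosine similarity of two peak lists after naive m/z binning."""
    a = _binned(spec_a, bin_width)
    b = _binned(spec_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_hit(score, accept=0.8, reject=0.6):
    """Map the cosine of the top hit to the three zones from the text."""
    if score > accept:
        return "likely correct"
    if score < reject:
        return "likely incorrect"
    return "twilight"


query = [(100.00, 50.0), (150.00, 100.0), (200.00, 25.0)]
identical = cosine_score(query, query)
disjoint = cosine_score(query, [(300.00, 10.0)])
```

The point of COSMIC is that no comparable pair of thresholds exists for the raw CSI:FingerID score, so a separate confidence score is needed to play the role of `classify_hit`.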

Doing so is undoubtedly convenient in practice; but this is not what COSMIC is all about. What we can do now is to sort all hits in a dataset or even a repository with respect to confidence, and then concentrate our downstream analysis on high-confidence annotations. Next, we can replace the “usual” structure databases we search in by a structure database made entirely from hypothetical structures generated by combinatorics, machine learning or in silico enzymatic reactions.

We demonstrate COSMIC’s power by generating a database of hypothetical bile acid structures, combinatorially adding amino acids to bile acid cores, yielding 28,630 plausible bile acid conjugate structures. We then searched query MS/MS data from a mouse fecal dataset in this combinatorial database, and used the COSMIC confidence score to distinguish between hits that are likely correct or incorrect. We manually evaluated the top 12 hits and found that 11 annotations (91.7%) were likely correct; two annotations were further confirmed using synthetic standards. All 11 bile acid conjugates are “truly novel”, meaning that we could not find those structures in PubChem or any other structure database (or publication). Whereas reporting 11 novel bile acid conjugates may appear rather cool, we argue it is even cooler that we did this without a biological hypothesis beyond “there might be bile acid conjugates out there which nobody knows about”; and that COSMIC found the top bile acid conjugate annotations in a fully automated manner and in a matter of hours.
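
The idea of a combinatorially generated database can be sketched as a Cartesian product of building blocks. This is a minimal illustration under assumptions: the two short lists below are placeholders and not the cores or amino acids used in the paper, names stand in for actual molecular structures, and the function `enumerate_conjugates` is hypothetical.

```python
from itertools import product

# Illustrative placeholder lists; not the building blocks used in the paper.
bile_acid_cores = ["cholic acid", "deoxycholic acid", "chenodeoxycholic acid"]
amino_acids = ["glycine", "phenylalanine", "tyrosine", "leucine"]


def enumerate_conjugates(cores, residues):
    """Combine every core with every residue and return the conjugate names."""
    return [f"{residue}-conjugated {core}" for core, residue in product(cores, residues)]


conjugates = enumerate_conjugates(bile_acid_cores, amino_acids)
```

A real pipeline would operate on structure representations with a cheminformatics toolkit rather than on names, but the combinatorial growth is the same: 3 cores and 4 residues already give 12 candidates.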

We have also annotated 2,666 LC-MS/MS runs from human samples with molecular structures which are currently absent from HMDB, and for which no MS/MS reference data are available; and finally, 17,414 LC-MS/MS runs with annotations for which no MS/MS reference data are available. We hope that some of them might be of interest to you.

If you have an idea of hypothetical structures, similar to the bile acid conjugates, to be searched against thousands of datasets, please let us know.

COSMIC’s confidence score has been available through SIRIUS since version 4.8; download here.

 

Happy 25 million queries, CANOPUS!

We are fully aware that this post is far less interesting to you than it is to us; but sometimes, proud parents just have to do what proud parents have to do: CANOPUS has passed 25 million queries! Congratulations! Wow, that was fast, the preprint appeared on bioRxiv only 14 months ago.

In this context, we can also report that CSI:FingerID has surpassed 120 million queries. Which basically means we missed the round anniversary. We are bad parents; but kids are sometimes growing so quickly, you turn around and they are past 100 million queries.

Have fun with our tools!

 

SIRIUS 4.0.1 End Of Life

After 2 and a half successful years and over 35 million predicted fingerprints, SIRIUS 4.0.1 will reach its end of life on Friday the 30th of April 2021.
What does this mean for you? We will shut down the web service for CSI:FingerID, so no fingerprint prediction and structure database search will be possible with SIRIUS 4.0.1 anymore.

SIRIUS 4.7.0 Released

We are happy to announce that SIRIUS 4.7.0 is now available for download. This release is all about fixing bugs and performance optimization. To all who had problems with the ILP solvers, a freezing GUI, high memory consumption or long running times: This update should make your life way easier. For a full list of changes see the Changelog.

We further integrated the option to compute fragmentation trees only with our heuristic algorithm (no ILP involved) to speed up molecular formula identification for high-mass compounds.
Together with applying timeouts at the compound level, this should make the processing of large datasets much more feasible.

 

SIRIUS screener
SIRIUS 4.7.0

Introducing COSMIC to assign confidence to annotations

We are happy to introduce COSMIC, a tool that allows you to assign confidence to structure annotations. For every structure annotated by CSI:FingerID, COSMIC provides a confidence score (a number between 0 and 1) that tells you how likely it is that this annotation is correct. This is similar in spirit to what is done in spectral library search: Not only is the cosine score used to decide which candidate best fits to the query spectrum; in addition, we use the cosine of the top-scoring candidate (the hit) to decide whether it is likely correct (say, above 0.8), incorrect (say, below 0.6) or in the “twilight” in-between. If you have been using CSI:FingerID for some time, you might have noticed that finding such thresholds is not possible for the CSI:FingerID score. COSMIC closes this gap and tells you if an annotation is likely correct or incorrect.

Deciding whether a certain CSI:FingerID hit is correct or incorrect is undoubtedly convenient in practice. But this is not what COSMIC is all about. What we can do now is to sort all hits in a dataset or even a repository with respect to confidence, and then concentrate our downstream analysis on high-confidence annotations. Next, we can replace the “usual” structure databases we search in by a structure database made entirely from hypothetical structures generated by combinatorics, machine learning or in silico enzymatic reactions.
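
Sorting hits by confidence and keeping only the best ones is simple to express in code. The following is a sketch under assumptions: the hit tuples and their values are invented, the function `high_confidence` is hypothetical, and the threshold of 0.8 is an arbitrary example rather than a recommended COSMIC cutoff.

```python
# Hypothetical hits: (feature id, candidate structure, confidence); values invented.
hits = [
    ("feature_1", "candidate_A", 0.93),
    ("feature_2", "candidate_B", 0.15),
    ("feature_3", "candidate_C", 0.71),
    ("feature_4", "candidate_D", 0.88),
]


def high_confidence(hits, threshold):
    """Return hits with confidence >= threshold, most confident first."""
    kept = [hit for hit in hits if hit[2] >= threshold]
    return sorted(kept, key=lambda hit: hit[2], reverse=True)


top = high_confidence(hits, threshold=0.8)
```

Downstream analysis would then use only `top`, here the hits for `feature_1` and `feature_4`, instead of every top-ranked candidate regardless of its reliability.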

We demonstrate COSMIC’s power by generating a database of hypothetical bile acid structures, combinatorially adding amino acids to bile acid cores, yielding 28,630 plausible bile acid conjugate structures. We then searched query MS/MS data from a mouse fecal dataset in this combinatorial database, and used the COSMIC confidence score to distinguish between hits that are likely correct or incorrect. We manually evaluated the top 12 hits and found that 11 annotations (91.7%) were likely correct; two annotations were further confirmed using synthetic standards. All 11 bile acid conjugates are “truly novel”, meaning that we could not find those structures in PubChem or any other structure database (or publication). Whereas reporting 11 novel bile acid conjugates may appear rather cool, we argue it is even cooler that we did this without a biological hypothesis beyond “there might be bile acid conjugates out there which nobody knows about”; and that COSMIC found the top bile acid conjugate annotations in a fully automated manner.

We have also annotated 2,666 LC-MS/MS runs from human samples with molecular structures which are currently absent from HMDB, and for which no MS/MS reference data are available; and finally, 17,414 LC-MS/MS runs with annotations for which no MS/MS reference data are available. We hope that some of them might be of interest to you.

See the COSMIC preprint for details, and the COSMIC web page for further information. Update: COSMIC is now available through SIRIUS 4.8 and above, download here.

                                                                      

 

Video Behind the Scenes: CSI:FingerID

There is a new video available, and it finally explains CSI:FingerID in much detail (possibly too much detail: the video is more than 2 hours long). It covers everything from general thoughts and considerations about in silico methods and method evaluation to the details of molecular fingerprints, FingerID and, finally, CSI:FingerID. I am sorry for the bad audio quality; I am still using my built-in laptop mic.

 

SIRIUS online documentation now available!

We are happy to announce that the new online documentation for SIRIUS is now available at https://boecker-lab.github.io/docs.sirius.github.io/.

The content is completely written in Markdown which makes contributions by the community very easy. No programming skills required!
Help us with your contributions to make this documentation more comprehensive and useful for the community. See our GitHub repository for detailed information on how to contribute.

Classes for the masses: CANOPUS has appeared in Nature Biotechnology

Our article “Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra” has just appeared in Nature Biotechnology. Congrats to Kai and all co-authors!

In short: CANOPUS is a computational tool for systematic compound class annotation. It uses a deep neural network to predict 2,497 compound classes from fragmentation spectra, including all biologically relevant classes. From the machine learning perspective, the interesting part is that different levels of the neural network are trained using different data (heterogeneous training). CANOPUS explicitly targets compounds for which neither spectral nor structural reference data are available, and even predicts classes completely lacking tandem mass spectrometry training data. In evaluation using reference data, CANOPUS reached very high prediction performance (average accuracy of 99.7% in cross-validation) and outperformed four (rather advanced) baseline methods. We used CANOPUS to investigate the effect of microbial colonization in the mouse digestive system, to analyze the chemodiversity of different Euphorbia plants, and for the structural elucidation of a novel marine natural product.

CANOPUS is already available to users through SIRIUS 4.5, which was released last Thursday. See also the designated CANOPUS page. A view-only version of the article is available here.

Full citation: K. Dührkop, L.-F. Nothias, M. Fleischauer, R. Reher, M. Ludwig, M. A. Hoffmann, D. Petras, W. H. Gerwick, J. Rousu, P. C. Dorrestein, and S. Böcker. Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nat Biotechnol, 2020. https://doi.org/10.1038/s41587-020-0740-8

 

SIRIUS 4.5 released

We are happy to announce that a new version of SIRIUS is available. With that, CANOPUS now supports negative ion mode data. Additionally, we included more structure databases CSI:FingerID can search in, such as COCONUT (Sorokina & Steinbeck, 2020) and NORMAN (Brack et al., 2012). And in case an important database is missing: With the new version, you can import custom databases using the GUI.

Even more features:

  • All molecular structures have been standardized using the PubChem standardization service, to make structures more consistent. This update was already reported for version 4.4 but kept bugging us; it should now be solved for good. The standardization has a (small but measurable) positive impact on CSI:FingerID’s performance. More importantly, you will find fewer cases where CSI:FingerID is doing “something really strange”; this strange behavior was often due to un-standardized structures.
  • Breaking news: We renamed a few columns in the SIRIUS project space (see Changelog), to make column names more descriptive. Sorry about that; please make sure your downstream analysis is reading in the right columns.
  • CSI:FingerID now uses the molecular formula-specific Bayesian network scoring from our ISMB 2018 publication. Integrating this new score was a huge effort, but again has a positive impact on CSI:FingerID’s performance.
  • To allow for a smooth transition, you can continue to use SIRIUS 4.4 and the corresponding CSI:FingerID web service until November the 30th.
  • Please help us to make SIRIUS great again: Report bugs using the SIRIUS GitHub repository, or send us an email.

 

Qemistree has appeared in Nature Chemical Biology

Congratulations to Anupriya Tripathi from the group of Pieter Dorrestein: The article “Chemically informed analyses of metabolomics mass spectrometry data with Qemistree” has appeared in Nature Chemical Biology, and we are happy to be part of this research.

In short, Qemistree is a data exploration strategy based on the hierarchical organization of molecular fingerprints predicted from fragmentation spectra. Qemistree allows mass spectrometry data to be represented in the context of sample metadata and chemical ontologies.

Full citation: A. Tripathi, Y. Vázquez-Baeza, J. M. Gauglitz, M. Wang, K. Dührkop, M. Nothias-Esposito, D. D. Acharya, M. Ernst, J. J. J. van der Hooft, Q. Zhu, D. McDonald, A. D. Brejnrod, A. Gonzalez, J. Handelsman, M. Fleischauer, M. Ludwig, S. Böcker, L.-F. Nothias, R. Knight, and P. C. Dorrestein. Chemically informed analyses of metabolomics mass spectrometry data with Qemistree. Nat Chem Biol, 2020.