News | Lehrstuhl Bioinformatik Jena

False Discovery Rates for metabolite annotation: Why accurate estimation is impossible, and why this should not stop us from trying

01/07/2024 by Sebastian Böcker

I recently gave a talk at the conference of the Metabolomics Society. After the talk, Oliver Fiehn asked, “when will we finally have FDR estimation in metabolite annotation?” This is a good question: False Discovery Rates (FDR) and FDR estimation have been tremendously helpful in genomics, transcriptomics and proteomics. In particular, FDRs have been extremely helpful for peptide annotation in shotgun proteomics. Hence, when will FDR estimation finally become a reality for metabolite annotation from tandem mass spectrometry (MS/MS) data? After all, the experimental setups of shotgun proteomics and untargeted metabolomics, using Liquid Chromatography (LC) for separation, are highly similar. By the way, if I talk about metabolites below, this is meant to also include other small molecules of biological interest. From the standpoint of computational methods, they are indistinguishable.

My answer to this question was one word: “Never.” I was about to give a detailed explanation for this one-word answer, but before I could say a second word, the session chair said, “it is always good to close the session on a high note, so let us stop here!” and ended the session. (Well played, Tim Ebbels.) But this is not how I wanted to answer this question! In fact, I have been thinking about this question for several years now; as noted above, I believe that it is an excellent question. Hence, it might be time to share my thoughts on the topic. Happy to hear yours! I will start off with the basics; if you are familiar with the concepts, you can jump to the last five paragraphs of this text.

The question that we want to answer is as follows: We are given a bag of metabolite annotations from an LC-MS/MS run. For every query spectrum, this is the best fit (best-scoring candidate) from the searched database, and will be called “hit” in the following. Can we estimate the fraction of hits that are incorrect? Can we do so for a subset of hits? More precisely, we will sort annotations by some score, such as the cosine similarity for spectral library search. If we only accept those hits with score above (or below, it does not matter) a certain threshold, can we estimate the ratio of incorrect hits for this particular subset? In practice, the user selects an arbitrary FDR threshold (say, one percent), and we then find the smallest score threshold so that the hits with score above the score threshold have an estimated FDR below or equal to the FDR threshold. (Yes, there are two thresholds, but we can handle that.)
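
The two-threshold procedure can be sketched in a few lines of Python. This is only an illustration: `estimate_fdr` is a hypothetical callable (not part of any particular tool) that returns the estimated FDR of all hits with score at or above a given score threshold, and higher scores are assumed to be better.

```python
from typing import Callable, Optional, Sequence


def smallest_score_threshold(
    scores: Sequence[float],
    estimate_fdr: Callable[[float], float],
    max_fdr: float,
) -> Optional[float]:
    """Return the smallest observed score t such that the hits with
    score >= t have an estimated FDR <= max_fdr, or None if no such t exists.

    `estimate_fdr` is a hypothetical estimator supplied by the caller.
    """
    # Try candidate thresholds from the lowest score upwards, so that the
    # first threshold that satisfies the FDR threshold is the smallest one.
    for t in sorted(set(scores)):
        if estimate_fdr(t) <= max_fdr:
            return t
    return None
```

Scanning all observed scores, instead of stopping a binary search early, is deliberate: estimated FDR is not guaranteed to decrease monotonically as the score threshold increases.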

Let us start with the basic definition. For a given set of annotations, the False Discovery Rate (FDR) is the number of incorrect annotations, divided by the total number of annotations. FDR is usually recorded as a percentage. To compute FDR, you have to have complete knowledge; you can only compute it if you know upfront which annotations are correct and which are incorrect. Sad but true. This value is often referred to as “exact FDR” or “true FDR”, to distinguish it from the estimates we want to determine below. Obviously, you can compute exact FDR for metabolite annotations, too; the downside is that you need complete knowledge. Hence, this insight is basically useless, unless you are some demigod or all-knowing demon. For us puny humans, we do not know upfront which annotations are correct and which are incorrect. The whole concept of FDR and FDR estimation would be useless if we knew: If we knew, we could simply discard the incorrect hits and continue to work with the correct ones.
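
As a minimal sketch of this definition, the following Python function computes exact FDR for a set of annotations. It assumes the complete knowledge discussed above, here represented as a list of booleans (a hypothetical input format) telling us which annotations are correct.

```python
from typing import Sequence


def exact_fdr(is_correct: Sequence[bool]) -> float:
    """Exact FDR: number of incorrect annotations divided by the total number.

    Requires complete knowledge, that is, the true label of every annotation.
    """
    if not is_correct:
        raise ValueError("FDR is undefined for an empty set of annotations")
    incorrect = sum(1 for ok in is_correct if not ok)
    return incorrect / len(is_correct)
```

For example, one incorrect annotation among four gives an exact FDR of 0.25, usually reported as 25%.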

To this end, a method for FDR estimation tries to estimate the FDR of a given set of annotations, without having complete knowledge. It is important to understand the part that says, “tries to estimate”. Just because a method claims it is estimating FDR, does not mean it is doing a good job, or even anything useful. For example, consider a random number generator that outputs random values between 0 and 1: This IS an FDR estimation method. It is neither accurate nor useful, but that is not required in the definition. Also, a method for FDR estimation may always output the same, fixed number (say, always-0 or always-1). Again, this is a method for FDR estimation; again, it is neither accurate nor useful. Hence, be careful with papers that claim to introduce an FDR estimation method, but fail to demonstrate that these estimates are accurate or at least useful.

But how can we demonstrate that a method for FDR estimation is accurate or useful? Doing so is somewhat involved because FDR estimates are statistical measures, and in theory, we can only ask if they are accurate for the expected value of the estimate. Yet, if someone presents a novel method for FDR estimation, the minimum to ask for is a series of q-value plots that compare estimated q-values and exact q-values: A q-value is the smallest FDR for which a particular hit is part of the output. Also, you might want to see the distribution of p-values, which should be uniform. You probably know p-values; if you can estimate FDR, chances are high that you can also estimate p-values. Both evaluations should be done for multiple datasets, to deal with the stochastic nature of FDR estimation. Also, since you have to compute exact FDRs, the evaluation must be done for reference datasets where the true answer is known. Do not confuse “the true answer is known” and “the true answer is known to the method”; obviously, we do not tell our method for FDR estimation what the true answer is. If a paper introduces “a method for FDR estimation” but fails to present convincing evaluations that the estimated FDRs are accurate or at least useful, then you should be extremely careful.
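
The q-value definition translates into a running minimum. The sketch below assumes that hits are already sorted from best to worst score; the function names and the input format are illustrative choices, not taken from any particular tool. Feeding in exact FDRs yields exact q-values, feeding in estimated FDRs yields estimated q-values, and plotting one against the other gives the q-value plot mentioned above.

```python
from typing import List, Sequence


def fdr_by_rank(is_correct_sorted: Sequence[bool]) -> List[float]:
    """Exact FDR of the top-k hits for k = 1..n, hits sorted best score first."""
    fdrs = []
    incorrect = 0
    for k, ok in enumerate(is_correct_sorted, start=1):
        if not ok:
            incorrect += 1
        fdrs.append(incorrect / k)
    return fdrs


def q_values(fdrs: Sequence[float]) -> List[float]:
    """q-value of the hit at rank i: the smallest FDR over all cutoffs that
    still include this hit, i.e. min(fdrs[i:])."""
    q = [0.0] * len(fdrs)
    running_min = float("inf")
    # Walk from the worst hit to the best, carrying the minimum FDR seen so far.
    for i in range(len(fdrs) - 1, -1, -1):
        running_min = min(running_min, fdrs[i])
        q[i] = running_min
    return q
```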

Now, how does FDR estimation work in practice? In the following, I will concentrate on shotgun proteomics and peptide annotation, because this task is most similar to metabolomics. There, target-decoy methods have been tremendously successful: You transform the original database you search in (called target database in the following) into a second database that contains only candidates that are incorrect. This is the decoy database. The trick is to make the candidates from the decoy database “indistinguishable” from those in the target database, as further described below. In shotgun proteomics, it is surprisingly easy to generate a useful decoy database: For every peptide in the target database, you generate a peptide in the decoy database for which you read the amino acid sequence from back to front. (In more detail, you leave the last amino acid of the peptide untouched, for reasons that are beyond what I want to discuss here.)
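
A minimal sketch of this decoy construction, with peptides as plain strings of one-letter amino acid codes. Dropping decoys that coincide with a target peptide is an assumption added here for illustration; it is one simple way to avoid overlap between the two databases.

```python
from typing import Iterable, List


def reverse_decoy(peptide: str) -> str:
    """Read the sequence from back to front, leaving the last residue in place."""
    if len(peptide) < 2:
        return peptide
    return peptide[:-1][::-1] + peptide[-1]


def build_decoy_database(targets: Iterable[str]) -> List[str]:
    """Generate one decoy per target peptide.

    Decoys that are identical to some target peptide (for example, when the
    reversed part is a palindrome) are dropped; this is an illustrative choice.
    """
    target_set = set(targets)
    decoys = []
    for peptide in sorted(target_set):
        decoy = reverse_decoy(peptide)
        if decoy not in target_set:
            decoys.append(decoy)
    return decoys
```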

To serve its purpose, a decoy database must fulfill three conditions: (i) There must be no overlap between target and decoy database; (ii) candidates from the decoy database must never be the correct answer; and (iii) false hits in the target database have the same probability to show up as (always false) hits from the decoy database. For (i) and (ii), we can be relatively relaxed: We can interpret “no overlap” as “no substantial overlap”. This will introduce a tiny error in our FDR estimation, which is presumably irrelevant in comparison to the error that is an inevitable part of FDR estimation. For (ii), this means that whenever a search with a query spectrum returns a candidate from the decoy database, this is definitely not the correct answer. The most important condition is (iii), and if we look more precisely, we will even notice that we have to demand more: That is, the score distribution of false hits from the target database is identical to the score distribution of hits from the decoy database. If our decoy database fulfills all three conditions, then we can use a method such as Target-Decoy Competition and utilize hits from the decoy database to estimate the ratio of incorrect hits from the target database. Very elegant in its simplicity.
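
To make the idea concrete, here is a sketch of the estimate behind Target-Decoy Competition. It assumes that each query spectrum was searched against the combined target and decoy database and that only its best-scoring candidate was kept, represented as a hypothetical `(score, is_decoy)` pair. The plain ratio of decoy hits to target hits is used; other variants exist, for example adding one to the decoy count.

```python
from typing import Sequence, Tuple

Hit = Tuple[float, bool]  # (score, is_decoy); illustrative format


def tdc_fdr(hits: Sequence[Hit], score_threshold: float) -> float:
    """Estimated FDR among target hits with score >= score_threshold.

    Under condition (iii), false target hits and decoy hits are equally
    likely, so the number of accepted decoy hits estimates the number of
    accepted false target hits.
    """
    targets = sum(1 for score, is_decoy in hits
                  if score >= score_threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits
                 if score >= score_threshold and is_decoy)
    if targets == 0:
        return 0.0  # convention chosen here: nothing accepted, nothing false
    return min(1.0, decoys / targets)
```

Note that the estimate is only as good as condition (iii): if false target hits are not distributed like decoy hits, the function still returns a number, but that number need not be accurate.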

Enough of the details, let us talk about untargeted metabolomics! Can we generate a decoy database that fulfills the three requirements? Well, it is difficult, and you might think that this is why I argue that FDR estimation is impossible here: Looks like we need a method to produce metabolite structures that look like true metabolites (or, more generally, small molecules of biological interest; it does not matter). Wow, that is already hard: We cannot simply use random molecular structures, because they will not look like the real thing, see (iii). In fact, a lot of research is currently thrown at this problem, as it would potentially allow us to find the next super-drug. Also, how could we guarantee not to accidentally generate a true metabolite, see (ii)? Next, looks like we need a method to simulate a mass spectrum for a given (arbitrary) molecular structure. Oopsie daisies, that is also pretty hard! Again, ongoing research, far from being solved, loads of papers, even a NeurIPS competition coming up.

So, are these the problems we have to solve to do FDR estimation, and since we cannot do those, we also cannot do FDR? In other words, if — in a decade or so — we would finally have a method that can produce decoy molecular structures, and another method that simulates high-quality mass spectra, would we have FDR estimation? Unfortunately, the answer is, No. In fact, we do not even need those methods: In 2017, my lab developed a computational method that transforms a target spectral library into a decoy spectral library, completely avoiding the nasty pitfalls of generating decoy structures and simulating mass spectra. Also, other computational methods (say, naive Bayes) completely avoid generating a decoy database.

The true problem is that we are trying to solve a non-statistical problem with a statistical measure, and that is not going to work, no matter how much we think about the problem. I repeat the most important sentence from above: “False hits in the target database have the same probability to show up as hits from the decoy database.” This sentence, and all stochastic procedures for FDR estimation, assume that a false hit in the target database is something random. In shotgun proteomics, this is a reasonable assumption: The inverted peptides we used as decoys basically look random, and so do all false hits. The space of possible peptides is massive, and the target peptides lie almost perfectly separated in this ocean of non-occurring peptides. Biological peptides are sparse and well-separated, so to say. But this is not the case for metabolites and other molecules of biological interest. If an organism has learned how to make a particular compound, it will often also be able to synthesize numerous compounds that are structurally extremely similar. A single hydrogen replaced by a hydroxy group, or vice versa. A hydroxy group “moving by a single carbon atom”. Everybody who has ever looked at small molecules will have noticed that. Organisms invest a lot of energy to make proteins for exactly this purpose. But this is not only a single organism; the same happens in biological communities, such as the microbiota in our intestines or on our skin. It even happens when no organisms are around. In short: No metabolite is an island.

Sadly, this is the end of the story. Let us assume that we have identified small molecule A for a query MS/MS when in truth, the answer should be B. In untargeted metabolomics and related fields, this usually means that A and B are structurally highly similar, maybe to the point of a “moving hydroxy group”. Both A and B are valid metabolites. There is nothing random or stochastic about this incorrect annotation. Maybe, the correct answer B was not even in the database we searched in; potentially, because it is a “novel metabolite” currently not known to mankind. Alternatively, both compounds were in the database, and our scoring function for deciding on the best candidate simply did not return the correct answer. This will happen, inevitably: Otherwise, you again need a demigod or demon to build the scoring function. Consequently, speaking about the probability of an incorrect hit in the target database, cannot catch the non-random part of such incorrect hits. There is no way to come up with an FDR estimation method that is accurate, because the process itself is not stochastic. Maybe, some statisticians will develop a better solution some day, but I argue that there is no way to ever “solve it for good”, given our incomplete knowledge of what novel metabolites remain to be found out there.

Similar arguments, by the way, hold true for the field of shotgun metaproteomics: There, our database contains peptides from multiple organisms. Due to homologous peptide sequences in different organisms, incorrect hits are often not random. In particular, there is a good chance that if PEPTIDE is in your database, then so is PEPITDE. Worse, one can be in the database you search and one in your biological sample. I refrain from discussing further details; after all, we are talking about metabolites here.

But “hey!”, you might say, “Sebastian, you have published methods for FDR estimation yourself!” Well, that is true: Beyond the 2017 paper mentioned above, the confidence score of COSMIC is basically trying to estimate the Posterior Error Probability of a hit, which is a statistical measure again closely related to FDRs and FDR estimation. Well, you thought you got me there, did you? Yet, do you remember that above, I talked about FDR estimation methods that are accurate or useful? The thing is: FDR estimation methods from shotgun proteomics are enviably accurate, with precise estimates at FDR 1% and below. Yet, even if our FDR estimation methods in metabolomics can never be as accurate as those from shotgun proteomics, that does not mean they cannot be useful! We simply have to accept the fact that our numbers will not be as accurate, and that we have to interpret those numbers with a grain of salt.

The thing is, ballpark estimates can be helpful, too. Assume you have to jump, in complete darkness, into a natural pool of water below. I once did that, in a cave in New Zealand. It was called “cave rafting”, not sure if they still offer that type of organized madness. (Just checked, it does, highly recommended.) Back then, in almost complete darkness, our guide told me to jump; that I would fall for about a meter, and that the water below was two meters deep. I found this information to be extremely reassuring and helpful, but I doubt it was very accurate. I did not do exact measurements after the jump, but it is possible that I fell for only 85cm; possibly, the water was 3m deep. Yet, what I really, really wanted to know at that moment, was: Is it an 8m jump into a 30cm pond? I would say the guide did a good job. His estimates were useful.

I stress that my arguments should not be taken as an excuse to do a lazy evaluation. Au contraire! As a field, we must insist that all methods marketed as FDR estimation methods are evaluated extensively as such. FDR estimations should be as precise as possible, and evaluations as described above are mandatory. Because only this can tell us how far we can trust the estimates. Because only this can convince us that estimates are indeed useful. Trying to come up with more accurate FDR estimates is a very, very challenging task, and trying to do so may be in vain. But remember: We choose to do these things, not because they are easy, but because they are hard.

I got the impression that few people in untargeted metabolomics and related fields are familiar with the concept of FDR and FDR estimation for annotation. This strikes me as strange, given the success of these concepts in other OMICS fields. If you want to learn more, I do not have the perfect link or video for you, but I tried my best to explain it in my videos about COSMIC, see here. If you have a better introduction, let me know!

Funny side note: If you were using a metascore, your FDR estimates from target-decoy competition would always be 0%. As noted, a method for FDR estimation need not return accurate or useful values, and a broken scoring can easily also break FDR estimation. Party on!

Meet Sebastian at the Okinawa Workshop for Computational Mass Spectrometry

20/06/2024 by Wei Tang

On June 24, Sebastian will give a presentation on SIRIUS 6 at the Okinawa Workshop for Computational Mass Spectrometry.

Meet Jonas at the Computational Metabolomics Workshop in Aberdeen

12/06/2024 (updated 20/06/2024) by Wei Tang

At the Computational Mass Spectrometry Workshop (19-21 June 2024), Jonas will hold a presentation on SIRIUS 6.

Meet Sebastian, Kai and Fleming at the Metabolomics conference in Osaka

06/06/2024 (updated 20/06/2024) by Wei Tang

At the Metabolomics 2024 (16-20 June), Sebastian will give a keynote presentation. Kai and Fleming will hold a workshop on SIRIUS 6.

Meet Markus at the ASMS conference in Anaheim

28/05/2024 (updated 20/06/2024) by Wei Tang

Meet Markus at the ASMS conference 2024 in Anaheim between June 2-6!

Meet Andrés and Nils at the European School of Metabolomics in Granada

23/04/2024 by Nils Haupt

Meet us at the EUSM 2024 in Granada! Nils will give a hands-on session on SIRIUS.

Prague workshop will be streamed

14/04/2024 by Sebastian Böcker

Good news for those who want to attend the Prague workshop but were not admitted: The Prague workshop on computational mass spectrometry will be streamed, see here. You should get the necessary software installed beforehand if you want to join in.

IMPRS call for PhD student

06/03/2024 by Sebastian Böcker

The International Max Planck Research School at the Max Planck Institute for Chemical Ecology in Jena is looking for PhD students. One of the projects (Project 7) is from our group on “rethinking molecular networks”. Application deadline is April 19, 2024.

Mass spectrometry (MS) is the analytical platform of choice for high-throughput screening of small molecules and untargeted metabolomics. Molecular networks were introduced in 2012 by the group of Pieter Dorrestein, and have found widespread application in untargeted metabolomics, natural products research and related areas. Molecular networking is basically a method of visualizing your data, based on the observation that similar tandem mass spectra (MS/MS) often correspond to compounds that are structurally similar. Constructing a molecular network allows us to propagate annotations through the network, and to annotate compounds for which no reference MS/MS data are available. Since its introduction, the computational method has received a few “updates”, including Feature-Based Molecular Networks and Ion Identity Molecular Networks. Yet, the fundamental idea of using the modified cosine to compare tandem mass spectra has basically remained unchanged at the core of the method.

In this project, we want to “rethink molecular networks”, replacing the modified cosine by other measures of similarity, including fragmentation tree similarity, the Tanimoto similarity of the predicted fingerprints, and False Discovery Rate estimates. See the project description for details.

We are searching for a qualified and motivated candidate from bioinformatics, machine learning, cheminformatics and/or computer science who wants to work in this exciting, quickly evolving interdisciplinary field. Please contact Sebastian Böcker in case of questions. Payment is 0.65 positions TV-L E13.

IMPRS: https://www.ice.mpg.de/129170/imprs
MPI-CE: https://www.ice.mpg.de/
SIRIUS & CSI:FingerID: https://bio.informatik.uni-jena.de/software/sirius/
Literature: https://bio.informatik.uni-jena.de/publications/ and https://bio.informatik.uni-jena.de/textbook-algoms/

Jena is a beautiful city and wine is grown in the region:
https://www.youtube.com/watch?v=DQPafhqkabc
https://www.google.com/search?q=jena&tbm=isch
https://www.study-in.de/en/discover-germany/german-cities/jena_26976.php

Prague workshop on computational MS overbooked

10/02/2024 by Sebastian Böcker

Unfortunately, the Prague Workshop on Computational Mass Spectrometry (April 15-17, 2024) is heavily overbooked. The organizers will try to stream the workshop and the recorded sessions will be made available online, so check there regularly.

The workshop is organized by Tomáš Pluskal and Robin Schmid (IOCB Prague). Marcus Ludwig (Bright Giant) and Sebastian will give a tutorial on SIRIUS, CANOPUS etc.

Topics of the workshop are: MZmine, SIRIUS, matchms, MS2Query, LOTUS, GNPS, MassQL, MASST, and Cytoscape.

Meet Sebastian at the DGMS conference in Freising

10/02/2024 by Sebastian Böcker

Meet Sebastian at the conference of the Deutsche Gesellschaft für Massenspektrometrie (DGMS 2024) in Freising! The conference is March 10-13, and Sebastian will give a keynote talk on Tuesday, March 12.

BTW, another keynote will be given by our close collaboration partner Michael Witting, also on March 12.

RepoRT has appeared in Nature Methods

08/01/2024 by Sebastian Böcker

Our paper “RepoRT: a comprehensive repository for small molecule retention times” has just appeared in Nature Methods. This is joint work with Michael Witting (Helmholtz Zentrum München) as part of the DFG project “Transferable retention time prediction for Liquid Chromatography-Mass Spectrometry-based metabolomics“. Congrats to Fleming, Michael and all co-authors! In case you do not have access to the paper, you can find the preprint here and a read-only version here.

RepoRT is a repository for retention times, that can be used for any computational method development towards retention time prediction. RepoRT contains data from diverse reference compounds measured on different columns with different parameters and in different labs. At present, RepoRT contains 373 datasets, 8809 unique compounds, and 88,325 retention time entries measured on 49 different chromatographic columns using varying eluents, flow rates, and temperatures. Access RepoRT here.

If you have measured a dataset with retention times of reference compounds (that is, you know all the compounds’ identities) then please, contribute! You can either upload it to GitHub yourself, or you can contact us in case you need help. In the near future, a web interface will become available that will make uploading data easier. There are a lot of data in RepoRT already, but don’t let that fool you; to reach a transferable prediction of retention time and order (see below), this can only be the start.

If you want to use RepoRT for machine learning and retention time or order prediction: We have done our best to curate RepoRT. We have searched for and appended missing data and metadata; we have standardized data formats; we provide metadata in a form that is accessible to machine learning; etc. For example, we provide real-valued parameters (Tanaka, HSM) to describe the different column models, in a way that allows machine learning to transfer between different columns. Yet, be careful, as not all data are available for all compounds or datasets. For example, it is not possible to provide Tanaka parameters for all columns; please see the preprint on how you can work your way around this issue. Similarly, not all compounds that should have an isomeric SMILES do have an isomeric SMILES; see again the preprint. If you observe any issues, please let us know. See this interesting blog post and this paper as well as our own preprint on why providing “clean data” as well as “good coverage” are such important issues for small molecule machine learning.

“Bioinformatische Methoden in der Genomforschung” unfortunately has to be cancelled

29/09/2023 by Sebastian Böcker

As things currently stand, the module “Bioinformatische Methoden in der Genomforschung” (Bioinformatics Methods in Genome Research) unfortunately has to be cancelled in the winter semester 23/24. We are not allowed to fill the staff position that we absolutely need for it. We fought and argued and did everything we could, but in the end it was unfortunately in vain. The module is expected to take place next in the winter semester 25/26.

Warning: In the course of the austerity measures at FSU Jena, such short-notice cancellations may occur more frequently in the future.

Meet Sebastian at the Munich Metabolomics Meeting

28/09/2023 by Sebastian Böcker

Sebastian will give a tutorial on using SIRIUS and beyond at the Munich Metabolomics Meeting 2023.

HUMAN EU PhD position is still open

26/09/2023 (updated 04/01/2024) by Sebastian Böcker

Unfortunately, we have not been able to fill the PhD position for the HUMAN EU project so far. In case you are interested, please contact us!

Update: The position has been filled.

New video about studying bioinformatics

25/09/2023 by Sebastian Böcker

As part of the MINT Festival in Jena, Sebastian has recorded a new video about studying bioinformatics: “Kleine Moleküle. Was uns tötet, was uns heilt” (“Small molecules. What kills us, what heals us”). In terms of prior knowledge, it is aimed at upper secondary school students, but students from the grades below may be able to take something away from it, too. For now, the video can only be reached via this website, but it will be uploaded to YouTube shortly.

To finish, two questions: First, which cell types in the human body do not contain the (complete) DNA? I was deliberately sloppy there for reasons of time; there is no rule in biology without an exception. And second, NMR instruments are often not cooled with liquid nitrogen, but with… what? As so often, it is all about the money.

Meet Nils, Wei, Fleming and Sebastian at GCB 2023

10/09/2023 (updated 15/09/2023) by Sebastian Böcker

Meet us at the GCB 2023 in Hamburg! Nils, Wei and Fleming are going to present posters, and Sebastian will give a keynote talk.

Meet Sebastian (remotely) and Fleming at the Swedish TB meeting

21/08/2023 (updated 22/09/2023) by Fleming Kretschmer

Sebastian and Fleming will participate in the Swedish National Tuberculosis meeting: Sebastian will give a talk remotely and Fleming will be on site in Umeå to give a hands-on session on SIRIUS.

Visualizing the universe of small biomolecules

12/07/2023 (updated 08/12/2023) by Fleming Kretschmer

Have you ever wanted to look at the universe of biomolecules (small molecules of biological interest, including metabolites and toxins)? Have you ever wondered how your own dataset fits into this universe? In our preprint, we introduce a method to do just that, using MCES distances to create a UMAP visualization. Onto this visualization, any compound-dataset can be projected, see the interactive example below. In case it is slow, download the code here. Move your mouse over any dot to see the underlying molecular structure.

If you are wondering, “where did the lipids go?”, check this out. See the preprint on why we excluded them above. Looking at commonly used datasets for small molecule machine learning, big differences can be seen in the coverage of the biomolecule space. For example, the toxicity datasets Tox21 and ToxCast appear to rather uniformly cover the universe of biomolecules. In contrast, SMRT is a massive retention time dataset, but appears to be concentrated on a specific area of the compound space. The thing is: One must not expect a machine learning model trained on only a small part of the “universe of biomolecules”, to be applicable to the whole universe. This is a little too much to be asked. Hence, visualizing your data in this way may give you a better understanding of what your machine learning model is actually doing, where it will thrive and where it might fail.

To compare molecular structures, we compute the MCES (Maximum Common Edge Subgraph) of the two molecular structures. Doing so is not new, but comes at the price that computing a single distance is already an NP-hard problem, see below. Then, why on Earth are we not using Tanimoto coefficients computed from molecular fingerprints, just like everybody else does? Tanimoto coefficients and related fingerprint-based similarity and dissimilarity measures have a massive advantage over all other means of comparing molecular structures: As soon as you have computed the fingerprints of all molecular structures, computing Tanimoto coefficients is blindingly fast. Hence, if you are querying a database, molecular fingerprints are likely the method of choice. We ourselves have been and are heavily relying on molecular fingerprints: CSI:FingerID is predicting molecular fingerprints from MS/MS data, CANOPUS is predicting compound classes from molecular fingerprints, and COSMIC is using Tanimoto coefficients because they are, well, fast. Yet, if you have ever worked with molecular fingerprints and, in particular, Tanimoto coefficients yourself, you must have also noticed their peculiarities, quirks and shortcomings. In fact, from the moment people used Tanimoto coefficients, others have warned about these unexpected and highly undesirable behaviors; an early example is by Flower (1998). On the one hand, two compounds that differ by only one added hydroxy group can have a Tanimoto coefficient below 0.7. On the other hand, two highly different compounds, one half the size of the other, may also have a Tanimoto coefficient of 0.7. Look at the two examples below: According to the Tanimoto coefficient, the two structures on the left are less similar than the two on the right. Does that sound right? By the way: The same holds true for any fingerprint-based similarity or dissimilarity measure, and also for any other fingerprint type. These are examples but the problem is universal.
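
For readers who have not worked with it: the Tanimoto coefficient of two binary fingerprints is the number of bits set in both, divided by the number of bits set in at least one. The sketch below represents a fingerprint as the set of indices of its “on” bits; this toy representation, and the convention for two empty fingerprints, are assumptions for illustration and not how any specific cheminformatics library stores fingerprints.

```python
from typing import AbstractSet


def tanimoto(fp_a: AbstractSet[int], fp_b: AbstractSet[int]) -> float:
    """Tanimoto coefficient of two binary fingerprints, each given as the
    set of indices of its 'on' bits."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union
```

Once the fingerprints exist, each comparison is a handful of set (or bit) operations, which is where the speed advantage comes from. The value depends only on which bits are set, not on how many edits separate the two molecular graphs.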

In contrast, the MCES distance is much more intuitive to interpret, as it is the edit distance between molecule graphs and, hence, nicely represents our intuition of chemical reactions. For example, adding a hydroxy group results in an MCES distance of one. Don’t get us wrong: The MCES distance is not perfect, either. First and foremost, the MCES problem is NP-hard; hence, computing a single exact distance between two molecules might take days or weeks. We can happily report that we have “solved” this issue by introducing the myopic MCES distance: We first quickly compute a lower bound on the true distance. If this bound tells us that the true distance is larger than 20, then we would argue that knowing the exact value (maybe 22, maybe 25, maybe 32) is of little help: These two molecules are very, very different, full stop. But if the lower bound is small (say, at most 20), so that the true distance may be small as well, then we use exact computations based on solving an Integer Linear Program. With some more algorithm engineering, we were able to bring down computation time to fractions of a second. And that means that we were able to compute all distances for a set of 20k biomolecular structures, plus several well-known machine learning datasets, in reasonable time and on our limited compute resources. (Sadly, we still do not own a supercomputer.) You will not be able to do all-against-all with a million molecular structures, so if your research requires you to do so, you might have to stick with the Tanimoto coefficient, quirky as it is. Yet, we found that subsampling does indeed give us rather reproducible results, see Fig. 7 of the preprint (page 23).
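
The control flow of this “bound first, exact only if needed” idea can be sketched as follows. This is not the published implementation: `lower_bound` and `exact_distance` are hypothetical callables that the caller must supply (the real exact step solves an Integer Linear Program), and returning the bound itself for very different molecules is an assumption made for this sketch.

```python
from typing import Callable, Tuple, TypeVar

Mol = TypeVar("Mol")


def myopic_distance(
    a: Mol,
    b: Mol,
    lower_bound: Callable[[Mol, Mol], float],
    exact_distance: Callable[[Mol, Mol], float],
    threshold: float = 20.0,
) -> Tuple[float, bool]:
    """Return (distance, is_exact).

    If the cheap lower bound already exceeds `threshold`, the two molecules
    are very different and the expensive exact computation is skipped.
    Otherwise, the exact distance is computed.
    """
    bound = lower_bound(a, b)
    if bound > threshold:
        return bound, False  # assumption: report the bound, flagged as inexact
    return exact_distance(a, b), True
```

The design point is that the expensive solver only runs for pairs where the exact value carries information; for pairs that are far apart, a cheap certificate of “far apart” is enough.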

There are other shortcomings of the MCES distance: For example, it is not well-suited to capture the fact that one molecular structure is a substructure of the other. This is unquestioned, but what is also true, is: The MCES distance does not have peculiarities or quirks. The fact that it does not capture substructures, can be readily derived from its definition; this behavior is by design. In case you do not like the absolute MCES distance, because you think that large molecules are treated unfairly, then feel free to normalize it using the size of the molecules. Now that we can (relatively) swiftly compute the myopic MCES distance, we can play around with it.

We used UMAP (Uniform manifold approximation and projection) to visualize the universe of biomolecules but, honestly, we don’t care. You prefer t-SNE? Use that! You prefer a tree-based visualization? Use that! See the following comparison (from left to right UMAP, t-SNE and a Minimum Spanning Tree), created in just a few minutes. Or, maybe Topological Data Analysis? Fine, too! All those visualizations have their pros and cons, and one should always keep Pachter’s elephant in the back of one’s brain. But the thing is: We know that the space of molecular structures has an intrinsic structure, and we are merely using the different visualization methods to get a feeling for its intrinsic structure.

Now, one peculiarity of the above UMAP plots must be mentioned here: When comparing different ML training datasets, we re-used the UMAP embedding computed from the biomolecular structures alone (Fig. 3 in the preprint). Yet, UMAP will neatly integrate any new structures into the existing “compound universe” even if those new structures are very, very different from the ones that were used to compute the projection. This is by design: UMAP interpolates, all good. So, we were left with two options: Recompute the embedding for every subplot? This would allow us to spot if a dataset contains compounds very different from all biomolecular structures, but would result in a “big mess” and an uneven presentation. Or, should we keep the embedding fixed? This makes a nicer plot but hides “alien compounds”. We went with the second option solely because the overall plot “looks nicer”; in practice, we strongly suggest also computing a new UMAP embedding.
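Independently of any embedding, “alien compounds” can also be flagged directly from the distances. The sketch below is not UMAP and not a procedure from the preprint; it is a simple nearest-neighbor check, assuming a matrix of distances from each new compound to every reference compound and a hypothetical, user-chosen `cutoff`.

```python
from typing import List, Sequence


def find_alien_compounds(
    dist_to_reference: Sequence[Sequence[float]], cutoff: float
) -> List[int]:
    """Return indices of new compounds that are far from every reference compound.

    dist_to_reference[i][j] is the distance between new compound i and
    reference compound j. A compound counts as "alien" if even its nearest
    reference compound is farther away than `cutoff`.
    """
    return [
        i
        for i, row in enumerate(dist_to_reference)
        if len(row) == 0 or min(row) > cutoff
    ]
```

Such a check does not replace recomputing the embedding, but it gives a quick list of compounds that a fixed embedding would silently squeeze into the existing picture.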

In the preprint, we discuss two more ways to check whether a training dataset provides uniform coverage of the biological compound space: We examine the compound class distribution, to check whether certain compound classes are “missing” in our training dataset. And finally, we use the Natural Product-likeness score distribution to check for lopsidedness. All of that can give you ideas about the data you are working with. There have been numerous scandals about machine learning models repeating prejudice in the training data; don’t let the distribution of molecules in your training data lead you to conclusions which, on closer inspection, might be lopsided or even wrong.
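The compound class check can be sketched as a comparison of relative class frequencies. This is a minimal sketch, assuming that every compound already carries a class label (how labels are assigned is outside its scope); the function names and the `min_ratio` parameter are hypothetical and not taken from the preprint.

```python
from collections import Counter
from typing import Dict, Iterable, List


def class_fractions(classes: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each compound class in a dataset."""
    counts = Counter(classes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


def underrepresented_classes(
    reference: Iterable[str], training: Iterable[str], min_ratio: float = 0.5
) -> List[str]:
    """Classes whose share in the training data is below `min_ratio` times
    their share in the reference data, including classes missing entirely."""
    ref = class_fractions(reference)
    train = class_fractions(training)
    return sorted(
        name for name, frac in ref.items() if train.get(name, 0.0) < min_ratio * frac
    )
```

Called with the class labels of the biomolecular structures as `reference` and those of a training dataset as `training`, the function lists the classes that deserve a closer look.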

If you want to compute myopic MCES distances yourself, find the source code here. You will need a cluster node, or proper patience if you do computations on your laptop. All precomputed myopic MCES distances from the preprint can be found here. We may also be able to help you with further computations.

Retention time repository preprint out now

06/07/2023 by Sebastian Böcker

The RepoRT (well, Repository for Retention Times, you guessed it) preprint is available now. It has been a massive undertaking to get to this point; honestly, we did not expect it to be this much work. RepoRT collects retention times of diverse reference compounds measured on different columns with different parameters and in different labs. At present, RepoRT contains 373 datasets, 8,809 unique compounds, and 88,325 retention time entries measured on 49 different chromatographic columns using varying eluents, flow rates, and temperatures. Access RepoRT here.

If you want to do anything with the data, be our guests! It is available under the Creative Commons License CC-BY-SA.

If you want to use RepoRT for machine learning and retention time or retention order prediction, then, perfect! That is what we intended it for. 🙂 We have done our best to curate RepoRT: We have searched and appended missing data and metadata; we have standardized data formats; we provide metadata in a form that is accessible to machine learning; etc. For example, we provide real-valued parameters (Tanaka, HSM) to describe the different column models, in a way that allows machine learning to transfer between different columns. Yet, be careful, as not all data are available for all compounds or datasets. For example, it is not possible to provide Tanaka parameters for all columns; please see the preprint on how you can work your way around this issue. Similarly, not all compounds that should have an isomeric SMILES actually have one; see again the preprint. If you observe any issues, please let us know.

Back from ASMS 2023

13/06/2023 by Marcus Ludwig

Markus, Martin and Marcus have been at ASMS 2023 in Houston, USA.

Thanks everyone for the great discussions at our poster.