Im kommenden Wintersemester 2025/2026 wird wieder das Modul Bioinformatische Methoden in der Genomforschung angeboten. Im Sommersemester 2026 wird dann wieder das Modul Sequenzanalyse zu hören sein.
News
Retention time prediction is dead, long live retention order index prediction!
We just solved a problem that I never wanted to solve: That is, transferable retention time prediction of small molecules. They say that the best king is someone who does not want to be king; so maybe, the best problem solver is someone who does not want to solve the problem? Not sure about that. (My analogies have been quite monarchist lately; we are watching too much The Crown.)
What is transferable retention time prediction? Let’s assume you tell me some details about your chromatographic setup. We are talking about liquid, reversed-phase chromatography. You tell me what column you use, what gradient, what pH, maybe what temperature. Then, you say “3-Ketocholanic acid” and I say, “6.89 min”. You say “Dibutyl phthalate”, I say “7.39 min”. That’s it, folks. If you think that is not possible: Those are real-world examples from a biological dataset. The true (experimentally measured) retention times where 6.65 min and 7.41 min.
Now, you might say, “but how many authentic standards were measured on the same chromatographic setup so that you could do your predictions”? Because that is how retention time prediction works, right? You need authentic standards measured on the same system so that you can do predictions, right? Not for transferable retention time prediction. For our method, the answer to the above question is zero, at least in general. For the above real-world example, the answer is 19. Nineteen. In detail, 19 NAPS (N-Alkylpyridinium 3-sulfonate) molecules were measured in an independent run. Those standards were helpful, but we could have done without them. If you know a bit about retention time prediction: There was no fine-tuning of some machine learning model on the target dataset. No, sir.
Now comes the cool part: Our method for retention time prediction already performs better than any other method for retention time prediction. The performance is almost as good as that of best-in-class methods if we do a random (uniform) split of the data on the target system. Yet, we all know (or, we all should know) that uniform splitting of small molecules is not a smart idea; it results in massively overestimating the power of machine learning models. Now, our model is not trained (fine-tuned) on data from the target system. Effect: Our method works basically equally well for any splitting of the target dataset. And, we already outperform the best-in-class method.
Why am I telling you that? I mean, besides showing off? (I sincerely hope that you share my somewhat twisted humor.) And, why did I say “already” above? Thing is, retention time prediction is dead, and you might not want to ruin your PhD student’s career by letting him/her try to develop yet another machine learning model for retention time prediction. But now, there is a new, much cooler task for you and/or your PhD student!
- If you are into graph machine learning, have a look at the problem of retention order index (ROI) prediction. This problem is both challenging and relevant, I can guarantee you that. Different from somewhat ill-posed problems of predicting complex biological traits such as toxicity, there is good reason to believe that ROI prediction can ultimately be “solved”, meaning that predictions become more accurate than experimental errors. On the other hand, ROI prediction is challenging, so complex & intricate models can demonstrate their power. Data are already available for thousands of compounds, trending upward. In conjunction with our 2-step approach, a better model for ROI prediction will automatically result in a better method for transferable retention time prediction. To give a ballpark estimate on the relevance: There are more than 2.1 million scientific studies that use liquid chromatography for the analysis of small molecules, and the chromatography market has an estimated yearly volume of 4 to 5 billion Euro. Your method may allow users to predict retention times for compounds that are illegal to have in the lab (be it tetrodoxin or cocaine), for all compounds in all spectral libraries and molecular structure databases combined, for compounds that do not even exist or which we presently do not know about.
- If you are doing LC experiments, you might want to upload your LC dataset with authentic standards to RepoRT. I would assume that our 2-step method (see below for details) will be integrated into many computational tools for compound annotation shortly; at least for one method I am sure. Alternatively, you might want to use 2-step to plan you next experiment, maximizing the separation of your compounds. Now, the 2-step method can truly do transferable prediction of retention times: It can do the predictions even for columns that are altogether missing from the training data. But things get easier and predictions get more accurate if there is a dataset in the training data with exactly your chromatographic setup. Let’s not make things overly complicated: A grandmaster in chess may be able to play ten games blindfolded and simultaneously, but s:he would not do so unless it is absolutely necessary. To this end, upload your reference datasets now, get better predictions in the very near future! It does not matter what compounds are in your dataset, as long as it contains a reasonable number (say, 100) of compounds. Everything helps; and in particular, it will help your own predictions for your chromatographic setup!
Now, how does the 2-step method work? The answer is: We do not try to predict retention times, because that is impossible. Instead, we train a machine learning model that predicts a retention order index. A retention order index (ROI) is simply a real-valued number, such as 0.123 or 99999. The model somehow integrates the chromatographic conditions, such as the used column and the mobile phase. Given two compounds A and B from some dataset (experiment) where A elutes before B, we ask the model to predict two number x (for compound A) and y (for compound B) such that x < y holds. If we have an arbitrary chromatographic setup, we use the predicted ROIs of compounds to decide in which order the compounds elute in the experiment. And that is already the end of the machine learning part. But I promised you transferable retention time, and now we are stuck with some lousy ROI? Thing is, we found that you can easily map ROIs to retention times. All you need is a low-degree polynomial, degree two was doing the trick for us. No fancy machine learning needed, good old curve fitting and median regression will do the trick. To establish the mapping, you need a few compounds where you know both the (predicted) ROI and the (measured) retention time. These may be authentic standards, such as the 19 NAPS molecules mentioned above. If you do not have standards, you can also use high-confidence spectral library hits. This can be an in-house library, but a public library such as MassBank will also work. Concentrate on the 30 best hits (cosine above 0.85, and 6+ matching peaks); even if half of these annotations are wrong, you may construct the mapping with almost no error. And voila, you can now predict retention times for all compounds, for exactly this chromatographic setup! Feed the compound into the ROI model, predict a ROI, map the ROI to a retention time, done. Read our preprint to learn all the details.
At this stage, you may want to stop reading, because the rest is gibberish; unimportant “historical” facts. But maybe, it is interesting; in fact, I found it somewhat funny. I mentioned above that I never wanted to work on retention time prediction. I would have loved if somebody else would have solved the problem, because I thought my group should stick with the computational analysis of mass spectrometry data from small molecules. (We are rather successful doing that.) But for a very long time, there was little progress. Yes, models got better using transfer learning (pretrain/fine-tune paradigm), DNNs and transformers; but that was neither particularly surprising nor particularly useful. Not surprising because it is an out-of-the-textbook application of transfer learning; not particularly useful because you still needed training data from the target system to make a single prediction. How should we ever integrate these models with CSI:FingerID?
In 2016, Eric Bach was doing his Master thesis and later, his PhD studies on the subject, under the supervision of Juho Rousu at Aalto University. At some stage, the three of us came up with the idea to consider retention order instead of retention times: Times were varying all over the place, but retention order seemed to be easier to handle. Initially, I thought that combinatorial optimization was the way to go, but I was wrong. Credit to whom credit is due: Juho had the great idea to predict retention oder instead of retention time. For that, Eric used a Ranking Support Vector Machine (RankSVM); this will be important later. This idea resulted in a paper at ECCB 2018 (machine learning and bioinformatics are all about conference publications) on how to predict retention order, and how doing so can (very slightly) improve small molecule annotation using MS/MS. (Eric and Juho wrote two more papers on the subject.)
The fact that Eric was using a RankSVM had an interesting consequence: The RankSVM is trained on pairs of compounds, but what it is actually predicting is a single real-valued number, for a given compound structure. For two compounds, we then compare the two predicted numbers; the smaller number tells you the compound that is eluting first. In fact, the machine learning field of learning-to-rank does basically the same, in order to decide which website best fits with your Google DuckDuckGo query. I must say that back then, I did not like this approach at all; it struck me as overly complicated. But after digesting it for some time, I had an idea: What if we do not use ROIs to decide which compounds elutes first? What if we instead map these ROIs to retention times, using non-linear regression? Unfortunately, I never got Eric interested to go into this direction. And I tried; yes I tried.
About seven years ago, Michael Witting and I rather reluctantly decided to jointly go after the problem. (I would not dare to go after a problem that requires massive chemical intuition without an expert; if you are a chemist, I hope you feel the same about applying machine learning.) We had my crazy idea of mapping ROIs to retention times, and a few backup plans in case this would fail. We had two unsuccessful attempts to secure funding from the Deutsche Forschungsgemeinschaft. My favorite sentence from the second round rejection is, “this project should have been supported in the previous round as the timing would have been better”; we all remember how the field of retention time prediction dramatically changed from 2018 to 2019…? But as they say, it is the timing, stupid! We finally secured funding in May 2019; and I could write, “and the rest is history”. But alas, no. Because the first thing we had to learn was that there was no (more precisely, not enough) training data. Jan Stanstrup had done a wonderful but painstaking job to manually collect datasets “from the literature” for his PredRet method. Yet, a lot of metadata were missing (these data are not required for PredRet, we must not complain) and we had to dig into the literature and data to close those gaps. Also, we clearly needed more data to train a machine learning model. With this, our hunt for more data began, which we partly scraped from publications, partly measured on our own; kudos go to Eva Harrieder! This took much, much longer than we would ever have expected, and resulted in the release of RepoRT in 2023. Well, after that, it is the usual. It works, it doesn’t work, maybe it works; next, it is all bad again; the other methods are so much better; the other methods have memory leakage; now, we have memory leakage; and finally, success. Phew… And here it is.
What about HILIC? I have no idea; but I can tell you that it is substantially more complicated. In fact, there currently is not even a description of columns comparable to HSM and Tanaka for RP. Maybe, somebody wants to get something started? Also, HILIC columns are much more diverse than RP columns, meaning that we need more training data (currently, we have less) and a better way to describe the column than for RP. But hey, start now so that it is ready in 10 years; worked for us!
References
- Kretschmer, F., Harrieder, E.-M., Witting, M., Böcker, S. Times are changing but order matters: Transferable prediction of small molecule liquid chromatography retention times. ChemRxiv, https://doi.org/10.26434/chemrxiv-2024-wd5j8-v3, 2024. Version 3 from August 2025.
- Bach, E., Szedmak, S., Brouard, C., Böcker, S. & Rousu, J. Liquid-Chromatography Retention Order Prediction for Metabolite Identification. Bioinformatics 34. Proc. of European Conference on Computational Biology (ECCB 2018), i875–i883 (2018). [DOI]
- Bach, E., Rogers, S., Williamson, J. & Rousu, J. Probabilistic framework for integration of mass spectrum and retention time information in small molecule identification. Bioinformatics 37, 1724–1731 (2021). [DOI]
- Bach, E., Schymanski, E. L. & Rousu, J. Joint structural annotation of small molecules using liquid chromatography retention order and tandem mass spectrometry data. Nat Mach Intel 4, 1224–1237 (2022). [DOI]
- Stanstrup, J., Neumann, S. & Vrhovšek, U. PredRet: prediction of retention time by direct mapping between multiple chromatographic systems. Anal Chem 87, 9421–9428 (2015). [DOI]
- Snyder, L. R., Dolan, J. W. & Carr, P. W. The hydrophobic-subtraction model of reversed-phase column selectivity. J Chromatogr A 1060, 77–116 (2004). [DOI]
- Kimata, K., Iwaguchi, K., Onishi, S., Jinno, K., Eksteen, R., Hosoya, K., Araki, M. & Tanaka, N. Chromatographic Characterization of Silica C18 Packing Materials. Correlation between a Preparation Method and Retention Behavior of Stationary Phase. J Chromatogr Sci 27, 721–728 (1989). [DOI]
One billion queries served
Now, that number deserves a celebration: Our web services have served a billion queries! In detail, the CSI:FingerID web service processed 759 million queries, the CANOPUS web service 228 million, and MSNovelist has 17 million queries. During the last 12 months, 3816 users have processed their data using our web services. Excellent!
Remember when Gangnam Style was the first video on YouTube to hit 1 billion views? That was a big thing in December 2012; we do not expect the same amount of media coverage for our first billion, though. 😉 Indeed, one billion is a big number, and it is really hard to grasp it: See this beautiful video by Tom Scott to “understand” the difference between a million and a billion.
Anyways, many thanks to the team (mainly Markus Fleischauer) who make this possible, and who ensure that everything is running so smoothly! It should not be a surprise that processing 1.18 million queries per day (on average!), is a highly non-trivial undertaking. Auto-scalable, self-healing compute infrastructure and highly optimized database infrastructure are key ingredients for keeping small molecule annotations flowing smoothly to our global community.
PS. The one billion also cover the queries processed by the Bright Giant and their servers, who are offering these services to companies. One counter to rule them all…
Sebastian will receive an ERC advanced grant
Excellent news: Sebastian will receive an ERC Advanced Grant from the European Research Council!
In the project BindingShadows, we will develop machine learning models that predict whether some query molecule has a particular bioactivity (say, toxicity) or is binding to a certain protein, where the only information we have about the query molecule is its tandem mass spectrum. The clue is: The model is supposed to do that when the identity and structure of the molecule is not known; not known to us, not known to the person that measured the data, not known to mankind (aka novel compounds).
Clearly, we will not do that for one bioactivity or one protein, but for all of them in parallel. Evolution has, through variation and selection, optimized the structure of small molecules for tasks such as communication and warfare, and the pool of natural products is enriched with bioactive compounds. Our project will try to harvest this information.
Sounds interesting? Looking for a job? Stay tuned, as we will shortly start searching for PhD and postdoc candidates!
Meet Markus and Jonas at the 4th International Summer School in Non-Targeted Metabolomics Data Mining
Markus and Jonas will contribute to the summer school with a workshop about SIRIUS. For all interested, it takes place 18. to 22. August 2025 in Copenhagen, Denmark. We hope to meet many of you there!
Algorithmische Phylogenetik statt Sequenzanalyse im SoSe 2025
Im kommenden Semester wird statt Sequenzanalyse das Modul Algorithmische Phylogenetik angeboten. Sequenzanalyse wird dann wieder im SoSe 2026 zu hören sein.
SIRIUS 6.1.0 is Here!
We’re excited to announce the latest version of SIRIUS designed to improve your small molecule analysis workflow with a refreshed interface, streamlined processes, and several important fixes that ensure smoother performance and better data handling.
Highlights:
? New Color Scheme: A consistent and intuitive look throughout the entire identification process.
⚡ Streamlined Workflows: Tools are now automatically activated/deactivated to comply with SIRIUS workflow principles. You can also easily save and reload computation settings with our new preset function.
? Welcome Page Redesign: Get a quick overview of your account, connection details, and helpful resources to make the most of SIRIUS.
? Improved Result Views: Numerous enhancements and fixes across result views for a smoother analysis experience.
Times are changing but order matters
Our preprint Times are changing but order matters: Transferable prediction of small molecule liquid chromatography retention times is finally available on chemRxiv! Congratulations to Fleming and all authors.
In short, we show that prediction of retention times is a somewhat ill-posed problem, as retention times vary substantially even for nominally identical condition. Next, we show that retention order is much better preserved; but even retention order changes when the chromatographic conditions (column, mobile phase, temperature) vary. Third, we show that we can predict a retention order index (ROI) based on the compound structure and the chromatographic conditions. And finally, we show that one can easily map ROIs to retention times, even for target datasets that the machine learning model has never seen during training. Even for chromatographic conditions (column, pH) that are not present in the training data. Even for “novel compounds” that were not in the training data. Even if all of those restrictions hit us simultaneously.
In principle, this means that transferable retention time prediction (for novel compounds and novel chromatographic conditions) is “solved” for reversed-phase chromatography. There definitely is room for improvement, but that mainly requires more training data, not better methods. For that, we need your help: We need you, the experimentalists, to provide your reference LC-MS datasets. It does not matter what compounds are in there, it does not matter what chromatographic setup you used; as long as it is references data measured from standards (meaning that you know exactly each compound in the dataset), and as long as you tell us the minimum metadata, this will improve prediction quality.
You can post your data wherever you like, but you can also upload them to RepoRT so that in the future, scientists from machine learning can easily access them. We now provide a web app that makes uploading really simple. If you are truly interested in retention time prediction, then this should be well-invested 30 min.
Meet Jonas at the IMPRS course on mass spectrometry in the MPI for Chemical Ecology
Jonas will give a lecture on SIRIUS at the IMPRS course “Introduction to mass spectrometry techniques and analysis” (2-4 December) in the Max Planck Institute for Chemical Ecology.
Meet us at GCB 2024 in Bielefeld
At GCB 2024 (30th Sep. – 2nd Oct.), Fleming will give a presentation about RepoRT. Jonas, Leopold and Roberto will show their posters in the poster session.
Meet Sebastian at the SETAC Asia-Pacific Biennial Meeting in Tianjin
Sebastian will give a plenary talk at the SETAC Asia-Pacific 14th Biennial Meeting (September 21 to 25, 2024).
Meet Sebastian at the SMAP conference in Lille
Sebastian will give a plenary talk at the conference on Mass Spectrometry and Proteomic Analysis (SMAP), which will be held from September 16 to 19, 2024.
New Web-App for RepoRT
To make the submission process to RepoRT easier, we have launched a web app, available under https://rrt.boeckerlab.uni-jena.de. Submitters can upload their data without having to adhere to specific formatting requirements. Metadata and gradient information can be entered directly online. Part of the data validation happens during the uploading process, providing immediate feedback when something is wrong. Inside the web app, submitters can also keep track of all submissions and their current status. As we continue to work on improving the web app, your feedback is of course very welcome!

Meet Sebastian at the Okinawa Workshop for Computational Mass Spectrometry
On June 24, Sebastian will give a presentation on SIRIUS 6 at the Okinawa Workshop for Computational Mass Spectrometry.
Meet Jonas at the Computational Metabolomics Workshop in Aberdeen
At the Computational Mass Spectrometry Workshop (19-21 June 2024), Jonas will hold a presentation on SIRIUS 6.
Meet Sebastian, Kai and Fleming at the Metabolomics conference in Osaka
At the Metabolomics 2024 (16-20 June), Sebastian will give a keynote presentation. Kai and Fleming will hold a workshop on SIRIUS 6.
Meet Markus at the ASMS conference in Anaheim
Meet Markus at the ASMS conference 2024 in Anaheim between June 2-6!
Meet Andrés and Nils at the European School of Metabolomics in Granada
Meet us at the EUSM 2024 in Granada! Nils will give a hands-on session on SIRIUS.
Prague workshop will be streamed
Good news for those who want to attend the Prague workshop but were not admitted: The Prague workshop on computational mass spectrometry will be streamed, see here. You should get the necessary software installed beforehand if you want to join in.
IMPRS call for PhD student
The International Max Planck Research School at the Max Planck Institute for Chemical Ecology in Jena is looking for PhD students. One of the projects (Project 7) is from our group on “rethinking molecular networks”. Application deadline is April 19, 2024.
Mass spectrometry (MS) is the analytical platforms of choice for high-throughput screening of small molecules and untargeted metabolomics. Molecular networks were introduced in 2012 by the group of Pieter Dorrestein, and have found widespread application in untargeted metabolomics, natural products research and related areas. Molecular networking is basically a method of visualizing your data, based on the observation that similar tandem mass spectra (MS/MS) often correspond to compounds that are structurally similar. Constructing a molecular network allows us to propagate annotations through the network, and to annotate compounds for which no reference MS/MS data are available. Since its introduction, the computational method has received a few “updates”, including Feature-Based Molecular Networks and Ion Identity Molecular Networks. Yet, the fundamental idea of using the modified cosine to compare tandem mass spectra, has basically remained unchanged at the core of the method.
In this project, we want to “rethink molecular networks”, replacing the modified cosine by other measures of similarity, including fragmentation tree similarity, the Tanimoto similarity of the predicted fingerprints, and False Discovery Rate estimates. See the project description for details.
We are searching for a qualified and motivated candidate from bioinformatics, machine learning, cheminformatics and/or computer science who want to work in this exciting, quickly evolving interdisciplinary field. Please contact Sebastian Böcker in case of questions. Payment is 0.65 positions TV-L E13.
IMPRS: https://www.ice.mpg.de/129170/imprs
MPI-CE: https://www.ice.mpg.de/
SIRIUS & CSI:FingerID: https://bio.informatik.uni-jena.de/software/sirius/
Literature: https://bio.informatik.uni-jena.de/publications/ and https://bio.informatik.uni-jena.de/textbook-algoms/
Jena is a beautiful city and wine is grown in the region:
https://www.youtube.com/watch?v=DQPafhqkabc
https://www.google.com/search?q=jena&tbm=isch
https://www.study-in.de/en/discover-germany/german-cities/jena_26976.php