CDK used in SIRIUS 3: metabolomics tools from Germany

Screenshot from the SIRIUS 3 Documentation.
License: unknown.
It has been ages since I blogged about work I heard about and think should receive more attention. So, I'll try to pick up that habit again.

After my PhD research (about machine learning (chemometrics, mostly), crystallography, QSAR) I first went into the field of metabolomics, because it combines core chemistry with the complexity of biology. My first position was with Chris Steinbeck, in Cologne, within the bioinformatics institute led by Prof. Schomburg (of the BRENDA database). During that year, I worked in a group that worked on NMR data (NMRShiftDb, Dr. Stefan Kuhn), Bioclipse (collaboration with Ola Spjuth), and, of course, the Chemistry Development Kit (see our new paper).

This new paper, actually, introduces functionality that was developed in that year, for example, work started by Miguel Rojas-Chertó. This includes the work on atom types, which we needed to handle radicals, lone pairs, etc., for delocalisation. It also includes work around handling molecular formulas and calculating molecular formulas from (accurate) molecular masses. For the latter, more recent work improved even further on the earlier work.

So, whenever metabolomics work is published and they use the CDK, I realize that what the CDK does has impact. This week Google Scholar alerted me about a user guidance document for SIRIUS 3 (see the screenshot). Seems really nice (great) work from Sebastian Böcker et al.!

It also makes me happy, as our Faculty of Health, Medicine, and Life Sciences (FHML) is now part of the Netherlands Metabolomics Center, and as we recently published an article with our vision of a stronger, more FAIR European metabolomics community.

GenX spill, national coverage, but where is the data

First (I have never blogged much about risk and hazard), I am not a toxicology expert nor a regulator. I have the deepest respect for both, as these studies are among the most complex ones I am aware of. It makes rocket science look dull. However, I have quite some experience in the relation of chemical structure to properties and with knowledge integration, which is a prerequisite for understanding that relation. Anything I do does not say what the right course of action is. Any new piece of knowledge (or technology) has pros and cons. It is science that provides the evidence to support finding the right balance. It is science I focus on.

The case
The AD national newspaper reported spilling of the compound with the name GenX in the environment and reaching drinking water. This was picked up by other newspapers, like de VK. The chemistry news outlet C2W commented on the latter on Twitter:

Translated, the tweet reports that we do not know if the compound is dangerous. Now, to me, there are then two things: first, any spilling should not happen (I know this is controversial, as people are more than happy to repeatedly pollute the environment, just because of self-interest and/or laziness); second, what do we know about the compound? In fact, what is GenX even? It certainly won't be "generation X", though we don't actually know the hazard of that either. (We have IUPAC names, but just like with the ACS disclosures, companies like to make up cryptic names.)

But having worked on predictive toxicology and data integration projects around toxicology, and just having a chemical interest, I started out searching for what we know about this compound.

Of course, I need an open notebook for my science, but I tend to be sloppy and mix up blog posts like this, with source code repositories, and public repositories. For new chemicals, as you could read earlier this weekend, Wikidata is one of my favorites (see also doi:10.3897/rio.1.e7573). Using the same approach as for the disclosures, I checked if Wikidata had entries for the ammonium salt and the "active" ingredient FRD-903 (to be fair, chemically they are different, and so may be their hazard and risk profiles). Neither existed, so I added them using Bioclipse and QuickStatements (a wonderful tool by Magnus Manske): GenX and FRD-903. So, a seed of knowledge was planted.
    A side topic... if you have not looked at yet, please do. It allows you to annotate (yes, there are more tools that allow that, but I like this one), which I have done for the VK article:

I had a look around on the web for information, and there is not a lot. A Wikidata page with further identifiers then helps tracking your steps. Antony Williams, previously of ChemSpider fame, now working on the EPA CompTox Dashboard, added the DTX substance IDs, but the entries in the dashboard will not show up for another bit of time. For FRD-903 I found growth inhibition data in ChEMBL.

But Nina Jeliazkova pointed me to her LRI AMBIT database (poster abstract doi:10.1016/j.toxlet.2016.06.1469, PDF links) that makes (public) data from ECHA available from REACH dossiers in a machine readable way (see this ECHA press release), using their AMBIT software (doi:10.1186/1758-2946-3-18). (BTW, this makes the legal hassle Hartung had last year even more interesting, see doi:10.1038/nature.2016.19365). After creation of a free login, you can find a full (public) dossier with information about the toxicology of the compound (toxicity, ecotoxicity, environmental fate, and more):

I reported this slide, as the worry seems to be about drinking water, so oral toxicity seems appropriate (note, this is only acute toxicity). The LD50 is the median lethal dose, but is only measured for mouse and rat (these are models for human toxicity, but only models, as humans are just not rats; well, not literally, anyway). Also, >1 gram per kilogram body weight ("kg bw"; assumption) seems pretty high. In my naive understanding, the rat may be the canary in the coal mine. But let me refrain from making any conclusions. I leave that to the experts on risk management!

Experts like those from the Dutch RIVM, which wrote up this report. One piece of information they say is missing is that of biodistribution: "waar het zich ophoopt", or in English, where the compound accumulates.

The ACS Spring disclosures of 2017 #1

At the American Chemical Society meetings drug companies disclose recent new drugs to the world. Normally, the chemical structures are already out in the open, often as part of patents. But because these patents commonly discuss many compounds, the disclosures are a big thing.

Now, these disclosure meetings are weird. You will not get InChIKeys (see doi:10.1186/s13321-015-0068-4) or something similar. No, people sit down with paper, manually redraw the structure. Like Carmen Drahl has done in the past. And Bethany Halford has taken over that role at some point. Great work from both! The Chemical & Engineering News has aggregated the tweets into this overview.

Of course, a drug structure disclosure is not complete if it does not lead to deposition in databases. The first thing is to convert the drawings into something machine readable. And thanks to the great work from John May on the Chemistry Development Kit and the OpenSMILES team, I'm happy with this being SMILES. So, we (Chris Southan and I) started a Google Spreadsheet with CCZero data:

I drew the structures in Bioclipse 2.6.2 (which has CDK 1.5.13) and copy-pasted the SMILES and InChIKey into the spreadsheet. Of course, it is essential to get the stereochemistry right. The stereochemistry of the compounds was discussed on Twitter, and we think we got it right. But we cannot be 100% sure. For that, it would have been hugely helpful if the disclosures included the InChIKeys!

As I wrote before, I see Wikidata as a central resource in a web of linked chemical data. So, using the same code I used previously to add disclosures to Wikidata, I created Wikidata items for these compounds, except for one that was already in the database (see the right image). The code also fetches PubChem compound IDs, which are also listed in this spreadsheet.

The Wikidata IDs link to the SQID interface, giving a friendly GUI, one that I actually brought up before too. That said, until people add more information, it may be a bit sparsely populated:

But others are working on this series of disclosures too, and keep an eye on this blog post, as others may follow up with further information!

Closed access book chapters, Bookmetrix, and job creations

Enjoying my Saturday morning (you can actually track down that I write more blog posts then than at any other time of the week) with a coffee (no, not beer, Christoph). Wanted to complete my Scholia profile (great work by Finn, arxiv:1703.04222, happy to have contributed ideas and small patches) a bit more (or perhaps that of the Journal of Cheminformatics), as that relaxes me, and nicely complements rerunning some Bioclipse scripts to add metabolite/compound data to Wikidata (e.g. this post). Because this afternoon I want to do some serious work, like write up outlines for a few cool grant applications. And if lucky, I may be able to do a bit of work on this below-the-radar project.

So, I started updating the "full work available at" statement for a peer-reviewed IEEE paper (doi:10.1109/BIBM.2014.6999367), as it is not gold Open Access, and I have to rely on green Open Access. Then I headed over to my ImpactStory profile and ran into a closed access book chapter with Tony, Sean, and Ola (doi:10.1007/978-1-62703-050-2_10). But I have no idea if I can put online a green Open Access version of this book chapter.

Now, why I am blogging this (and meanwhile, adding four new DTXSIDs to Wikidata) comes down to two observations. First, I had not blogged about Bookmetrix yet, a cool project that reports the impact of book chapters. The ROI on writing book chapters I always considered as not so high, but then I saw the #altmetrics for this chapter:

Five citations is not a lot, but then I do not cite book chapters much either. But look at that number of downloads, 2.39 thousand! Wow!

But there is another angle to that. We regularly report our societal impact, nowadays. It's part of the Dutch Standard Evaluation Protocol, or at least selected by our research institute as something to assess researchers on. Hang on, no, citations are not part of that category. But this is: the chapter is sold for about 50 euro. Seriously? Yes, seriously. And apparently 2.39K people bought this chapter. I am not sure if I need to assume that this is mostly people buying the full book, which means the chapter is a lot cheaper. But the full book reports download numbers of above 50 thousand, so it seems not. Now, let's assume that a good part of the bought copies is via package deals and the average payment is half the list price. That may sound high, but we ignore the 50k downloads for the full book to compensate for that.

Doing that math means that our joint book chapter contributed 60k euro to the European market. That's a full job the four of us created with this single book chapter. I'm impressed.
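For those who want to check the arithmetic, here is a minimal sketch in Python using only the numbers quoted above. The variable names are mine, and the "half the list price" factor is the assumption made in the previous paragraph, not a measured value.

```python
# Back-of-the-envelope estimate with the numbers quoted in the post.
downloads = 2390             # "2.39 thousand" chapter downloads
list_price_eur = 50          # approximate price of the chapter
average_paid_fraction = 0.5  # assumption: on average half the list price is paid

revenue_eur = downloads * list_price_eur * average_paid_fraction
print(f"Estimated revenue: {revenue_eur:,.0f} euro")
# prints: Estimated revenue: 59,750 euro
```

That is where the rounded 60k euro comes from.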

EPA CompTox Dashboard IDs in Wikidata

After Antony Williams left the ChemSpider team, he moved on to the EPA. Since then, he has set up the EPA CompTox Dashboard (see also doi:10.1007/s00216-016-0139-z [€]). And in August he was kind enough to upload mappings between InChIKeys (doi:10.1186/s13321-015-0068-4) and their identifiers on Figshare (doi:10.6084/m9.figshare.3578313.v1) as a tab-separated values (TSV) file. Because this database is of interest to our pathway and systems biology work, I realized I wanted ID-ID mappings in our BridgeDb identifier mappings files (doi:10.1186/1471-2105-11-5). As I wrote earlier, I have adopted Wikidata (doi:10.3897/rio.1.e7573) as data source. So, entering these new identifiers in Wikidata is helpful.

Somewhere in the past few months I proposed the needed Wikidata property, P3117 ("DSSTOX substance identifier"), which was approved some time later. For entering the mappings, I have opted to write a Bioclipse script (doi:10.1186/1471-2105-10-397) that uses the Wikidata SPARQL endpoint to get about 150 thousand Wikidata item identifiers (Q-codes) and their InChIKeys. It then parses the lines in the TSV file from Figshare and creates input for Wikidata for each match, based on exact InChIKey string equivalence.

This output is formatted QuickStatements instructions, a great tool set up by Magnus Manske. Each line looks like (here for N6-methyl-deoxy-adenosine-5'-monophosphate, aka Q27456455):

Q27456455 P3117 "DTXSID30678817" S248 Q28061352

The P248 ("stated in") property is used to link the source information as reference (hence: S248), which points to the Q28061352 item for the Figshare entry for Tony's mapping data. The result in this Wikidata item looks like:

I entered about 36 thousand of such statements to Wikidata. Thus, the yield is about 5%, calculating from the CompTox Dashboard as starting point with about 720 thousand identifiers. From a Wikidata perspective, the yield is higher. There are about 150 thousand items with an InChIKey, so that 24% could be mapped.
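The matching step can be sketched as follows. This is not the Bioclipse script itself but a minimal Python illustration of the same idea. It assumes a two-column TSV with the InChIKey first and the DTXSID second (the real Figshare column layout may differ), and the InChIKeys in the usage example are made-up placeholders.

```python
import csv
import io


def quickstatements(qcodes_by_inchikey, tsv_text, source_item="Q28061352"):
    """Yield one QuickStatements line per exact InChIKey match.

    qcodes_by_inchikey: dict mapping InChIKey -> Wikidata Q-code
    tsv_text: TSV content, assumed columns: InChIKey, DTXSID
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) < 2:
            continue  # skip empty or malformed lines
        inchikey, dtxsid = row[0].strip(), row[1].strip()
        qcode = qcodes_by_inchikey.get(inchikey)  # exact string equivalence
        if qcode is not None:
            # tab-separated: item, property, value, reference property, source
            yield f'{qcode}\tP3117\t"{dtxsid}"\tS248\t{source_item}'


# Placeholder InChIKeys, for illustration only.
wikidata = {"AAAAAAAAAAAAAA-BBBBBBBBBB-N": "Q27456455"}
tsv = (
    "AAAAAAAAAAAAAA-BBBBBBBBBB-N\tDTXSID30678817\n"
    "CCCCCCCCCCCCCC-DDDDDDDDDD-N\tDTXSID00000000\n"
)
for line in quickstatements(wikidata, tsv):
    print(line)

# Yields as reported above, from both perspectives.
statements, dashboard_ids, wikidata_items = 36_000, 720_000, 150_000
print(f"{statements / dashboard_ids:.0%} of Dashboard identifiers")
print(f"{statements / wikidata_items:.0%} of Wikidata items with an InChIKey")
```

A header line in the TSV needs no special handling here, because its first column will not match any InChIKey.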

Based on the constraints defined on the property, Wikidata does some automatic validation. For example, it is specified that any Wikidata item can only have one DSSTOX substance identifier, because it can only have one InChIKey too. Similarly, there cannot be two Wikidata items with the same DSSTOX identifier. Normally these hold but, because of how Wikidata works, there can be isolated exceptions. With fewer than 25 constraint violations, the quality of the process turned out pretty high (>99.9%).
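Wikidata runs these checks itself, but the two constraints are simple enough to sketch locally. The following Python is only an illustration of the idea, with hypothetical Q-codes and identifiers. It reports items with more than one identifier and identifiers used on more than one item.

```python
from collections import defaultdict


def constraint_violations(statements):
    """Check (qcode, dtxsid) pairs against two constraints.

    Returns (single_value, unique_value):
      single_value: items that carry more than one identifier
      unique_value: identifiers that occur on more than one item
    """
    ids_per_item = defaultdict(set)
    items_per_id = defaultdict(set)
    for qcode, dtxsid in statements:
        ids_per_item[qcode].add(dtxsid)
        items_per_id[dtxsid].add(qcode)
    single_value = {q: sorted(i) for q, i in ids_per_item.items() if len(i) > 1}
    unique_value = {d: sorted(q) for d, q in items_per_id.items() if len(q) > 1}
    return single_value, unique_value


# Hypothetical statements, for illustration only.
example = [
    ("Q1", "DTXSID-A"),
    ("Q1", "DTXSID-B"),  # Q1 has two identifiers
    ("Q2", "DTXSID-C"),
    ("Q3", "DTXSID-C"),  # DTXSID-C is used on two items
    ("Q4", "DTXSID-D"),
]
single, unique = constraint_violations(example)
print(single)  # {'Q1': ['DTXSID-A', 'DTXSID-B']}
print(unique)  # {'DTXSID-C': ['Q2', 'Q3']}

# Quality estimate with the numbers from the post (25 as an upper bound).
print(f"{1 - 25 / 36_000:.2%}")  # 99.93%
```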

Some of the issues have been manually inspected. Causes vary. One issue was that the Wikidata item in fact had more than one InChIKey. A possible reason for that is that it does not distinguish between various forms of a compound. Two Wikidata items have been split up accordingly. Other problems are due to features of the CompTox Dashboard, and some issues have been tweeted to the Dashboard team.

This mashup of these two resources, as anticipated in our H2020 proposal (doi:10.3897/rio.1.e7573), makes it possible to easily make slices of data. For example, we can query for experimental data for compounds in the EPA CompTox Dashboard with a SPARQL query, for example for the dipole moment:

Importantly, this query shows the source where this data comes from, one of the advantages of Wikidata.
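The query itself is not reproduced here, so the following is only a sketch of what such a query could look like, wrapped in a small Python helper that builds the SPARQL text. P3117 and P248 are the properties discussed above. The property for the measured value is a parameter, and the P2201 used in the example is my assumption for the dipole moment property, so please verify it on Wikidata before relying on it.

```python
def build_query(value_property, limit=100):
    """Build a SPARQL query for values of `value_property` on items that
    have a DSSTOX substance identifier (P3117), including the source
    ("stated in", P248) of each statement when one is given."""
    return f"""
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pr: <http://www.wikidata.org/prop/reference/>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?compound ?dtxsid ?value ?source WHERE {{
  ?compound wdt:P3117 ?dtxsid ;
            p:{value_property} ?statement .
  ?statement ps:{value_property} ?value .
  OPTIONAL {{ ?statement prov:wasDerivedFrom/pr:P248 ?source . }}
}}
LIMIT {limit}
""".strip()


# P2201 is an assumed property ID for the dipole moment.
print(build_query("P2201"))
```

Going through the full statement (p: and ps:) rather than the direct wdt: shortcut is what gives access to the reference, and thus to the source of each value.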

OpenTox Euro 2016: "Data integration with identifiers and ontologies"

Results from a project by MSP students.
J. Windsor et al. (2016): Volatile Organic Compounds: A Detailed Account of Identity, Origin, Activity and Pathways. Figshare.
A few weeks ago the OpenTox Euro 2016 meeting was held in Rheinfelden at the German/Swiss border (which allowed me a nice stroll across the Rhine into Switzerland and by a nice x-mas countdown clock). The meeting was co-located with eNanoMapper-hosted meetings, where we discussed, among other things, the nanoinformatics roadmaps that outline where research in this area should go.

There were many interesting talks, around various data initiatives, adverse outcome pathways (AOPs) and their links to molecular initiating events (MIEs), and ontologies (like the AOP ontology talk). In fact, I quite enjoyed the discussion with Chris Grulke about ontologies during the panel discussion. The central question was where the border lies between data and ontological concepts. Some slides are available via Lanyrd.

During the Emerging Methods and Practice session hosted by Ola Spjuth, I presented the work at the BiGCaT department on identifier mapping and the use of ontologies for linking data sets.

Data integration with identifiers and ontologies from Egon Willighagen

The presentation integrates a lot of things I have been working on in the last few years, and please note the second slide with all people I have worked with on things presented in these slides.

New paper: "SPLASH, a hashed identifier for mass spectra"

I'm excited to have contributed to this important (IMHO) interoperability paper around metabolomics data: "SPLASH, a hashed identifier for mass spectra" (doi:10.1038/nbt.3689, readcube:msZj). A huge thanks to all involved in the great collaborative project! The source code project is fully open source and coordinated by Gert Wohlgemuth, the lead author on this paper. It provides an implementation of the algorithm in various programming languages and I'm happy that the splash functionality is available in the just released Bioclipse 2.6.2 (taking advantage of the Java library). An R package by Steffen Neumann is also available.

This new identifier greatly simplifies linking between spectral databases and will in the end contribute to a Linked Data network. Furthermore, journals can start adopting this identifier and list the 'splash' for mass spectra in documents, allowing for simplified dereplication and finding additional information around spectra.

There are several databases that have adopted the SPLASH already, such as MassBank, HMDB, MetaboLights, and the OSDB published in JCheminf recently (doi:10.1186/s13321-016-0170-2).

Screenshot snippet of a spectrum in the OSDB.

PS. I personally don't like the idea of ReadCubes (which I may blog about at some point) and how they have been pitched as a "legal" way of sharing papers, but this journal does not have a gold Open Access option, unfortunately.

Wohlgemuth, G., Mehta, S. S., Mejia, R. F., Neumann, S., Pedrosa, D., Pluskal, T., Schymanski, E. L., Willighagen, E. L., Wilson, M., Wishart, D. S., Arita, M., Dorrestein, P. C., Bandeira, N., Wang, M., Schulze, T., Salek, R. M., Steinbeck, C., Nainala, V. C., Mistrik, R., Nishioka, T., Fiehn, O., Nov. 2016. SPLASH, a hashed identifier for mass spectra. Nature Biotechnology 34 (11), 1099-1101.

Comparing sets of identifiers: the Bioclipse implementation

Source: Wikipedia
The problem
That sounds easy: take two collections of identifiers, put them in sets, determine the intersection, done. Sadly, each collection uses identifiers from different databases. Worse, within one set there are identifiers from multiple databases. Mind you, I'm not going full monty, though some chemistry will be involved at some point. Instead, this post is really based on identifiers.

The example
Data set 1:

Data set 2: all metabolites from WikiPathways. This set has many different data sources, and seven provide more than 100 unique identifiers. The full list of metabolite identifiers is here.

The goal
Determine the intersection of two collections of identifiers from arbitrary databases, ultimately using scientific lenses. I will develop at least two solutions: one based on Bioclipse (this post) and one based on R (later).

First of all, we need something that links IDs in the first place. Not surprisingly, I will be using BridgeDb (doi:10.1186/1471-2105-11-5) for that, but for small molecules alternatives exist, like the Open PHACTS IMS based on BridgeDb, the Chemical Translation Service (doi:10.1093/bioinformatics/btq476) or UniChem (doi:10.1186/s13321-014-0043-5, doi:10.1186/1758-2946-5-3).

The Bioclipse implementation
The first thing we need to do is read the files. I have them saved as CSV even though they are tab-separated files. Bioclipse will now open them in its matrix editor (yes, I think .tsv needs to be linked to that editor, which does not seem to be the case yet). Reading the human metabolites from WikiPathways is done with this code (using Groovy as scripting language):

file1 = new File(
  // bioclipse.fullPath() resolves the workspace path, as in the second script
  bioclipse.fullPath(
    "/Compare Identifiers/human_metabolite_identifiers.csv"
  )
)
set1 = new java.util.HashSet();
file1.eachLine { line ->
  fields = line.split(/\t/)
  def syscode;
  def id;
  if (fields.size() >= 2) {
    (syscode, id) = fields
    if (syscode != "syscode") { // ok, not the first line
      // the set only keeps unique Xref objects
      set1.add(bridgedb.xref(id, syscode))
    }
  }
}

You can see that I am using the BridgeDb functionality already, to create Xref objects. The code skips the first line (or any line with "column headers"). The BridgeDb Xref object's equals() method ensures I only have unique cross references in the resulting set.

Reading the other identifier set is a bit trickier. First, I manually changed the second column, to use the BridgeDb system codes. The list is short, and this saves me from making mappings in the source code. One thing I decided to do in the source code is normalize the ChEBI identifiers (something that many of you will recognize):

file2 = new File(
  bioclipse.fullPath("/Compare Identifiers/set.csv")
)
set2 = new java.util.HashSet();
file2.eachLine { line ->
  fields = line.split(/\t/)
  def name;
  def syscode;
  def id;
  if (fields.size() >= 3) {
    (name, syscode, id) = fields
    if (syscode != "syscode") { // ok, not the first line
      if (syscode == "Ce") {
        // normalize ChEBI identifiers to the CHEBI:12345 form
        if (!id.startsWith("CHEBI:")) {
          id = "CHEBI:" + id
        }
      }
      set2.add(bridgedb.xref(id, syscode))
    }
  }
}

Then, the naive approach that does not take into account identifier equivalence makes it easy to list the number of identifiers in both sets:

intersection = new java.util.HashSet();
intersection.addAll(set1);
intersection.retainAll(set2); // keep only what is also in set2

println "set1: " + set1.size()
println "set2: " + set2.size()
println "intersection: " + intersection.size()

This reports:

set1: 2584
set2: 6
intersection: 3

With the following identifiers in common:

[Ce:CHEBI:30089, Ce:CHEBI:15904, Ca:25513-46-6]

Of course, we want to use the identifier mapping itself. So, we first compare identifiers directly and, if they do not match, use BridgeDb and a metabolite identifier mapping database (get one here):

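mbMapper = bridgedb.loadRelationalDatabase(
  // path to the mapping database; truncated in the original listing
)

intersection = new java.util.HashSet();
for (id2 in set2) {
  if (set1.contains(id2)) {
    // OK, direct match
    intersection.add(id2)
  } else {
    // assumed call: the original line is truncated
    mappings = bridgedb.map(mbMapper, id2)
    for (mapped in mappings) {
      if (set1.contains(mapped)) {
        // OK, match via a mapped identifier
        intersection.add(id2)
      }
    }
  }
}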

This gives five matches:

[Ch:HMDB00042, Cs:5775, Ce:CHEBI:15904, Ca:25513-46-6, Ce:CHEBI:30089]

The only metabolite it did not find in any pathway is the KEGG identified metabolite, homocystine. I just added this compound to Wikidata. That means that in the next metabolite mapping database, it will recognize this compound too.
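To make the logic independent of Bioclipse, here is a minimal Python sketch of the same two-step comparison: first a direct match, then a match via mapped identifiers. This is not BridgeDb. The mapping table is a hypothetical in-memory dictionary standing in for the mapping database, and the identifiers are toy values.

```python
def normalize(syscode, identifier):
    """Return a (syscode, identifier) pair, with ChEBI IDs in CHEBI:n form."""
    if syscode == "Ce" and not identifier.startswith("CHEBI:"):
        identifier = "CHEBI:" + identifier
    return (syscode, identifier)


def intersect(set1, set2, mappings):
    """Return the entries of set2 found in set1, directly or via a mapping.

    mappings: dict from a cross reference to the set of equivalent
    cross references (a stand-in for an identifier mapping database).
    """
    matched = set()
    for xref in set2:
        if xref in set1:
            matched.add(xref)  # direct match
        elif any(m in set1 for m in mappings.get(xref, ())):
            matched.add(xref)  # match via a mapped identifier
    return matched


# Toy data, for illustration only.
set1 = {normalize("Ce", "100"), normalize("Ch", "HMDB-X")}
set2 = {normalize("Ce", "CHEBI:100"), normalize("Cs", "1"), normalize("Ck", "C-Z")}
toy_mappings = {("Cs", "1"): {("Ch", "HMDB-X")}}

print(sorted(intersect(set1, set2, {})))            # naive: direct matches only
print(sorted(intersect(set1, set2, toy_mappings)))  # with identifier mapping
```

As in the Bioclipse version, the result is reported in terms of the identifiers of the second set, and an identifier without any known mapping simply stays unmatched.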

The R and JavaScript implementations
I will soon write up the R version in a follow up post (but got to finish grading student reports first).

Splitting up Bioclipse Groovy scripts

Source: Wikipedia, CC-BY-SA 3.0
... without writing additional script managers (see doi:10.1186/1471-2105-10-397). That was what I was after. I found that by using evaluate() you can load additional code. The only requirements are that you wrap stuff in a class, and that the filename matches the class name. So, you put stuff in a class SomeName and save that in a Bioclipse project (e.g. SomeProject/) with the name SomeName.groovy.

That is, I have this set up:


Then, in this aScript.groovy you can include the following code to load that class and make use of the content:

  someClass = evaluate(
    new File(

Maybe there are even better ways, but this works for me. I tried the regular Groovy way of instantiating a class defined like this, but because the Bioclipse Groovy environment does not have a working directory, I could not get that to work.

Migrating pKa data from DrugMet to Wikidata

In 2010 Samuel Lampa and I started a pet project, collecting pKa data: he was working on an RDF extension of MediaWiki and I like consuming RDF data. We started DrugMet. When you read this post, this MediaWiki installation may already be down, which is why I am migrating the data to Wikidata. Why? Because data curation takes effort, I like to play with Wikidata (see this H2020 proposal by Daniel Mietchen et al.), I like Open Data, and it is still much needed.

We opted for a page with the minimal amount of information, to maximize the speed at which we could add information. However, when it came to semantics, we tried to be as explicit as possible and, e.g., use the CHEMINF ontology. So, it collected:
  1. InChIKey (used to show images)
  2. the paper it was collected from (identified by a DOI)
  3. the value, and where possible, the experimental error
A page typically looks something like this:

While not used on all pages, at some point I even started using templates, and I used these two, for molecules and papers:



These templates, as well as the above screenshot, already contain a spoiler, but more about that later. Using MediaWiki functionality it was now easy to make lists, e.g. for all pKa data (more spoilers):

I find a database like this very important. It does not capture all the information it should be capturing, though, as is clear from the proposal some of us worked on a while back. However, this project got put on hold; I don't have time for it anymore, and it is not core enough to our department to spend time on writing grant proposals for it.

But I still do not want this data to get lost. Wikidata is something I have started using, as it is a machine readable CCZero database with an increasing amount of scientific knowledge. More and more people are working on it, and you must absolutely read this paper about this very topic (by a great team you should track, anyway). I am using it myself as source of identifier mappings and more. So, migrating the previously collected data to Wikidata makes perfect sense to me:

  1. if a compound is missing, I can easily create a new one using Bioclipse
  2. if a paper is missing, I can easily create a new one using Magnus Manske's QuickStatements
  3. Wikidata has a pretty decent provenance model
I can annotate data with the data source (paper) it came from and also experimental conditions:

In fact, you'll note that the book is a separate Wikidata entry in itself. Better even, it's an 'edition' of the book. This is the whole point we make in the above linked H2020 proposal: Wikidata is not a database specific for one domain, it works for any (scholarly) domain, and seamlessly links all those domains.

Now, to keep track of what data I have migrated, I am annotating DrugMet entries with links to Wikidata: everything with a Wikidata Q-code is already migrated. The above pKa table already shows Q-identifiers, but I also created them for all data sources I have used (three of them are two books and one old paper without a DOI):

I still have quite a number of entries to do, but all the protocols are set up now.

On the downstream side, Wikidata is also great because of its SPARQL endpoint. Something that I did not get worked out some weeks ago, I did manage yesterday (after some encouragement from @arthursmith): list all pKa statements, including the literature source if available:

If you run that query on the Wikidata endpoint, you get a table like this:

We here see experimental data from two papers: 10.1021/ja01489a008 and 10.1021/ed050p510. This can all be displayed a lot fancier, like make histograms, tables with 2D drawings of the chemical structures, etc, but I leave that to the reader.