CC-BY with the ACS Author Choice: CDK and Blue Obelisk papers liberated

Screenshot of an old CDK-based JChemPaint, from the first CDK paper. CC-BY :)
Already a while ago, the American Chemical Society (ACS) decided to allow the Creative Commons Attribution license (version 4.0) to be used on their papers, via their Author Choice program. ACS members pay $1500, which is low for a traditional publisher. While I would rather see them move to a gold Open Access journal, it is a very welcome option! For the ACS business model it means a guaranteed sale of some 40 copies of this paper (at about $35 each), because it will not immediately affect the sale of the full journal (much). Some papers may have sold more than that had they remained closed access, but for many papers this sounds like a smart move money-wise. Of course, they also buy themselves some goodwill, and green Open Access is just around the corner anyway.

Better, perhaps, is that you can also use this option to make a past paper Open Access under a CC-BY license! And that is exactly what Christoph Steinbeck did with five of his papers, including two on which I am co-author. And these are not the least papers either. The first is the first CDK paper from 2003 (doi:10.1021/ci025584y), which featured a screenshot of JChemPaint shown above. Note that in those days, the print journal was still the target, so the screenshot is in gray scale :) BTW, given that this paper is cited 329 times (according to ImpactStory), maybe the ACS could have sold more than 40 copies. But for me, it means that finally people can read this paper about Open Science in chemistry, even after so many years. BTW, there is little chance the second CDK paper will be freed in a similar way.

The second paper that was liberated this way, is the first Blue Obelisk paper (doi:10.1021/ci050400b), which was cited 276 times (see ImpactStory):

This screenshot nicely shows how readers can see the CC-BY license for this paper. Note that it also lists that the copyright is with the ACS, which is correct, because in those days you commonly gave away your copyright to the publisher (I have stopped doing this, bar some unfortunate recent exceptions).

So, head over to your email client and email and let them know you also want your JCICS/JCIM paper available under a CC-BY license! There is no excuse anymore for not making your seminal work in cheminformatics available as gold Open Access!

Of course, submitting your new work to the Journal of Cheminformatics is cheaper and has the advantage that all papers are Open Access!

Postdoc in Bioinformatics for Anti-Obesity Strategy

Courtesy of Frank Genten

The Steinbeck Group at the European Bioinformatics Institute (EMBL-EBI, Cambridge, UK), together with the Lab of Tony Vidal-Puig (U. Cambridge/WT Sanger), are excited to announce a joint opening for a postdoc position to work on a multi-omics project to identify new players in human brown and beige adipocyte recruitment as an anti-obesity strategy. The project involves analysis of multi-omics data sets (metabolomics, transcriptomics, proteomics, among others) and integration of that data into different mathematical modelling frameworks (discrete logical models, kinetic ODE-based, FBA). With these models and data, the fellow will identify novel pharmaceutical strategies to induce BAT generation/WAT browning. The models will be used to evaluate in silico the potential effect of drugs on adipocytes. Finally, the best candidate molecules will be applied to the human pluripotent stem cell models to confirm their capacity to induce brown/beige adipogenesis in vitro. Experiments will be performed with the support of experts at WTSI.

Experience in applying mathematical modelling techniques is desirable, as well as previous exposure to large data sets. Of course, the candidate is not expected to have expertise in all of the areas listed above, as training will be given on the parts where the applicant has less experience.

The EMBL-EBI is part of the European Molecular Biology Laboratory (EMBL) and it is a world-leading bioinformatics centre providing biological data to the scientific community with expertise in data storage, analysis and representation. EMBL-EBI provides freely available data from life science experiments, performs basic research in computational biology and offers an extensive user training programme, supporting researchers in the academic and industrial sectors.

EMBL-EBI and Wellcome Trust Sanger Institute share the Wellcome Genome Campus. This proximity fosters close collaborations and contributes to an international and vibrant campus environment. Researchers are supported by easy access to scientific expertise, well-equipped facilities and an active seminar programme.

The EMBL-EBI–Sanger Postdoctoral (ESPOD) Programme builds on the strong collaborative relationship between the two institutes, offering projects which combine experimental (wet lab) and computational approaches.

Please apply here:

I have absolutely no clue why this paper is citing the CDK...

Corrosion behaviors and effects of corrosion products of plasma electrolytic oxidation coated AZ31 magnesium alloy under the salt spray corrosion test

Re: How should we add citations inside software?

Practice is that many cite webpages for the software, and sometimes even just list the name. I do not understand why scholars do not en masse look up the research papers that are associated with the software. As a reviewer of research papers I often have to advise authors to revise their manuscript accordingly, but I think this is something that should be caught by the journal itself. Fact is, not all reviewers seem to check this.

In some future, if publishers also take this seriously, we will have citation metrics for software like we have for research papers and increasingly for data (see also this brief idea). You can support this by assigning DOIs to software releases, e.g. using ZENODO. This list on our research group's webpage shows some of the software releases:

My advice for citing software thus goes a bit beyond what is traditionally requested of authors:

  1. cite the journal article(s) for the software that you use
  2. cite the specific software release version using ZENODO (or compatible) DOIs

 This tweet gives some advice about citing software, triggering this blog post:
Citations inside software
Daniel Katz took it a step further and asked how we should add citations inside software. After all, software reuses knowledge too and stands on algorithmic shoulders, and there can be a lot of those. This is something I can relate to: if you write a cheminformatics software library, you use a ton of algorithms, all of which are written up somewhere. Joerg Wegner did this too in his JOELib, and we adopted this idea for the Chemistry Development Kit.

So, the output looks something like:

(Yes, I spot the missing page information. But rather than missing information, it's more that this was an online only journal, and the renderer cannot handle it well. BTW, here you can find this paper; it was my first first author paper.)

However, at a Java source code level it looks quite different:

The build process is taking advantage of the JavaDoc taglet API and uses a BibTeXML file with the literature details. The taglet renders it to full HTML as we saw above.

Bioclipse does not use this in the source code, but does have the equivalent of a CITATION file: the managers, that extend the Python, JavaScript, and Groovy scripting environments with domain specific functionality (well, read the paper!). You can ask in any of these scripting languages about citation information:

    > doi bridgedb

This will open the webpage of the cited article (which sometimes opens in Bioclipse, sometimes in an external browser, depending on how it is configured).

At a source code level, this looks like:

So, here are my few cents. Software citation is important!

The quality of SMILES strings in Wikidata

Russian Wikipedia on tungsten hexacarbonyl.
One thing that machine readability adds is all sorts of machine processing. Validation of data consistency is one. For SMILES strings, one of the things you can do is test if the string parses at all. Wikidata is machine readable, and, in fact, easier to parse than Wikipedia, for which the SMILES strings were validated recently in a J. Cheminformatics paper by Ertl et al. (doi:10.1186/s13321-015-0061-y).

Because I was wondering about the quality of the SMILES strings (and because people ask me about these things), I made some time today to run a test:
  1. SPARQL for all SMILES strings
  2. process each one of them with the CDK SMILES parser
I can do both easily in Bioclipse with an integrated script:

identifier = "P233" // SMILES
type = "smiles"

// fetch every compound that has a SMILES (P233) statement
sparql = """
PREFIX wdt: <>
SELECT ?compound ?smiles WHERE {
  ?compound wdt:P233 ?smiles .
}
"""
mappings = rdf.sparqlRemote("", sparql)

outFilename = "/Wikidata/badWikidataSMILES.txt"
if (ui.fileExists(outFilename)) ui.remove(outFilename)
fileContent = ""
for (i=1; i<=mappings.rowCount; i++) {
  try {
    wdID = mappings.get(i, "compound")
    smiles = mappings.get(i, "smiles")
    mol = cdk.fromSMILES(smiles)
  } catch (Throwable exception) {
    // keep the Wikidata item, the SMILES, and the parser message
    fileContent += (wdID + "," + smiles + ": " +
                    exception.message + "\n")
  }
  if (i % 1000 == 0) js.say("" + i) // progress report
}
ui.append(outFilename, fileContent)
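
For readers without Bioclipse, the same two steps can be sketched in plain Python. This is only a sketch under assumptions: the endpoint URL and the wdt: prefix below are the public Wikidata query service values as I know them (they are not taken from the script above), the function names are made up for illustration, and the parse argument stands in for whatever SMILES parser you have at hand (the CDK one in my case).

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata query service, not taken from the post.
ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: the usual "direct property" prefix for Wikidata.
QUERY = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?compound ?smiles WHERE {
  ?compound wdt:P233 ?smiles .
}
"""


def fetch_json(query, endpoint=ENDPOINT):
    """Run a SPARQL query and return the JSON results as text."""
    url = endpoint + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "smiles-check-sketch/0.1"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def rows(json_text):
    """Yield (compound, smiles) pairs from SPARQL JSON results."""
    for binding in json.loads(json_text)["results"]["bindings"]:
        yield binding["compound"]["value"], binding["smiles"]["value"]


def failures(pairs, parse):
    """Return one 'item,smiles: message' line per SMILES that parse() rejects."""
    bad = []
    for compound, smiles in pairs:
        try:
            parse(smiles)
        except Exception as exception:
            bad.append(f"{compound},{smiles}: {exception}")
    return bad
```

Calling failures(rows(fetch_json(QUERY)), parse) with a real parser would give the same kind of list as the Bioclipse script writes to its output file.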

It turns out that out of the more than 16 thousand SMILES strings in Wikidata, only 42 could not be parsed. That does not mean the others are correct, but it does mean these 42 are wrong. Many of them turned out to be imported from the Russian Wikipedia, which is nice, as it gives me the opportunity to work in that Wikipedia instance too :)

At this moment, some 19 SMILES still need fixing (the list will change over time, so by the time you read this...):

SWAT4LS in Cambridge

Wordle of the #swat4ls tweets.
Last week the BiGCaT team was present with three people (Linda, Ryan, and me) at the Semantic Web Applications and Tools 4 Life Sciences meeting in Cambridge (#swat4ls). It's a great meeting, particularly because of the workshops and hackathon. Previously, I attended the meeting in Amsterdam (gave this presentation) and Paris (which I apparently did not blog about).

I have mixed feelings about missing half of the workshops on Monday for a visit of one of our Open PHACTS partners, but do not regret that meeting at all; I just wish I could have done both. During the visit we spoke particularly about WikiPathways and our collaboration in this area.

The Monday morning workshops were cool. First, Evan Bolton and Gang Fu gave an overview of their PubChemRDF work. I have been involved in that in the past, and I greatly enjoyed seeing the progress they have made, and a rich overview of the 250GB of data they make available on their FTP site (sorry, the rights info has not gotten any clearer over the years, but it is generally considered "open"). The RDF now covers, for example, the biosystems module too, so that I can query PubChem for all compounds in WikiPathways (and compare that against internal efforts).

The second workshop I attended was by Andra and others about Wikidata. The room, about 50 people, all started editing Wikidata, in exchange for a chocolate letter:

The editing was about the prevalence of two diseases. Both topics continued during the hackathon, see below. Slides of this presentation are online. But I missed the DisGeNET workshop, unfortunately :(

The conference itself (in the new part of Clare College, even the conference dinner) started on the second day, and all presentations are backed by a paper, linked from the program. Not having attended a semantic web conference in the past 2-ish years, it was nice to see the progress in the field. Some papers I found interesting:
But the rest is well worth checking out too! I was able to get Webulous going for eNanoMapper with some help (I was not paying enough attention to the GUI):

A Google Spreadsheet where I restricted the content of a set of cells to only subclasses of the "nanomaterial" class in the eNanoMapper ontology (see doi:10.1186/s13326-015-0005-5).
The conference ended with a panel discussion, and despite the efforts of me and the other panel members (Frank Gibson – Royal Society of Chemistry, Harold Solbrig – Mayo Clinic, Jun Zhao – University of Oxford), it took long before the conference audience really started joining in. Partly this was because the conference organization asked the community for questions, and the questions clearly did not resonate with the audience. It was not until we started discussing publishing that it became more lively. My point there was that I believe the semantic web applications and tools are not really a rate-limiting factor anymore, and if we really want to make a difference, we really must start changing the publishing industry. This has been said by me and others for many years already, but the pace at which things change is too low. Someone mentioned a chicken-and-egg situation, but I really believe it is all just a choice we make, with an easy solution: pick up a knife, kill the chicken, and have a nice dinner. It is annoying to see all the great efforts at this conference, while much of it is limited because our writing style makes nice stories and yields few machine-readable facts.

The hackathon was held at the EBI in Hinxton (south/elixir building) and during the meeting I had a hard time deciding what to hack on: there just were too many interesting technologies to work on, but I ended up working on PubChem/HDT (long) and Wikidata (short). The timings are based on the amount of help I needed to bootstrap things and how much I can figure out at home (which is a lot for Wikidata).

HDT (header, dictionary, triple) is a not-so-new-but-under-the-radar technology for storing triples in binary form in a file-based store. The specification outlines this binary format as well as the index. That means that you can share triple data compressed and indexed, which opens up new possibilities. One thing I am interested in is using this approach for sharing link sets (doi:10.1007/978-3-319-11964-9_7) for BridgeDb, our identifier mapping platform. But there is much more, of course: share life science databases on your laptop.
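
To get a feel for why a dictionary plus triples compresses well, here is a toy illustration in Python. It is emphatically not the HDT binary format (the real specification, for instance, organizes its dictionary differently and packs the triples much more compactly); it only shows the general idea that every term is stored once and the triples themselves become small sorted integer tuples. The function names and the ex: terms are made up for this sketch.

```python
def encode(triples):
    """Split triples into a term dictionary and sorted integer ID triples."""
    # every distinct term is stored exactly once, in sorted order
    terms = sorted({term for triple in triples for term in triple})
    ids = {term: number for number, term in enumerate(terms, start=1)}
    # the triples themselves shrink to tuples of small integers
    encoded = sorted((ids[s], ids[p], ids[o]) for s, p, o in triples)
    return terms, encoded


def decode(terms, encoded):
    """Turn integer ID triples back into the original terms."""
    return [(terms[s - 1], terms[p - 1], terms[o - 1]) for s, p, o in encoded]


def about(terms, encoded, subject):
    """Return all (predicate, object) pairs for one subject."""
    if subject not in terms:
        return []
    subject_id = terms.index(subject) + 1
    return [(terms[p - 1], terms[o - 1])
            for s, p, o in encoded if s == subject_id]
```

The about() lookup mirrors the typical decision support query: everything known about one particular compound, not about all compounds.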

This hack was triggered by a beer with Evan Bolton and Arto Bendiken. Yes, there is a Java library, hdt-java, and for me the easiest way to work out how to use a Java API is to write a Bioclipse plugin. Writing the plugin is trivial (though setting up a Bioclipse development environment is less so): the New Wizard does the hard work in seconds. But then the dependency hacking started. The Jena version it depended on is incompatible with the version in Bioclipse right now, but that is not a big deal for Eclipse, and the outcome is that we have both versions on the classpath :) That, however, did require me to introduce a new plugin, net.bioclipse.rdf.core with the IRDFStore, something I had wanted to do for a long time, because that is also needed if one wants to use Sesame/OpenRDF instead of Jena.

So, after lunch I was done with the code cleanup, and I got to the HDT manager again. Soon, I could open a HDT file. I first had the API method to read it into memory, but that's not what I wanted, because I want to open large HDT files. Because it uses Jena, it conveniently provides a Jena Model object, so adding SPARQL-ing support was easy; I cannot use the old SPARQL-ing code, because then I would start mixing Jena versions, but since all is Open Source, I just copied/pasted the code (which is written by me in the first place, doi:10.1186/2041-1480-2-s1-s6, interestingly, work that originates from my previous SWAT4LS talk :). Then, I could do this:
It is file based, which is different from a full triple store server. So, questions arise about performance. Creating an index takes time and memory (1GB of heap space, for example). However, the index file can be shared (downloaded) and then a HDT file "opens" in a second in Bioclipse. Of course, the opening does not do anything special, like loading into memory, and should be compared to connecting to a relational database. The querying is what takes the time. Here are some numbers for the Wiktionary data that the RDFHDT team provides as example data set:
However, I am not entirely sure what to compare this against. I will have to experiment with, for example, ChEMBL-RDF (maybe update the Uppsala version, see doi:10.1186/1758-2946-5-23). The advantage would be that ChEMBL data could easily be distributed along with Bioclipse to service the decision support features, because the typical query asks for data for a particular compound, not all compounds. If that works in less than 0.1 seconds, then this may give a nice user experience.

But before I reach that, it needs a bit more hacking:
  1. take the approach I took with BridgeDb mapping databases for sharing HDT files (which has the advantage that you get a decent automatic updating system, etc)
  2. ensure I can query over more than one HDT file
And probably a bit more.

Wikidata and WikiPathways
After the coffee break I joined the Wikidata people, and sat down to learn about the bots. However, Andra wanted to finish something else first, where I could help out. Considering I will probably manage to hack up a bot anyway, we worked on the following. Multiple databases about genes, proteins, and metabolites like to link these biological entities to pathways in WikiPathways (doi:10.1093/nar/gkv1024). Of course, we love to collaborate with all the projects that integrate WikiPathways into their systems, but I would personally rather use a solution that services all needs. If only because then people can do this integration without needing our time. Of course, this is an idea we pitched about a year ago in the Enabling Open Science: WikiData for Research proposal (doi:10.5281/zenodo.13906).

That is, would it not be nice if people could just pull the links between the biological entities and WikiPathways from Wikidata, using one of the many APIs they have (SPARQL, REST), supporting multiple formats (XML, JSON, RDF)? I think so, as you might have guessed. So does Andra, and he asked me if I could start the discussions in the Wikidata community, which I happily did. I'm not sure about the outcome, because having links like these is not of prime interest to them. They did not much like the idea of links to the Crystallography Open Database yet, with the argument that it is a one-to-many relation, though this is exactly what the PDB identifier is too, and that is accepted. So, it's a matter of notability again. But this is what the current proposal looks like:

Let's see how the discussion unfolds. Please feel free to chime in and show your support, comments, questions, or opposition, so that we can together get this right.

Chemistry Development Kit
There is undoubtedly a lot more, but I have been summarizing the meeting for about three hours now, getting notes together etc. A last thing I want to mention now is the CDK. Cheminformatics is, after all, a critical feature of life science data, and I spoke with a few people about the CDK. And I visited NextMove Software on Friday, where John May works nowadays, who did a lot of work on the CDK recently (we also spoke about WikiPathways and eNanoMapper). NextMove is doing great stuff (thanks for the invitation), and so did John during his PhD in Chris Steinbeck's group at the EBI. But during the conference I also spoke with others about the CDK, and I am following up on these conversations.

Bringing Molfile Sgroups to the CDK - Rendering Tips

In the last but one post I gave a demonstration of S(ubstructure)group rendering in the CDK. Now I want to give some implementation insights.

Abbreviations (Superatoms)

Abbreviations contract part of a structure to a line formula or common abbreviation.


Abbreviating too much or using unfamiliar terms (e.g. maybe using CAR for carbazole) can make a depiction worse. However some structures, like CHEMBL590010, can be dramatically improved.


One way to implement abbreviations would be by modifying the molecule data structure with literal collapse/contract and expand operations. Whilst this approach is perfectly reasonable, deleting atoms/bonds is expensive (in most toolkits) and it somewhat detracts from the "display shortcut" nature of this Sgroup.

For efficiency, abbreviations are implemented by hiding parts of the depiction and remapping symbols. Just before rendering we iterate over the Sgroups and set hint flags that these atoms/bonds should not be included in the final image. If there is one attachment (e.g. Phenyl) we remap the attach point symbol text to the abbreviation label ('C'->'Ph'). When there are no attachments (e.g. AlCl3) we add a new symbol at the centroid of the hidden atoms.

Panels: hide atoms and bonds; symbol remap; abbreviated result.

For two or more attachments (e.g. SO2) you also need coordinate remapping.
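
A minimal sketch of the hide-and-remap bookkeeping described above, written in Python so it stands alone (the CDK itself is Java, and none of these names come from it). It assumes atoms are plain indices with 2D coordinates, bonds are index pairs, and the Sgroup is a set of atom indices. An attachment is taken to be a bond with exactly one end inside the Sgroup, and the case with two or more attachments is deliberately left out because it also needs coordinate remapping.

```python
def abbreviate(coords, bonds, group, label):
    """Work out what to hide and relabel for one abbreviation Sgroup.

    coords: dict of atom index -> (x, y)
    bonds:  list of (atom, atom) index pairs
    group:  set of atom indices in the Sgroup
    Returns (hidden_atoms, hidden_bonds, remapped_symbols, extra_symbols).
    """
    # attachments are bonds with exactly one end inside the Sgroup
    crossing = [(a, b) for a, b in bonds if (a in group) != (b in group)]
    # bonds fully inside the Sgroup are never drawn
    internal = {(a, b) for a, b in bonds if a in group and b in group}

    if len(crossing) == 1:
        a, b = crossing[0]
        attach = a if a in group else b
        # keep the attach point visible, but show the label instead ('C'->'Ph')
        return group - {attach}, internal, {attach: label}, []

    if not crossing:
        # no attachment: hide everything, put the label at the centroid
        cx = sum(coords[i][0] for i in group) / len(group)
        cy = sum(coords[i][1] for i in group) / len(group)
        return set(group), internal, {}, [((cx, cy), label)]

    raise NotImplementedError("two or more attachments need coordinate remapping")
```

Nothing is deleted from the molecule here: the function only returns sets of things the renderer should skip, which matches the "display shortcut" nature of the Sgroup.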

Multiple Group

Multiple groups allow contraction of a discrete number of repeating units. They are handled similarly to the abbreviations, except we don't need to remap parts.


All atoms are present in the data structure but are laid out on top of each other (demonstrated below). We have a list of parent atoms that form the repeat unit. Therefore to display multiple groups we hide all atoms and bonds in the Sgroup except for parent atoms and the crossing bonds.
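
The hiding rule for multiple groups can be sketched like this in Python. Again this is a stand-alone illustration with made-up names, not CDK code. One assumption goes beyond the text: bonds between two parent atoms are kept visible as well, since they are what draws the repeat unit itself.

```python
def hide_for_multiple_group(bonds, group, parents):
    """Return (hidden_atoms, hidden_bonds) for one multiple group Sgroup.

    bonds:   list of (atom, atom) index pairs
    group:   set of all atom indices in the Sgroup (every repeat)
    parents: subset of group that forms the displayed repeat unit
    """
    hidden_atoms = group - parents
    hidden_bonds = set()
    for a, b in bonds:
        ends_inside = (a in group) + (b in group)
        if ends_inside < 2:
            # outside the Sgroup, or a crossing bond: stays visible
            continue
        if a in parents and b in parents:
            # assumption: bonds within the repeat unit stay visible
            continue
        hidden_bonds.add((a, b))
    return hidden_atoms, hidden_bonds
```

Because the hidden copies are laid out on top of the parent atoms, a crossing bond that ends on a hidden atom still draws from the right position.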

It's worth mentioning that hidden symbols are still generated but simply excluded from the final result. This allows bond back off for hetero atoms to be calculated correctly as is seen in this somewhat tangled example:


Polymer and Multiple group Sgroups require rendering of brackets. Encoded in the molfile (and when laid out) brackets are described by two points, a line. It is therefore up to the renderer to decide which side of the line the tick marks should face.

I've seen some implementations use the order of the points to convey bracket direction. Another method would be to point the brackets at each other. As shown for CHEBI:59342 this is not always correct.

Panels: poor bracket direction; preferred bracket direction.

I originally thought the solution might involve a graph traversal/flood-fill, but it turns out there is a very easy way to get the direction correct. First we consider that brackets may or may not be placed on bonds; if a bracket is on a bond, this information is available (crossing bonds).

  • For a bracket on a crossing bond exactly one end atom will be contained in the Sgroup, the bracket should point towards this atom.
  • If a bracket doesn't cross a bond then the direction should point to the centroid of all atoms in the Sgroup.
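
The two rules above can be sketched in a few lines of Python. This is my own stand-alone illustration with hypothetical names, not the CDK implementation: the bracket is the line between two points, and the tick marks face along whichever normal of that line points towards the target (the Sgroup end atom of the crossing bond, or else the centroid of the Sgroup atoms). It assumes the two bracket points are distinct.

```python
import math


def bracket_direction(p1, p2, coords, group, crossing_bond=None):
    """Return the unit vector the bracket's tick marks should face.

    p1, p2:        the two points describing the bracket line
    coords:        dict of atom index -> (x, y)
    group:         set of atom indices in the Sgroup
    crossing_bond: (atom, atom) pair the bracket sits on, or None
    """
    if crossing_bond is not None:
        # exactly one end of a crossing bond is inside the Sgroup
        a, b = crossing_bond
        target = coords[a if a in group else b]
    else:
        # no bond crossed: aim at the centroid of the Sgroup atoms
        target = (sum(coords[i][0] for i in group) / len(group),
                  sum(coords[i][1] for i in group) / len(group))

    mid = ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2)
    # one of the two normals of the bracket line
    nx, ny = -(p2[1] - p1[1]), p2[0] - p1[0]
    # flip it if it points away from the target
    if nx * (target[0] - mid[0]) + ny * (target[1] - mid[1]) < 0:
        nx, ny = -nx, -ny
    length = math.hypot(nx, ny)
    return (nx / length, ny / length)
```

Note that the order of the two bracket points does not matter here, which is the point of the approach: the direction comes from the Sgroup, not from how the line happened to be stored.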

Internship in Bioinformatics/Cheminformatics

We are looking for a candidate for an internship/trainee position in bioinformatics/cheminformatics at the European Bioinformatics Institute (EMBL-EBI) to work on a high-performance generator for chemical structures. This position requires strong programming skills in Java, a reasonable working knowledge of chemical structures and graph theory, as well as an interest in learning about Apache Hadoop and related technologies.
The initial contract will be for 6 months with a monthly internship salary of £800.

Please send your application to

So, now you have SMILES that are faulty... visualize them?

So, you validated your list of SMILES in the paper you were planning to use (or about to submit), and you found a shortlist of SMILES strings that do not look right. Well, let's visualize them.

We all used to use the Daylight Depict tool, but this is no longer online. I blogged previously already about using AMBIT for SMILES depiction (which uses various tools for depiction; doi:10.1186/1758-2946-3-18), but now John May released a CDK-only tool, called CDK Depict. The download section offers a jar file and a war for easy deployment in a Tomcat environment. But for the impatient, there is also this online host where you can give it a try (it may go offline at some point?).

Just copy/paste your shortlist there, and visually see what is wrong with them :) Big HT to John for doing all these awesome things!

How to test SMILES strings in Supplementary Information

Source. License: CC-BY 2.0.
When you stumble upon a nice paper describing a new predictive or explanatory model for a property or a class of compounds that has your interest, the first thing you do is test the training data. For example, validating SMILES (or OpenSMILES) strings in such data files is now easy with the many Open Source tools that can parse SMILES strings: the Chemistry Toolkit Rosetta provides many pointers for parsing SMILES strings. I previously blogged about a CDK/Groovy approach.

Cheminformatics toolkits need to understand what the input is, in order to correctly calculate descriptors. So, let's start there. It does not matter so much which toolkit you use and I will use the Chemistry Development Kit (doi:10.1021/ci025584y) here to illustrate the approach.

Let's assume we have a tab-separated values file, with the compound identifier in the first column and the SMILES in the second column. That can easily be parsed in Groovy. For each SMILES we parse it and determine the CDK atom types. For validation of the supplementary information we only want to report the fails, but let's first show all atom types:

import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.atomtype.CDKAtomTypeMatcher;

parser = new SmilesParser(
  SilentChemObjectBuilder.getInstance()
);
matcher = CDKAtomTypeMatcher.getInstance(
  SilentChemObjectBuilder.getInstance()
);

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") { // header line
    mol = parser.parseSmiles(smiles)
    println "$id -> $smiles";

    // check CDK atom types
    types = matcher.findMatchingAtomTypes(mol);
    types.each { type ->
      if (type == null) {
        println "  no CDK atom type"
      } else {
        println "  atom type: " + type.atomTypeName
      }
    }
  }
}

This gives output like:

mo1 -> COC
  atom type: C.sp3
  atom type: O.sp3
  atom type: C.sp3

If we rather only report the errors, we make some small modifications and do something like:

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") { // header line
    mol = parser.parseSmiles(smiles)
    errors = 0
    report = ""

    // check CDK atom types
    types = matcher.findMatchingAtomTypes(mol);
    types.each { type ->
      if (type == null) {
        errors += 1;
        report += "  no CDK atom type\n"
      }
    }

    // report only the compounds with problems
    if (errors > 0) {
      println "$id -> $smiles";
      print report;
    }
  }
}

Alternatively, you can use the InChI library to do such checking. And here too, we will use the CDK and the CDK-InChI integration (doi:10.1186/1758-2946-5-14).

import org.openscience.cdk.inchi.InChIGeneratorFactory;
import net.sf.jniinchi.INCHI_RET;

factory = InChIGeneratorFactory.getInstance();

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") { // header line
    mol = parser.parseSmiles(smiles)

    // check InChI warnings
    generator = factory.getInChIGenerator(mol);
    if (generator.returnStatus != INCHI_RET.OKAY) {
      println "$id -> $smiles";
      println generator.message;
    }
  }
}

The advantage of doing this, is that it will also give warnings about stereochemistry, like:

mol2 -> BrC(I)(F)Cl
  Omitted undefined stereo

I hope this gives you some ideas on what to do with content in supplementary information of QSAR papers. Of course, this works just as well for MDL molfiles. What kind of validation do you normally do?