Adding chemical compounds to Wikidata

(write-up in progress)

Adding chemical compounds to Wikidata is not difficult. You can store the chemical formula (P274), (canonical) SMILES (P233), InChIKey (P235) (and InChI (P234), of course), as well as various database identifiers (see what I wrote about that here). It also allows storing provenance, and has predicates for that too.
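
For example, once a compound is in, those properties can be queried back with a short SPARQL query (a sketch using just the properties mentioned above; the OPTIONALs are there because not every entry has all of them):

    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?compound ?formula ?smiles ?inchikey WHERE {
      ?compound wdt:P235 ?inchikey .
      OPTIONAL { ?compound wdt:P274 ?formula }
      OPTIONAL { ?compound wdt:P233 ?smiles }
    }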

So, to enter a new structure for a compound, you add the compound information to Wikidata. Of course, make sure to create the needed accounts first, particularly one for Wikidata (create account) (I am not sure if the next steps need a more general Wikimedia account too).

Entering the research paper
Magnus Manske pointed me to this helper tool. If you have the DOI of the paper, it is easy to add a new paper. This is what the tool shows for doi:10.1128/AAC.01148-08 (but no longer when you try, as the paper has since been added!):


You need permission to run this script; the tool will alert you about that and give instructions on how to get permission. After I clicked Open in QuickStatements, I got this output, showing me that an entry was created in Wikidata for this paper:


Later, I can use the new Q-code (Q22309806) as the source for statements about the compound (formula, etc.).

Draw your compound and get an InChIKey
The next step is to draw a compound and get an InChIKey. This can be done with many tools, including Bioclipse. Rajarshi opted for alternatives:

Then check that the compound is not already in Wikidata. You can use this SPARQL query for that, using the InChIKey of the compound (this one is for acetic acid, so it will be found):


For convenience, here the copy/pastable SPARQL:
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?compound WHERE {
    ?compound wdt:P235 "QTBSBXVTEAMEQO-UHFFFAOYSA-N" .
    }
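
If you prefer to run this check from Bioclipse instead of the web interface, the same remote SPARQL call used in the scripts elsewhere on this blog works here too (a sketch):

    sparql = """
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?compound WHERE {
      ?compound wdt:P235 "QTBSBXVTEAMEQO-UHFFFAOYSA-N" .
    }
    """
    mappings = rdf.sparqlRemote("https://query.wikidata.org/sparql", sparql)
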
Entering the compound
So, the compound is not already in Wikidata; time to add it. The minimal information you should provide is the following:
  • mark the new entry as 'instance of' (P31) 'chemical compound' (Q11173)
  • the chemical formula and SMILES (use as reference the paper)
    • add the reference to the paper you entered above
  • add the InChIKey and/or InChI
The first step is to create a new Wikidata entry. The Create new item menu in the left side panel can be used, showing a page like this:


As a label you can use the name used in the paper for the compound, even if it is a code, and as description 'chemical compound' will do for now; it can be changed later.
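
Alternatively, the same minimal information can be entered with QuickStatements, the tool used above for the paper. A sketch of what that could look like (columns are TAB-separated; the label and the acetic acid values are placeholders for your compound's data; P31 is 'instance of', Q11173 'chemical compound', and S248 'stated in' attaches the paper entered above as reference):

    CREATE
    LAST	Len	"compound 1"
    LAST	Den	"chemical compound"
    LAST	P31	Q11173
    LAST	P235	"QTBSBXVTEAMEQO-UHFFFAOYSA-N"
    LAST	P233	"CC(=O)O"	S248	Q22309806
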
    Feel free to add as much information about the compound as you can find. There are some chemically rich entries in Wikidata, such as that for acetic acid (Q47512).

    The quality of SMILES strings in Wikidata

    Russian Wikipedia on tungsten hexacarbonyl.
    One thing that machine readability adds is all sorts of machine processing. Validation of data consistency is one. For SMILES strings, one of the things you can do is test if the string parses at all. Wikidata is machine readable and, in fact, easier to parse than Wikipedia, for which the SMILES strings were validated recently in a J. Cheminformatics paper by Ertl et al. (doi:10.1186/s13321-015-0061-y).

    Because I was wondering about the quality of the SMILES strings (and because people ask me about these things), I made some time today to run a test:
    1. SPARQL for all SMILES strings
    2. process each one of them with the CDK SMILES parser
    I can do both easily in Bioclipse with an integrated script:

    identifier = "P233" // SMILES
    type = "smiles"

    sparql = """
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?compound ?smiles WHERE {
      ?compound wdt:P233 ?smiles .
    }
    """
    mappings = rdf.sparqlRemote("https://query.wikidata.org/sparql", sparql)

    outFilename = "/Wikidata/badWikidataSMILES.txt"
    if (ui.fileExists(outFilename)) ui.remove(outFilename)
    fileContent = ""
    for (i=1; i<=mappings.rowCount; i++) {
      try {
        wdID = mappings.get(i, "compound")
        smiles = mappings.get(i, "smiles")
        mol = cdk.fromSMILES(smiles)
      } catch (Throwable exception) {
        fileContent += (wdID + "," + smiles + ": " +
                        exception.message + "\n")
      }
      if (i % 1000 == 0) js.say("" + i)
    }
    ui.append(outFilename, fileContent)
    ui.open(outFilename)

    It turns out that out of the more than 16 thousand SMILES strings in Wikidata, only 42 could not be parsed. That does not mean the others are correct, but it does mean these 42 are wrong. Many of them turned out to be imported from the Russian Wikipedia, which is nice, as it gives me the opportunity to work in that Wikipedia instance too :)

    At this moment, some 19 SMILES still need fixing (the list will change over time, so by the time you read this...):

    SWAT4LS in Cambridge

    Wordle of the #swat4ls tweets.
    Last week the BiGCaT team was present with three people (Linda, Ryan, and me) at the Semantic Web Applications and Tools 4 Life Sciences meeting in Cambridge (#swat4ls). It's a great meeting, particularly because of the workshops and hackathon. Previously, I attended the meetings in Amsterdam (where I gave this presentation) and Paris (which I apparently did not blog about).

    Workshops
    I have mixed feelings about missing half of the workshops on Monday for a visit to one of our Open PHACTS partners, but I do not regret that meeting at all; I just wish I could have done both. During the visit we spoke particularly about WikiPathways and our collaboration in this area.

    The Monday morning workshops were cool. First, Evan Bolton and Gang Fu gave an overview of their PubChemRDF work. I have been involved in that in the past, and I greatly enjoyed seeing the progress they have made, and a rich overview of the 250 GB of data they make available on their FTP site (sorry, the rights info has not gotten any clearer over the years, but it is generally considered "open"). The RDF now covers, for example, the biosystems module too, so that I can query PubChem for all compounds in WikiPathways (and compare that against internal efforts).

    The second workshop I attended was by Andra and others about Wikidata. The room, about 50 people, all started editing Wikidata, in exchange for a chocolate letter:

    The editing was about the prevalence of two diseases. Both topics continued during the hackathon, see below. Slides of this presentation are online. But I missed the DisGeNET workshop, unfortunately :(

    Conference
    The conference itself (in the new part of Clare College, even the conference dinner) started on the second day, and all presentations are backed by a paper, linked from the program. Not having attended a semantic web conference in the past ~2 years, it was nice to see the progress in the field. Some papers I found interesting:
    But the rest is most certainly worth checking out too! The Webulous I was able to get going with some help (I was not paying enough attention to the GUI) for eNanoMapper:

    A Google Spreadsheet where I restricted the content of a set of cells to only subclasses of the "nanomaterial" class in the eNanoMapper ontology (see doi:10.1186/s13326-015-0005-5).
    The conference ended with a panel discussion, and despite the efforts of me and the other panel members (Frank Gibson – Royal Society of Chemistry, Harold Solbrig – Mayo Clinic, Jun Zhao – University of Oxford), it took a while before the conference audience really started joining in. Partly this was because the conference organization had asked the community for questions, and those questions clearly did not resonate with the audience. It was not until we started discussing publishing that it became more lively. My point there was that I believe semantic web applications and tools are not really the rate-limiting factor anymore, and that if we really want to make a difference, we must start changing the publishing industry. This has been said by me and others for many years already, but the pace at which things change is too low. Someone mentioned a chicken-and-egg situation, but I really believe it is all just a choice we make, with an easy solution: pick up a knife, kill the chicken, and have a nice dinner. It is annoying to see all the great efforts at this conference, but much of it is limited because our writing style makes nice stories and yields few machine-readable facts.

    Hackathon
    The hackathon was held at the EBI in Hinxton (south/elixir building) and during the meeting I had a hard time deciding what to hack on: there just were too many interesting technologies to work on, but I ended up working on PubChem/HDT (long) and Wikidata (short). The timings are based on the amount of help I needed to bootstrap things and how much I can figure out at home (which is a lot for Wikidata).

    HDT (header, dictionary, triple) is a not-so-new but under-the-radar technology for storing triples in a binary, file-based store. The specification outlines this binary format as well as the index. That means that you can share triple data compressed and indexed. That opens up new possibilities. One thing I am interested in is using this approach for sharing link sets (doi:10.1007/978-3-319-11964-9_7) for BridgeDb, our identifier mapping platform. But there is much more, of course: share life science databases on your laptop.

    This hack was triggered by a beer with Evan Bolton and Arto Bendiken. Yes, there is a Java library, hdt-java, and for me the easiest way to work out how to use a Java API is to write a Bioclipse plugin. Writing the plugin is trivial (the New Wizard does the hard work in seconds), though setting up a Bioclipse development environment is less so. But then started the dependency hacking. The Jena version it depends on is incompatible with the version in Bioclipse right now, but that is not a big deal for Eclipse, and the outcome is that we have both versions on the classpath :) That, however, did require me to introduce a new plugin, net.bioclipse.rdf.core with the IRDFStore, something I wanted to do for a long time, because that is also needed if one wants to use Sesame/OpenRDF instead of Jena.

    So, after lunch I was done with the code cleanup, and I got to the HDT manager again. Soon, I could open an HDT file. I first had the API method read it into memory, but that's not what I wanted, because I want to open large HDT files. Because it uses Jena, it conveniently provides a Jena Model object, so adding SPARQL support was easy; I cannot use the old SPARQL code, because then I would start mixing Jena versions, but since all is Open Source, I just copied/pasted the code (which was written by me in the first place, doi:10.1186/2041-1480-2-s1-s6; interestingly, work that originates from my previous SWAT4LS talk :). Then, I could do this:
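
    Something along these lines (a sketch from memory; the exact hdt manager method name is an assumption on my part, while rdf.sparql is the same call used for other IRDFStore objects on this blog):

      store = hdt.loadHDT("/HDT/wiktionary.hdt") // manager method name is an assumption
      sparql = """
      SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10
      """
      results = rdf.sparql(store, sparql)
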
    It is file based, which is different from a full triple store server. So, questions arise about performance. Creating an index takes time and memory (1 GB of heap space, for example). However, the index file can be shared (downloaded), and then an HDT file "opens" in a second in Bioclipse. Of course, the opening does not do anything special, like loading into memory, and should be compared to connecting to a relational database. The querying is what takes the time. Here are some numbers for the Wiktionary data that the RDFHDT team provides as an example data set:
    However, I am not entirely sure what to compare this against. I will have to experiment with, for example, ChEMBL-RDF (maybe update the Uppsala version, see doi:10.1186/1758-2946-5-23). The advantage would be that ChEMBL data could easily be distributed along with Bioclipse to service the decision support features, because the typical query asks for data for a particular compound, not all compounds. If that works within less than 0.1 seconds, then this may give a nice user experience.

    But before I reach that, it needs a bit more hacking:
    1. take the approach I took with BridgeDb mapping databases for sharing HDT files (which has the advantage that you get a decent automatic updating system, etc)
    2. ensure I can query over more than one HDT file
    And probably a bit more.

    Wikidata and WikiPathways
    After the coffee break I joined the Wikidata people, and sat down to learn about the bots. However, Andra wanted to finish something else first, where I could help out. Considering I can probably manage to hack up a bot anyway, we worked on the following. Multiple databases about genes, proteins, and metabolites like to link these biological entities to pathways in WikiPathways (doi:10.1093/nar/gkv1024). Of course, we love to collaborate with all the projects that integrate WikiPathways into their systems, but I would personally rather use a solution that services all needs, if only because then people can do this integration without needing our time. Of course, this is an idea we pitched about a year ago in the Enabling Open Science: WikiData for Research proposal (doi:10.5281/zenodo.13906).

    That is, would it not be nice if people could just pull the links between the biological entities and WikiPathways from Wikidata, using one of the many APIs it has (SPARQL, REST), supporting multiple formats (XML, JSON, RDF)? I think so, as you might have guessed. So does Andra, and he asked me if I could start the discussions in the Wikidata community, which I happily did. I'm not sure about the outcome, because having links like these is not of their prime interest - they did not like the idea of links to the Crystallography Open Database much yet, with the argument that it is a one-to-many relation - though this is exactly what the PDB identifier is too, and that is accepted. So, it's a matter of notability again. But this is what the current proposal looks like:


    Let's see how the discussion unfolds. Please feel free to chime in and show your support, comments, questions, or opposition, so that we can together get this right.
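
    If the property gets accepted, pulling all links between biological entities and WikiPathways could then be a one-liner of SPARQL (a sketch; PXXXX is a placeholder for whatever property number the proposal would get):

      PREFIX wdt: <http://www.wikidata.org/prop/direct/>
      SELECT ?entity ?wpid WHERE {
        ?entity wdt:PXXXX ?wpid .
      }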

    Chemistry Development Kit
    There is undoubtedly a lot more, but I have been summarizing the meeting for about three hours now, getting notes together etc. A last thing I want to mention now is the CDK. Cheminformatics is, after all, a critical feature of life science data, and I spoke with a few people about the CDK. I also visited NextMove Software on Friday, where John May works nowadays, who did a lot of work on the CDK recently (we also spoke about WikiPathways and eNanoMapper). NextMove is doing great stuff (thanks for the invitation), and so did John during his PhD in Chris Steinbeck's group at the EBI. During the conference I also spoke with others about the CDK, and I am following up on these conversations.

    CDK Literature #8

    Tool validation
    The first paper this week is a QSAR paper. In fact, it does some interesting benchmarking of a few tools with a data set of about 6000 compounds. It includes looking into the applicability domain, and studies the error of prediction for compounds inside and outside the chemical space defined by the training set. The paper indirectly uses the CDK descriptor calculation functionality, by using EPA's T.E.S.T. toolkit (at least one author, Todd Martin, contributed to the CDK).

    Callahan, A., Cruz-Toledo, J., Dumontier, M., Apr. 2013. Ontology-Based querying with Bio2RDF's linked open data. Journal of Biomedical Semantics 4 (Suppl 1), S1+. URL http://dx.doi.org/10.1186/2041-1480-4-s1-s1

    Tetranortriterpenoid
    Arvind et al. study tetranortriterpenoids using a QSAR approach involving CoMFA and the CPSA descriptor (green OA PDF). The latter CDK descriptor is calculated using Bioclipse. The study finds that using compound classes can improve the regression.

    Arvind, K., Anand Solomon, K., Rajan, S. S., Apr. 2013. QSAR studies of tetranortriterpenoid: An analysis through CoMFA and CPSA parameters. Letters in Drug Design & Discovery 10 (5), 427-436. URL http://dx.doi.org/10.2174/1570180811310050010

    Accurate monoisotopic masses
    Another useful application of the CDK is the Java wrapping of the isotope data in the Blue Obelisk Data Repository (BODR). Mareile Niesser et al. use Rajarshi's rcdk package for R to calculate the differences in accurate monoisotopic masses. They do not cite the CDK directly, but do mention it by name in the text.

    Niesser, M., Harder, U., Koletzko, B., Peissner, W., Jun. 2013. Quantification of urinary folate catabolites using liquid chromatography–tandem mass spectrometry. Journal of Chromatography B 929, 116-124. URL http://dx.doi.org/10.1016/j.jchromb.2013.04.008

    Bioclipse 2.6.2 with recent hacks #2: reading content from online Google Spreadsheets

    Update 2015-06-04: the authentication with the Google Drive has changed; I need to update the code and am afraid I missed the point, so that the below code is not working right now :(

    Similar to the previous post in this new series, this post will outline how to make use of the Google Spreadsheet functionality in Bioclipse 2.6.2. But before I provide the steps needed to install the functionality, first consider this Bioclipse JavaScript:

      google.setUserCredentials(
        "your.account", "16charpassword"
      )
      google.listSpreadsheets()
      google.listWorksheets(
        "ORCID @ Maastricht University"
      )
      data = google.loadWorksheet(
        "ORCID @ Maastricht University",
        "with works"
      )

    Because that's what this functionality does: read data from Google Spreadsheets. That opens up an integration of Google Spreadsheets with your regular data analysis workflows. I am not sure if Bioclipse is the only tool that embeds the Google client code to access these services, and I can imagine similar functionality is available from R, Taverna, and KNIME.
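
    What you then do with the data is up to you. Assuming the worksheet comes back as the same kind of string matrix the rdf manager returns (an assumption on my side), iterating over the rows could look like:

      for (i = 1; i <= data.rowCount; i++) {
        js.say(data.get(i, 1)) // first column of row i; assumes a (row, column) getter
      }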

    Getting your credentials
    The first call to the google manager requires your login details. But don't use your regular password: you need an application password. This specific sixteen-character password needs to be created manually in your web browser, following this link. Create a new App password ("Other (Customized name)") and use this password in Bioclipse.

    Installing Bioclipse 2.6.2 and the Google Spreadsheet functionality
    The first thing you need to do (unless you already did that, of course) is install Bioclipse 2.6.2 (the beta) and enable the advanced mode. This is outlined in my previous post, up to Step 1. The update site, obviously, is different, and in Step 2 of that post you should use:

    1. Name: Open Notebook Science Update Site
    2. Location: http://pele.farmbio.uu.se/jenkins/job/Bioclipse.ons/lastSuccessfulBuild/artifact/buckminster.output/net.bioclipse.ons_site_1.0.0-eclipse.feature/site.p2/
    Yes, the links only seem to get longer and longer. Just continue to the next step and install the Google Feature:


    That's it, have fun!

    Oh, and this hack is not so recent. The first version of the net.bioclipse.google plugin and matching manager, as used in the above code, dates back to January 2011, when I had just started at the Karolinska Institutet. But the code to download data from spreadsheets is even older, and goes back to 2008, when I worked with Cameron Neylon and Pierre Lindenbaum on creating RDF for data being collected by Jean-Claude Bradley. If you're interested, check the repository history and this book chapter.

    Dr. J. Alvarsson: Bioclipse 2, signature fingerprints, and chemometrics

    Last Friday I attended the PhD defense of, now, Dr. Jonathan Alvarsson (Dept. Pharmaceutical Biosciences, Uppsala University), who defended his thesis Ligand-based Methods for Data Management and Modelling (PDF). Key papers resulting from his work (see the list below) include one about Bioclipse 2, particularly covering his work on pluggable managers that enrich scripting languages (JavaScript, Python, Groovy) with domain-specific functionality, which I make frequent use of (doi:10.1186/1471-2105-10-397); a paper about Brunn, a LIMS system for microplates, which is based on Bioclipse 2 (doi:10.1186/1471-2105-12-179); and a set of chemometrics papers, looking at scaling up pattern recognition via QSAR model building (e.g. doi:10.1021/ci500361u). He is also an author on several other papers, on a number of which we collaborated. Check his Google Scholar profile.

    In Sweden there is one key opponent, though further questions can be asked by a few other committee members. John Overington (formerly of ChEMBL) was the opponent, and he asked Jonathan questions for at least an hour, going through the thesis. Of course, I don't remember most of it, but there were a few questions that I remember and want to bring up. One issue was about the uptake of Bioclipse by the community, and, for example, how large the community is. The answer is that this is hard to quantify; there are download statistics and there is actual use.

    Download statistics of the Bioclipse 2.6.1 release.
    Like citation statistics (the Bioclipse 1 paper was cited close to 100 times; Bioclipse 2 is approaching 40 citations), download statistics reflect this uptake but are hardly direct measurements. When I first learned about Bioclipse, I realized that it could be a game changer. But it did not become one, and I still don't quite understand why not. It looks good, is very powerful, very easy to extend (which I still make a lot of use of), fairly easy to install (download 2.6.1 or the 2.6.2 beta), etc. And it resulted in a large set of applications; just check the list of papers.

    One argument could be that it is yet another tool to install, and developers are turning to web-based solutions. Moreover, the cheminformatics community has many alternatives, and users seem to prefer smaller, more dedicated tools: a file format converter like Open Babel, or a dedicated descriptor calculator like PaDEL. Simpler messages seem more successful; this is expected in politics, but I guess science is more like politics than we like to believe.

    A second question I remember was about what Jonathan would like to see changed in ChEMBL, the database Overington has worked on for so long. As a data analyst you are in a different mind set: rather than thinking about single molecules, you think about classes of compounds, and rather than thinking about the specific interaction of a drug with a protein, you think about the general underlying chemical phenomenon. A question like this one requires a different kind of thinking: it needs one to think like an analytical chemist, who worries about the underlying experiments. Obvious, but not easy to return to once thinking at a higher (different) level. That experimental error information in ChEMBL can actually support modelling is something we showed using Bayesian statistics (well, Martin Eklund particularly) in Linking the Resource Description Framework to cheminformatics and proteochemometrics (doi:10.1186/2041-1480-2-S1-S6), by including the assay confidence assigned by the ChEMBL curation team. If John had asked me, I would have said I wanted ChEMBL to capture as much of the experimental details as possible.

    Integration of RDF technologies in Bioclipse. Alvarsson worked on the integration of the RDF editor in Bioclipse. The screenshot shows that if you click an RDF resource reflecting a molecule, it will show the structure (if there is a predicate providing the SMILES) and information by predicates in general.
    The last question I want to discuss was about the number of rotatable bonds in paracetamol. If you look at this structure, you would identify four purely σ bonds (BTW, can you have π bonds without σ bonds?). So, four could be the expected answer. You can argue that the peptide bond should not be considered rotatable, and should be excluded, and thus the answer would be two. Now, the CDK answers two, as shown in an example of descriptor calculation in the thesis. I raised my eyebrows, and thought: "I surely hope this is not a bug!". (Well, my thoughts used some more words, which I will not repeat here.)

    But thinking about that, I valued the idea of Open Source: I could just check. I took my tablet from my bag, opened up a browser, went to GitHub, and looked up the source code. It turned out it was not a bug! Phew. No, in fact, it turned out that the default parameters of this descriptor exclude the terminal rotatable bonds:
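
    A few lines of Groovy against the CDK show the same (a sketch; the parameter names and their order, includeTerminals and excludeAmide, are my reading of the source code and should be double-checked):

      import org.openscience.cdk.qsar.descriptors.molecular.RotatableBondsCountDescriptor
      import org.openscience.cdk.silent.SilentChemObjectBuilder
      import org.openscience.cdk.smiles.SmilesParser

      parser = new SmilesParser(SilentChemObjectBuilder.getInstance())
      paracetamol = parser.parseSmiles("CC(=O)Nc1ccc(O)cc1")

      descriptor = new RotatableBondsCountDescriptor()
      // default parameters: terminal rotatable bonds are excluded
      println descriptor.calculate(paracetamol).value

      // include the terminal ones too (assumed parameter order)
      descriptor.parameters = [true, false] as Object[]
      println descriptor.calculate(paracetamol).value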


    So, problem solved. Two follow-up questions, though: 1. can you look up source code during a thesis defense? Jonathan had his laptop right in front of him. I only thought of this yesterday, when I was back home, having dinner with the family. 2. I wonder if I should discuss the idea of parameterizable descriptors more; what do you think? There is a lot of confusion about this design decision in the CDK. For example, it is not uncommon that the CDK only calculates some two hundred descriptor values, whereas tool X calculates more than a thousand. Mmmm, that always makes me question the quality of that paper in general, but, oh well...

    There also was a nice discussion about chemometrics. Jonathan argues in his thesis that a fast modeling method may, at this moment, be a better way forward than more powerful statistical methods. He presented results with LIBLINEAR and signature fingerprints, comparing them to other approaches. The latter were compared with industry standards, like ECFP (which Clark and Ekins implemented for the CDK and have been using in Bayesian statistics on the mobile phone), and for the former Jonathan showed that LIBLINEAR can handle more data than regular SVM libraries, and that using more training data still improves the model more than a "better" statistical method does (which quite matches my own experiences). And with SVMs, finding the right parameters typically is an issue; using an RBF kernel only adds one. Jonathan also indicated that the Tanimoto distance measure for fingerprints is still a more than sufficient approach, which makes me wonder if chemometrics models should not be using a Tanimoto kernel instead of an RBF kernel (though doi:10.1021/ci800441c suggests RBF may really do better for some tasks, at the expense of more parameter optimization).

    To wrap up, I really enjoyed working with Jonathan and I think he did excellent multidisciplinary work. I am also happy that I was able to attend his defense and the events around it. In no way does this post do justice to or fully reflect the defense; it merely reflects how relevant his research is in my opinion, and highlights some of my thoughts during (and after) the defense.

    Jonathan, congratulations!

    Spjuth, O., Alvarsson, J., Berg, A., Eklund, M., Kuhn, S., Mäsak, C., Torrance, G., Wagener, J., Willighagen, E. L., Steinbeck, C., Wikberg, J. E., Dec. 2009. Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 10 (1), 397+.
    Alvarsson, J., Andersson, C., Spjuth, O., Larsson, R., Wikberg, J. E. S., May 2011. Brunn: An open source laboratory information system for microplates with a graphical plate layout design process. BMC Bioinformatics 12 (1), 179+.
    Alvarsson, J., Eklund, M., Engkvist, O., Spjuth, O., Carlsson, L., Wikberg, J. E. S., Noeske, T., Oct. 2014. Ligand-Based target prediction with signature fingerprints. J. Chem. Inf. Model. 54 (10), 2647-2653.

    Bioclipse 2.6.2 with recent hacks #3: using functionality from the OWLAPI

    Update: if you had problems installing this feature, please try again. Two annoying issues have been fixed now.

    Third in this series is this post about the Bioclipse plugin I wrote for the OWLAPI library. I wrote this manager to learn how the OWLAPI works. The OWLAPI feature is available from the same update site as the Linked Data Fragments feature, so you can just follow the steps outlined here (if you had not already).

    Using the manager
    The manager has various methods, for example, for loading an OWL ontology:

    ontology = owlapi.load(
      "/eNanoMapper/enanomapper.owl", null
    );

    If your ontology imports other ontologies, you may need to tell the OWLAPI first where to find those, by defining mappings. For example, I could do the following before making the above call:

    mapper = null; // initially no mapper
    mapper = owlapi.addMapping(mapper,
      "http://purl.bioontology.org/ontology/npo",
      "/eNanoMapper/npo-asserted.owl"
    );
    mapper = owlapi.addMapping(mapper,
      "http://www.enanomapper.net/ontologies/" + 
      "external/ontology-metadata-slim.owl",
      "/eNanoMapper/ontology-metadata-slim.owl"
    )

    I can list which ontologies have been imported with:

    imported = owlapi.getImportedOntologies(ontology)
    for (var i = 0; i < imported.size(); i++) {
      js.say(
        imported.get(i).getOntologyID().getOntologyIRI()
      )
    }

    When the ontology is successfully loaded, I can list the classes and various types of properties:

    classes = owlapi.getClasses(ontology)
    annotProps = owlapi.getAnnotationProperties(ontology)
    declaredProps = owlapi.getPropertyDeclarationAxioms(ontology)

    There likely is some further functionality that needs adding, and I would love to hear which methods you would like to see added.

    CDK Literature #6

    Originally a series I started in CDK News, later continued for some issues on this blog, and then for some time on Google+, CDK Literature is now returning to my blog. BTW, I created a poll about whether CDK News should be picked up again. The reason why we stopped was that we were not getting enough submissions anymore.

    For those who are not familiar with the CDK Literature series: the posts discuss recent literature that cites one of the two CDK papers (the first one is now Open Access). A short description explains what the paper is about and why the CDK is cited. For that I am using CiTO, of which the data is available from CiteULike. That allows me to keep track of how people are using the CDK, resulting, for example, in these wordles.

    I will try to pick up this series again, but may be a bit more selective. The number of CDK-citing papers has grown extensively, resulting in at least one new paper each week (indeed, not even close to the citation rate of DAVID). I aim at covering ~5 papers each week.

    Ring perception
    Ring perception has evolved in the CDK. Originally, there was the Figueras algorithm (doi:10.1021/ci960013p) implementation, which was improved by Berger et al. (doi:10.1007/s00453-004-1098-x). Now, John May (the CDK release manager) has reworked the ring perception in the CDK, also introducing a new API, which I covered recently. Also check John's blog.

    May, J. W., Steinbeck, C., Jan. 2014. Efficient ring perception for the chemistry development kit. Journal of Cheminformatics 6 (1), 3+. URL http://dx.doi.org/10.1186/1758-2946-6-3

    Screening Assistant 2
    A bit longer ago, Vincent Le Guilloux published the second version of their Screening Assistant tool for mining large sets of compounds. The CDK is used for various purposes. The paper is already from 2012 (I am that much behind with this series) and the source code on SourceForge does not seem to have changed much recently.

    Figure 2 of the paper (CC-BY) shows an overview of the Screening Assistant GUI.
    Guilloux, V. L., Arrault, A., Colliandre, L., Bourg, S., Vayer, P., Morin-Allory, L., Aug. 2012. Mining collections of compounds with screening assistant 2. Journal of Cheminformatics 4 (1), 20+. URL http://dx.doi.org/10.1186/1758-2946-4-20

    Similarity and enrichment
    Using fingerprints for compound enrichment, i.e. finding the actives in a set of compounds, is a common cheminformatics application. This paper by Avram et al. introduces a new metric (eROCE). I will not go into details, which are best explained by the paper, but note that the CDK is used via PaDEL and that various descriptors and fingerprints are used. The data set they used to show the performance is one of close to 50 thousand inhibitors of ALDH1A1.

    Avram, S. I., Crisan, L., Bora, A., Pacureanu, L. M., Avram, S., Kurunczi, L., Mar. 2013. Retrospective group fusion similarity search based on eROCE evaluation metric. Bioorganic & Medicinal Chemistry 21 (5), 1268-1278. URL http://dx.doi.org/10.1016/j.bmc.2012.12.041

    The International Chemical Identifier
    It is only because Antony Williams advocated the importance of the InChI in these excellent slides that I list this paper again: I covered it here in more detail already. The paper describes work by Sam Adams to wrap the InChI library into a Java library, how it is integrated in the CDK, and how Bioclipse uses it. It does not formally cite the CDK, which now feels silly. Perhaps I did not add the citation out of fear of self-citation? Who knows. Anyway, you find this paper cited on slide 30 of the aforementioned presentation from Tony.

    Spjuth, O., Berg, A., Adams, S., Willighagen, E., 2013. Applications of the InChI in cheminformatics with the CDK and bioclipse. Journal of Cheminformatics 5 (1), 14+. URL http://dx.doi.org/10.1186/1758-2946-5-14

    Predictive toxicology
    Cheminformatics is a key tool in predictive toxicology. It starts with the assumption that compounds of similar structure behave similarly when coming in contact with biological systems. This is a long-standing paradigm which turns out to be quite hard to use, but it has not been shown to be incorrect either. This paper proposes a new approach using Pareto points and uses the CDK to calculate logP values for compounds. However, I cannot find which algorithm it uses to do so.

    Palczewska, A., Neagu, D., Ridley, M., Mar. 2013. Using pareto points for model identification in predictive toxicology. Journal of Cheminformatics 5 (1), 16+. URL http://dx.doi.org/10.1186/1758-2946-5-16

    Cheminformatics in Python
    ChemoPy is a tool to do cheminformatics in Python. This paper cites the CDK just as one of the tools available for cheminformatics. The tool is available from Google Code. It has not been migrated yet, but they still have about half a year to do so. Then again, given that there does not seem to have been activity since 2013, I recommend looking at Cinfony instead (doi:10.1186/1752-153X-2-24): it exposes the CDK and is still maintained.

    Cao, D.-S., Xu, Q.-S., Hu, Q.-N., Liang, Y.-Z., Apr. 2013. ChemoPy: freely available python package for computational biology and chemoinformatics. Bioinformatics 29 (8), 1092-1094. URL http://dx.doi.org/10.1093/bioinformatics/btt105

    Bioclipse 2.6.2 with recent hacks #1: Wikidata & Linked Data Fragments

    Bioclipse dialog to upload chemical structures to an OpenTox repository.
    We chem- and bioinformaticians have it easy when it comes to Open Science. Sure, writing documentation, doing unit testing, etc. takes a lot of time, but testing some new idea is done easily. Yes, people got used to that, so trying to explain that doing it properly actually takes long (documentation, unit testing) can be rather hard.

    Important for this is a platform that allows you to experiment easily. For many biologists this environment is R or Python. To me, with most of the libraries important to me written in Java, this is Groovy (e.g. see my Groovy Cheminformatics book) and Bioclipse (doi:10.1186/1471-2105-8-59). Sometimes these hacks grow to be full papers, like what started as OpenTox support (doi:10.1186/1756-0500-4-487), which even paved (for me) the way to the eNanoMapper project!

    But often these hacks are just for me personally, or at least initially. However, I have no excuse not to make them available to a wider audience too. Of course, the source code is easy: I normally have even the smallest Bioclipse hack available somewhere on GitHub (look for bioclipse.* repositories). But it is getting even better, now that Arvid Berg (Bioclipse team) gave me the pointers to ensure you can install those hacks, taking advantage of Uppsala's build system.

    So, from now on, I will blog how to install Bioclipse hacks I deem useful for a wider audience, starting with this post on my Wikidata/Linked Data Fragments hack I used to get more CAS registry number mappings to other identifiers.

    Install Bioclipse 2.6.2
    The first thing you need is Bioclipse 2.6.2. That's the beta release of Bioclipse, and required for my hacks. From this link you can download binary nightly builds for GNU/Linux, MS-Windows, and OS/X. For the first two, 32 and 64 bit builds are available. You may need to install Java; version 1.7 should do fine. Unpack the archive, and then start the Bioclipse executable. For example, on GNU/Linux:

      $ tar zxvf Bioclipse.2.6.2-beta.linux.gtk.x86.tar.gz
      $ cd Bioclipse.2.6.2-beta/
      $ ./bioclipse

    Install the Linked Data Fragments manager
    The default update site already has a lot of goodies you can play with. Just go to Install → New Feature.... That will give you a nice dialog like this one (which allows you to install the aforementioned Bioclipse-OpenTox feature):



    But that update site doesn't normally have my quick hacks. This is where Arvid's pointers come in, which I hope to carefully reproduce here so that my readers can install other Bioclipse extensions too.

    Step 1: enable the 'Advanced Mode'
    The first step is to enable the 'Advanced Mode'; that is, unless you are advanced, forget about this. Fortunately, the fact that you haven't given up on reading my blog yet is a good indication you are advanced. Go to the Window → Preferences menu and enable the 'Advanced Mode' in the dialog, as shown here:


    When done, click Apply and close the dialog with OK.

    Step 2: add an update site from the Uppsala build system
    The first step enables you to add arbitrary new update sites, like update sites available from the Uppsala build system, by adding a new menu option. To add new update sites, use this new menu option and select Install → Software from update site...:


    By clicking the Add button, you go this dialog where you should enter the update site information:


    This dialog will become a recurrent thing in this series, though the content may change from time to time. The information you need to enter is (the name is not too important and can be something else that makes sense to you):

    1. Name: Bioclipse RDF Update Site
    2. Location: http://pele.farmbio.uu.se/jenkins/job/Bioclipse.rdf/lastSuccessfulBuild/artifact/site.p2/

    After clicking OK in the above dialog, you will return to the Available Software dialog (shown earlier).

    Step 3: installing the Linked Data Fragments Feature
    The Available Software dialog will now show a list of features available from the just-added update site:


    You can see the Linked Data Fragments Feature is now listed which you can select with the checkbox in front of the name (as shown above). The Next button will walk you through a few more pages in this dialog, providing information about dependencies and a page that requires you to accept the Open Source licenses involved. And at the end of these steps, it may require you to reboot Bioclipse.

    Step 4: opening the JavaScript Console and verify the new extension is installed
    Because the Linked Data Fragments Feature extends Bioclipse with a new, so-called manager (see doi:10.1186/1471-2105-10-397), we need to use the JavaScript Console (or Groovy Console, or Python Console, if you prefer those languages). Make sure the JavaScript Console is open, or do this via the menu Windows → Show View → JavaScript Console and type in the console view man ldf which should result in something like this:


    You can also type man ldf.createStore to get a brief description of the method I used to get a Linked Data Fragments wrapper for Wikidata in my previous post, which is what you should reread next.

    Have fun, and I am looking forward to hearing how you use Linked Data Fragments with Bioclipse!

    Getting CAS registry numbers out of WikiData

    doi 10.15200/winn.142867.72538

    I have promised my Twitter followers the SPARQL query you have all been waiting for. Sadly, you had to wait for it for more than two months. I'm sorry about that. But, here it is:
      PREFIX wd: <http://www.wikidata.org/entity/>

      SELECT ?compound ?id WHERE {
        ?compound wd:P231s [ wd:P231v ?id ] .
      }
    What this query does is ask for all things (let's call whatever is behind the identifier a "compound"; of course, it can be mixtures, ill-defined chemicals, nanomaterials, etc.) that have a CAS registry identifier. This query results in a nice table of Wikidata identifiers (e.g. Q47512 is acetic acid) and matching CAS numbers, 16298 of them.

    Because Wikidata is not specific to the English Wikipedia, CAS numbers from other origins will show up too. For example, the CAS number for N-benzylacrylamide (Q10334928) is provided by the Portuguese Wikipedia:


    I used Peter Ertl's cheminfo.org (doi:10.1186/s13321-015-0061-y) to confirm this compound indeed does not have an English page, which is somewhat surprising.

    The SPARQL query uses a predicate specific to the CAS registry number (P231). Other identifiers have similar predicates, like those for PubChem compound (P662) and ChemSpider (P661). That means Wikidata can become a community-crowdsourced set of identifier mappings, which is one of the things Daniel Mietchen, me, and a few others proposed in this H2020 grant application (doi:10.5281/zenodo.13906). The SPARQL query is run by the Linked Data Fragments platform, which you should really check out too, using the Bioclipse manager I wrote around it.
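
    For example, assuming the PubChem compound property follows the same statement/value pattern (an assumption; I only checked P231), the analogous mapping query would be:

      PREFIX wd: <http://www.wikidata.org/entity/>

      SELECT ?compound ?id WHERE {
        ?compound wd:P662s [ wd:P662v ?id ] .
      }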

    The full Bioclipse script looks like:
      wikidataldf = ldf.createStore(
        "http://data.wikidataldf.com/wikidata"
      )

      // P231 CAS
      identifier = "P231"
      type = "cas"

      sparql = """
      PREFIX wd: <http://www.wikidata.org/entity/>

      SELECT ?compound ?id WHERE {
        ?compound wd:${identifier}s [ wd:${identifier}v ?id ] .
      }
      """
      mappings = rdf.sparql(wikidataldf, sparql)

      // recreate an empty output file
      outFilename = "/Wikidata/${type}2wikidata.csv"
      if (ui.fileExists(outFilename)) {
        ui.remove(outFilename)
        ui.newFile(outFilename)
      }

      // save to a file
      for (i=1; i<=mappings.rowCount; i++) {
        wdID = mappings.get(i, "compound").substring(3)
        ui.append(
          outFilename,
          wdID + "," + mappings.get(i, "id") + "\n"
        )
      }
    BTW, of course, all this depends on work by many others, including the core RDF generation with the Wikidata Toolkit. See also the paper by Erxleben et al. (PDF).


    Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., Vrandečić, D., 2014. Introducing wikidata to the linked data web. In: Mika, P., Tudorache, T., Bernstein, A., Welty, C., Knoblock, C., Vrandečić, D., Groth, P., Noy, N., Janowicz, K., Goble, C. (Eds.), The Semantic Web – ISWC 2014. Vol. 8796 of Lecture Notes in Computer Science. Springer International Publishing, pp. 50-65. URL http://dx.doi.org/10.1007/978-3-319-11964-9_4

    Mietchen, D., Others, M., Anonymous, Hagedorn, G., Jan. 2015. Enabling open science: Wikidata for research. URL http://dx.doi.org/10.5281/zenodo.13906

    Ertl, P., Patiny, L., Sander, T., Rufener, C., Zasso, M., Mar. 2015. Wikipedia chemical structure explorer: substructure and similarity searching of molecules from Wikipedia. Journal of Cheminformatics 7 (1), 10+. URL http://dx.doi.org/10.1186/s13321-015-0061-y