Revisited: Handling SD files with JavaScript in Bioclipse

After asking on the Bioclipse users list it turns out there was an unpublished manager method to trigger parsing of the SDF properties (Arvid++), allowing to simplify creation of the index and not needed parsing of the chemical structures into a CDK molecule model.

That simplifies my earlier code to:

  hmdbIndex =
    molTable.createSDFIndex(
      "/WikiPathways/hmdb.sdf"
    );
  props = new java.util.HashSet();
  props.add("HMDB_ID");
  molTable.parseProperties(hmdbIndex, props);
  
  idIndex = new java.util.HashMap();
  molCount = hmdbIndex.getNumberOfMolecules();
  for (i=0; i<molCount; i++) {
    hmdbID = hmdbIndex.getPropertyFor(i, "HMDB_ID")
    idIndex.put(hmdbID, i);
  }

The next step in my use case is process some input (WikiPathways GPML files to be precise), detect what HMDB identifier is used, extract the SD file entry for that identifier and append it to a new SD file (using a new ui.append() method):

  hmdbCounter = idIndex.get(idStr)
  sdEntry = hmdbIndex.getRecord(hmdbCounter)
  sdEntry = sdEntry.substring(0, sdEntry.indexOf("M  END"))
  ui.append("/WikiPathways/db.sdf", sdEntry);
  ui.append("/WikiPathways/db.sdf", "M  END\n");
  ui.append("/WikiPathways/db.sdf", "> <WPM>\n");
  ui.append("/WikiPathways/db.sdf", "WPM" + (Integer.toString(wpmId)).substring(1) + "\n");
  ui.append("/WikiPathways/db.sdf", "\n");
  ui.append("/WikiPathways/db.sdf", "\$\$\$\$\n");

This code actually does a bit more than copying the SD file entry: it also removes all previous SD fields and replace this with a new, internal identifier. Using that identifier, I track some metadata on this metabolite.

Now, there are a million ways of implementing this workflow. If you really want to know, I chose this one because HMDB identifiers is a more prominent ID used in WikiPathways, and for this one, as well as ChEBI, I can use a SD file. For ChemSpider and PubChem identifiers, however, I plan to use the matching Bioclipse client code to pull in MDL molfiles. Bioclipse has functionality for all these needs available as extensions. 

Handling SD files with JavaScript in Bioclipse

I finally got around to continuing with a task to create an SD file for WikiPathways. The problem is more finding the time, than doing it, and the tasks are basically:
  1. iterating over all metabolites in the GPML files
  2. extract the Xref's database and database identifier (see previous link)
  3. extract the molfile from the database SD file
  4. give the WikiPathways metabolite a unique identifier
  5. record that WikiPathways metabolite has a molfile
  6. append that molfile along with the new WikiPathways metabolite ID in a new SD file
It turns out that I can use Uppsala's excellent SD functionality in Bioclipse (using indexing, it opens 2 GB SD files for me) is also available from the JavaScript command line:

  hmdbIndex = molTable.createSDFIndex(
    "/WikiPathways/hmdb.sdf"
  );
  
  idIndex = new java.util.HashMap();
  molCount = hmdbIndex.getNumberOfMolecules();
  for (i=0; i<molCount; i++) {
    mol = hmdbIndex.getMoleculeAt(i);
    if (mol != null) {
      hmdbID = mol.getAtomContainer().getProperty(
        "HMDB_ID"
      );
      idIndex.put(hmdbID, i);
    }
  }

Using this approach, I can create an index by HMDB identifier of molfiles in the HMDB SD file extract just those molfiles which are found in WikiPathways, and create a new WikiPathways dedicated SD file. When I have the HMDB identifiers done, ChEBI, PubChem, and ChemSpider will follow.

Looking for a PhD and a Postdoc to work on Open Science Nanosafety

I am happy that I got my first research grant awarded (EU FP7), which should start after all the contracts are signed etc, somewhere early 2014. The project is about setting up data needs for the analysis of nanosafety studies. And for this, I have the below two position vacancies available now. If you are keen on doing Open Science (CDK, Bioclipse, OpenTox, WikiPathways, ..., ...), working within the European NanoSafety Cluster, and have an affinity with understanding the systems biology of nanomaterials, then you may be interested in applying. Click the screenshots for full details.

PhD position



Postdoc position



New Google+ communities: Bioclipse, Cheminformatics, Semantic MediaWiki

After spending countless hours on Google+, I have realized how useful their new group feature is, for sharing/finding interesting stuff happening around different technologies. It's nice handling of previews of movies, images, webpages etc makes it so much easier to spot interesting stuff. IMO it works FAR better than e.g. twitter, for this.

E.g. by subscribing to groups for the topics you are interested in (I have over 50...), I get tons of interesting stuff on your Google+ home page all the time.

While Google+ groups definitely don't replace mailing lists and IRC, which are superior for discussions, it is a great complement for sharing interesting stuff happening around a technology.

With this in mind, during the last week or so, I've tried to make sure that a few of my favourite softwares and topics have groups, which resulted in a few new ones:

... so make sure to join those of these that you like, and post some interesting stuff there! :)

read more

Why I still prefer CML (was: #ACSIndy formats session)

At the #ACSIndy meeting there was a session one chemical formats (I hope the slides will all come online). Some key tweets (thanx to Tony for the coverage):

But I am still fan of the Chemical Markup Language. In fact, I started using this when XML was not even standardized yet. Even CML has a SGML background. Well, fairly, only months before XML made it into a recommendation, and CML followed. CML is flexible, which to some is a downside; to me it is a big advantage, as it allows me to easily extend it. It support ontologies to do this, and is therefore one of the most machine readable chemical formats.

Of course, a lot depends on the libraries that you are using. For reading, there are various approaches I have taken. Originally, I wrote a library (Willighagen2011) that supported the convention idea in CML, which is a pain to many. This feature is still actively used in Bioclipse and the CDK! Of course, many cheminformaticians do not care too much about explicit semantics, and the community standard is MDL molfile V2000 (someone has exact numbers?), even though the improved V3000 update is already 30 years old (see the first tweet!).

Of  course, browsing through all tweets, I think the session nicely showed some of the newer requirements, many required the extensions presented in this session. These extension may have been part of the original specification (is there an overview of specification documents of all industry standards?), but in many cases these will also be conventions. E.g. a common convention used by cheminformaticians is to use the bond order type 4 in MDL V2000 molfiles to reflect aromaticity, even though the specification defines it differently.

I hope all specifications of these updates and conventions will find their way to the web, with at least the rights to redistribute, allowing independent tools to properly implement these standards. (The right of modification is debatable for standards.)


Willighagen, E. L., 2001. Processing CML conventions in java. Internet Journal of Chemistry 4, 4+.
URL http://www.scribd.com/doc/14333194/Processing-CML-Conventions-in-Java

Next Bioclipse-OpenTox Workshop gig: Mainz, September 30

Ola and I will be giving a Bioclipse-OpenTox workshop in Mainz on Monday 30 September, during the OpenTox EU 2013 meeting. Places are filling up quickly (really :), so sign up now if you like to learn your way around in interactively accessing chemical liabilities.

The focus will likely be on the graphical user decision support interface, so here's what you would be able to do with scripting:


This workshop would also be great if you like to learn on how we use RDF for all of thise!

The Blue Obelisk Data Repository's 10 release

The Blue Obelisk Data Repository (BODR) is not so high profile as other Blue Obelisk projects, but equally important. Well, maybe a tid bit more important: it's a collection of core chemical and physical data, supporting computation chemistry and cheminformatics resources. For example, it is used by at least the CDK, Kalzium, and Bioclipse, but possibly more. Also, it's packages for major Linux distributions, such as Debian (btw, congrats to their 20th birthday!) and Ubuntu.

It doesn't change so often, but just has seen its 10th release. Actually, it was the first release in more than three years. But, fortunately, core chemical facts do not change often, nor much. So, this release has a number of data fixes, a few recent experimental isotope measurements, and also includes the new official names of the livermorium and flerovium elements. There is a full overview of changes.

BODR 10 is brought to you by Jean Brefort, Daniel Leidert, and I did some small bits too. Also big thanks to all project that keep using BODR and contribute by providing high quality feedback reports!

Oh, and if you use BODR isotope or element data, you are kindly invited to cite the one of the Blue Obelisk papers.

O'Boyle, N. et al. Open data, open source and open standards in chemistry: The blue obelisk five years on. Journal of Cheminformatics 3, 37+ (2011). URL http://www.jcheminf.com/content/3/1/37.
Guha, R. et al. The blue obelisk - interoperability in chemical informatics. Journal of Chemical Information and Modeling 46, 991-998 (2006). URL http://dx.doi.org/10.1021/ci050400b.

Analyzing WikiPathways metabolites in Bioclipse is easy with Groovy

Assume you downloaded a set of GPML pathway files from WikiPathways (doi:10.1371/journal.pbio.0060184) and placed those in a Bioclipse (doi:10.1186/1471-2105-10-397) workspace project, then you can easily analyse all metabolites:


Well, genes and proteins too, but I just happen to like metabolites more.

In fact, more interesting than printing the database source and identifier is perhaps opening them in a molecule table. Because I have not update the BridgeDb plugin to easily load identifier mapping databases, let's just use OPSIN (which recently saw its 1.5.0 release) and accept that we don't get to see all metabolites just yet:
    dataMap = bioclipse.fullPath("/WikiPathways/data/")
    gpmlFiles = new File(dataMap).listFiles()

    structureList = cdk.createMoleculeList()
    gpmlFiles.each { file ->
      def data = new XmlParser().parse(file)
      def metabolites = data.DataNode.findAll{
        it.'@Type'.contains('Metabolite')
      }
      metabolites.each() { node ->
        name = node.'@TextLabel'.trim()
        try {
          molecule = opsin.parseIUPACName(name)
          js.print("IUPAC name found: $name \n")
          structureList.add(molecule)
        } catch (Exception exception) {
          // OK, it was not an IUPAC name
        }
      }
    }
    ui.open(structureList)
Then we get to see this (for pathways with names starting with a "B"):


Anyway, this is just playing around. The point is, we can now hook up metabolite information in WikiPathways with any of the other functionality in Bioclipse, such as toxicity prediction, decision support, structural analysis (with the CDK), database look ups, etc, etc.

Or, and that was actually my primary goal this afternoon, to find all GPML Label elements with IUPAC names. But more on that next week.

Writing CDK code in Bioclipse

Of course, there is the CDK manager in Bioclipse (Fig. 2, doi:10.1186/1471-2105-10-397), which exposes a good deal of the commonly used functionality, but yesterday there was this question on the CHMINF-L mailing list about how to find all existing isotopes, given a certain exact mass.

Now, that's something we can do with CDK's IsotopeFactory (with thanx to John for these "stable" URIs!), but I did not want to propose using something arcane as a command line, so was wondering if it could be done in Bioclipse. So, I had to figure out how to create Java objects in JavaScript, a skill I once had but lost.

But, I regained it after some googling and ended up with this code:


It's an image as I had to make one for MyExperiment.org where the source code can be downloaded.

The key JavaScript calls as importPackage() and importClass(), and I am sure there are some things that can be optimized here (please leave a comment if you have a better solution), but once I have the IsotopeFactory, I iterate over all elements and for each ask all isotopes (yeah, I think there is some API methods missing here :). And for each isotope I test if it matches (within a certain error) the search mass, and if it matches, reports the mass number, element symbol, and exact mass (using the js manager).

Minting RDF from CSV files with Bioclipse #MIILS2013

The below slides are part of the introduction for the hands on for the #MIILS2013 course this afternoon. I had the participants look at creating RDF the hard way: using Bioclipse script (using this Bioclipse-OpenTox version). And, they had to follow the Open PHACTS RDF Guidelines, VoID specification, etc.

#MIILS2013 Practical: Bioclipse for RDF minting from Egon Willighagen