Open Notebook Science ONSSP #1: http://onsnetwork.org/

As promised, I slowly set out to explore ONSSPs (Open Notebook Science Service Providers). I do not have a full overview of solutions yet, but I found LabTrove and the Open Notebook Science Network. The latter is more clearly an ONSSP, while the former seems to be the software.

So, my first experiment is with Open Notebook Science Network (ONSN). The platform uses WordPress, a proven technology. I am not a huge fan of the setup, which has so many features that it is sometimes hard to find what you need. Indeed, my first write-up ended up as a Page rather than a Post. On the upside, there is a huge community around it, with experts in every city (literally!). But my ONS is now online and you can monitor my Open research with this RSS feed.

One of the downsides is that the editor is not oriented towards structured data, though there is a feature for Forms which I may need to explore later. My first experiment was a quick, small hack: upgrade Bioclipse with OPSIN 1.6. As discussed in my #jcbms talk, I think it may be good for cheminformatics if we really start writing up step-by-step descriptions of common tasks.

My first observations are that it is an easy platform to work with. Embedding images is easy, and there should be options for chemistry extensions. For example, there is a Jmol plugin for WordPress, there are plugins for Semantic Web support (no clue which one I would recommend), and extensions for bibliographies are available too, if I am not mistaken. We also already see my ORCID prominently listed, and I am not sure whether I did this or whether the ONSN people added it as a default feature.

Even better is the GitHub support by @benbalter that @ONScience made me aware of. The instructions were not crystal clear to me (see issues #25 and #26), but after some suggested fixes (pull request #27) it started working, and I now have a backup of my ONS at GitHub!

So, it looks like I am going to play with this ONSSP a lot more.

First steps in Open Notebook Science

Scheme 2 from this Beilstein Journal of Organic Chemistry paper by Frank Hahn et al.
A few weeks back I blogged about my first Open Notebook Science entry. That post suggests I would look at a few ONS service providers but, honestly, Open Notebook Science Network serves my needs well.

What I have in mind, and will soon advocate, is that the total synthesis approach from organic chemistry fits chem- and bioinformatics research. It may not be perfect, and perhaps somewhat artificial (no pun intended), but I like the idea.

Compound to Compound
Basically, a lab notebook entry should be a step of something larger. You don't write Bioclipse from scratch. You don't do a metabolomics pathway enrichment analysis in one step, either. It's steps, each one taking you from one state to another. Ah, another nice analogy (see automata theory)! In terms of organic chemistry, from one compound to another. The importance here is that the analogy shows that there is no step you should not report. The same applies to cheminformatics: you cannot report a QSAR model without explaining how you cleaned up that SDF file you got from paper X (which is still common practice).

Methods Sections
Organic chemistry literature has well-defined templates on how to report the method for a reaction, including minimal reporting standards for the experimental results. For example, you must report chemical shifts and an elemental composition. In cheminformatics we do not have such templates, but there is no reason not to. Another feature that must be reported is the yield.

Reaction yield
The analogy with organic chemistry continues: each step has a yield. We must report this. I am not sure how; this is one of the things I am exploring and it will be part of my argument. In fact, keeping track of the variance introduced is something I have been advocating for longer. I think it really matters. We, as a research field, now publish a lot of cheminformatics and chemometrics work without taking into account the yield of methods, though, for obvious reasons, chemometrics does so far more than cheminformatics. I won't go into that now, and there is indeed a good amount of benchmark work, but the point is that any cheminformatics "reaction" step should be benchmarked.
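One possible way to make such a "yield" concrete, as a minimal sketch and not a settled definition: treat the yield of a step as the fraction of input records that survive it, and multiply the step yields to get the yield of the whole sequence. The function names and the counts in the example are hypothetical and only serve as illustration.

```javascript
// Yield of a single step: the fraction of input records that survive it.
function stepYield(inputCount, outputCount) {
  if (inputCount <= 0) throw new Error("inputCount must be positive");
  return outputCount / inputCount;
}

// Overall yield of a multi-step "synthesis": the product of the step yields.
function overallYield(steps) {
  return steps.reduce((acc, s) => acc * stepYield(s.input, s.output), 1);
}

// Hypothetical counts, for illustration only.
const exampleSteps = [
  { name: "clean up SD file", input: 1000, output: 900 },
  { name: "parse names", input: 900, output: 450 },
];
console.log(overallYield(exampleSteps));
```

A count-based yield like this only captures what is lost in a step; it says nothing about the variance a step introduces, which would need its own measure.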

Total synthesis
The final aspect is that, by taking this analogy, there is a clear protocol for how cheminformatics, or bioinformatics, work must be reported: as a sequence of detailed small steps. It also means that intermediate "products" can be continued with in multiple ways: you get a directed graph of methods you applied and results you got.

You get something like this:

Created with Graphviz Workspace.

The EWx codes refer to entries in my lab notebook:
  1. EW4: Finding nodes in Anopheles gambiae pathways with IUPAC names
  2. EW5: Finding nodes in Homo sapiens pathways with IUPAC names
  3. EW6: Finding nodes in Rattus norvegicus pathways with IUPAC names
  4. EW7: converting metabolite Labels into DataNodes in WikiPathways GPML
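A directed graph like this can be written down as Graphviz DOT text with very little code. The sketch below uses the entry codes from the list above, but the edges are an assumption made for illustration (the three "finding nodes" entries feeding into EW7) and are not taken from the original figure; the short node titles and the function name are hypothetical too.

```javascript
// Notebook entries as graph nodes; short titles paraphrased from the list above.
const notebookEntries = {
  EW4: "Anopheles gambiae labels",
  EW5: "Homo sapiens labels",
  EW6: "Rattus norvegicus labels",
  EW7: "Labels to DataNodes",
};

// Hypothetical edges [from, to]: which entry builds on which.
const notebookEdges = [
  ["EW4", "EW7"],
  ["EW5", "EW7"],
  ["EW6", "EW7"],
];

// Serialize nodes and edges as a Graphviz DOT digraph.
function toDot(entries, edges) {
  const lines = ["digraph notebook {"];
  for (const [id, title] of Object.entries(entries)) {
    lines.push(`  ${id} [label="${id}: ${title}"];`);
  }
  for (const [from, to] of edges) {
    lines.push(`  ${from} -> ${to};`);
  }
  lines.push("}");
  return lines.join("\n");
}

console.log(toDot(notebookEntries, notebookEdges));
```

Branching a "synthesis" then amounts to adding one more edge from an existing entry to a new one.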

Open Notebook Science
Of course, the above applies also if you do not do Open Notebook Science (ONS). In fact, the above outline is not really different from how I did my research before. However, I see value in using the ONS approach here. By having it Open, it

  1. requires me to be as detailed as possible
  2. allows others to repeat it
Combine this with the advantage of the total synthesis analogy:
  1. "reactions" can be performed in reasonable time
  2. easy branching of the synthesis
  3. clear methodology that can be repeated for other "compounds"
  4. step towards minimal reporting standards for cheminformatics methods
  5. clear reporting structure that is compatible with journal requirements
OK, that is more or less the paper I want to write up and submit to the Jean-Claude Bradley Memorial Issue in the Journal of Cheminformatics and Chemistry Central. It is an idea, something that helps me, and I hope more people find useful bits in this approach.

EW6: Finding nodes in Rattus norvegicus pathways with IUPAC names

Hypothesis: Rattus norvegicus pathways in WikiPathways have DataNodes with labels containing IUPAC names which can be tagged as type Metabolite.

Start date: 2014-09-05 End date: 2014-09-05

Description:

WikiPathways entries in GPML have DataNode objects and Label objects. It was found before [here, here] that metabolites can be encoded in pathways as Label objects and are therefore not machine-readable as Metabolite-type DataNodes and cannot have a database identifier. As such, these metabolites are unusable for pathway analysis of metabolomics data.

By processing these GPML files (they are XML-based) and iterating over all Labels, we can attempt to convert each label into a chemical structure with OPSIN. This works under the assumption that if OPSIN can parse the label into a structure, it is one. Such a label will be recorded along with the pathway identifier for manual inspection. For each structure, a ChemSpider identifier will also be looked up.

Methods

Unchanged protocol.

  • Download the GPML files from WikiPathways
  • Get a working Bioclipse development version (hard) with the OPSIN, InChI, and ChemSpider extensions
  • A Groovy script to iterate over the GPML, find <Label> elements
  • Each <Label> is parsed with OPSIN and if successful, generate an InChI
  • Use the InChIs to find ChemSpider identifiers
  • Output all as a text file and open metabolites in a Structure table

Report 

Similar to the experiments for Anopheles gambiae and Homo sapiens, only curated pathways were analyzed, 143 in total, downloaded from WikiPathways.org on August 24. The Groovy script used is detailed in this experiment.

ratWP

 

The script found 47 Labels that are possibly metabolites in 8 different rat pathways. The full list was uploaded to Gist.

Conclusion: Rat pathways also include metabolites encoded in GPML <Label> elements.

EW5: Finding nodes in Homo sapiens pathways with IUPAC names

Hypothesis: Homo sapiens pathways in WikiPathways have DataNodes with labels containing IUPAC names which can be tagged as type Metabolite.

Start date: 2014-09-01 End date: 2014-09-01

Description: WikiPathways entries in GPML have DataNode objects and Label objects. It was found before [here] that metabolites can be encoded in pathways as Label objects and are therefore not machine-readable as Metabolite-type DataNodes and cannot have a database identifier. As such, these metabolites are unusable for pathway analysis of metabolomics data.

By processing these GPML files (they are XML-based) and iterating over all Labels, we can attempt to convert each label into a chemical structure with OPSIN. This works under the assumption that if OPSIN can parse the label into a structure, it is one. Such a label will be recorded along with the pathway identifier for manual inspection. For each structure, a ChemSpider identifier will also be looked up.

Methods

  • Download the GPML files from WikiPathways
  • Get a working Bioclipse development version (hard) with the OPSIN, InChI, and ChemSpider extensions
  • A Groovy script to iterate over the GPML, find <Label> elements
  • Each <Label> is parsed with OPSIN and if successful, generate an InChI
  • Use the InChIs to find ChemSpider identifiers
  • Output all as a text file and open metabolites in a Structure table

Report 

metabolitesHuman

Similar to the experiment for Anopheles gambiae, only curated pathways were analyzed, some 266 in total, downloaded from WikiPathways.org on August 24. The previous Groovy script was updated to point to the human pathways, but also to write the results to a file rather than to STDOUT. The new script was uploaded to myExperiment.org.

The script found 42 Labels that are possibly metabolites. The full list was uploaded to Gist. Again, labels were found which could not be linked to a single ChemSpider ID. For example, “5b-Pregnane-3,20-dione” results in these ChemSpider search hits: 21427590, 389575, 21232692, 21239075, 21237402. The result file also shows a few labels with new lines.

One metabolite was manually confirmed in WP1449: Imidazoquinolin. Interestingly, the Label was visually “connected” with “(anti-viral compounds)”, which has a ChEBI identifier and could be converted to a DataNode of type Metabolite too:

metabolitesHuman1

Most work, however, needs to be done in the Tryptophan metabolism pathway (WP465); many metabolites are not properly made machine readable.

Conclusion:

Human pathways also include metabolites encoded in GPML <Label> elements, even in the curated subset.

EW4: Finding nodes in Anopheles gambiae pathways with IUPAC names

Hypothesis: Anopheles gambiae pathways in WikiPathways have DataNodes with labels containing IUPAC names which can be tagged as type Metabolite.

Start date: 2014-08-24 End date: 2014-08-24

Description: WikiPathways entries in GPML have DataNode objects and Label objects. It was found before [not published] that metabolites can be encoded in pathways as Label objects and are therefore not machine-readable as Metabolite-type DataNodes and cannot have a database identifier. As such, these metabolites are unusable for pathway analysis of metabolomics data.

By processing these GPML files (they are XML-based) and iterating over all Labels, we can attempt to convert each label into a chemical structure with OPSIN. This works under the assumption that if OPSIN can parse the label into a structure, it is one. Such a label will be recorded along with the pathway identifier for manual inspection. For each structure, a ChemSpider identifier will also be looked up.

Methods

  • Download the GPML files from WikiPathways
  • Get a working Bioclipse development version (hard) with the OPSIN, InChI, and ChemSpider extensions
  • A Groovy script to iterate over the GPML, find <Label> elements
  • Each <Label> is parsed with OPSIN and if successful, generate an InChI
  • Use the InChIs to find ChemSpider identifiers
  • Output all as a text file and open metabolites in a Structure table

Report 

anophelesMetabolites

Twelve WikiPathways for Anopheles gambiae were downloaded as part of the analysis collection. In the future, uncurated pathways can also be included; these are expected to have more metabolites not annotated as Metabolite type. A custom Groovy script for Bioclipse was used, based on a previous similar script available from myExperiment.org. The updated script has been made available on myExperiment.org too. The results of running this script are visible in the above screenshot.

Key calls to Bioclipse managers used in this script, in addition to using the Groovy XMLParser, are:

  • cdk.createMoleculeList()
  • opsin.parseIUPACName(name)
  • structureList.add(molecule)
  • inchi.generate(molecule)
  • chemspider.resolve(inchiKey)
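The Label-scanning part of this protocol can be sketched without Bioclipse. This is a rough illustration, not the Groovy script that was actually used: it swaps the Groovy XMLParser for regular expressions to stay dependency-free, assumes the label text is in a TextLabel attribute and the node id in a GraphId attribute, and takes the name parser as a function argument that stands in for opsin.parseIUPACName(name). The sample GPML fragment and the toy parser are made up.

```javascript
// Extract { graphId, text } for every <Label> element in a GPML string.
// Assumes attributes named TextLabel and GraphId on the <Label> start tag.
function extractLabels(gpml) {
  const labels = [];
  const tagPattern = /<Label\b[^>]*>/g;
  let match;
  while ((match = tagPattern.exec(gpml)) !== null) {
    const tag = match[0];
    const text = /\bTextLabel="([^"]*)"/.exec(tag);
    const id = /\bGraphId="([^"]*)"/.exec(tag);
    if (text) labels.push({ graphId: id ? id[1] : null, text: text[1] });
  }
  return labels;
}

// Keep only the labels that the name parser accepts. parseName stands in for
// the OPSIN call: it returns null or throws for names it cannot parse.
function findCandidateMetabolites(gpml, parseName) {
  return extractLabels(gpml).filter((label) => {
    try {
      return parseName(label.text) != null;
    } catch (e) {
      return false;
    }
  });
}

// Toy stand-in for the parser (hypothetical): it only "knows" two names.
const knownNames = new Set(["Serine", "Glycine"]);
const toyParser = (name) => (knownNames.has(name) ? { name } : null);

// Made-up GPML fragment for illustration.
const sampleGpml = `
<Pathway>
  <Label TextLabel="Serine" GraphId="b93"><Graphics/></Label>
  <Label TextLabel="One carbon pool" GraphId="aaa"/>
  <DataNode TextLabel="Glycine" GraphId="ccc" Type="GeneProduct"/>
</Pathway>`;

console.log(findCandidateMetabolites(sampleGpml, toyParser));
```

Only the Label elements are inspected; the Glycine DataNode in the fragment is skipped because it is already a DataNode.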

Four metabolites were found, in one pathway (WP1230):

Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node b93 -> Serine -> MTCFGRXMJLQNBG-UHFFFAOYSA-N -> CSID: [597]
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node ff7 -> Glycine -> DHMQDGOQFOQNFH-UHFFFAOYSA-N -> CSID: [730]
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node c8c -> Deoxythymidine monophosphate -> WVNRRNJFRREKAR-UHFFFAOYSA-N -> CSID: [315142]
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node a47 -> Deoxyuridine monophosphate -> JSRLJPSBLDHEIO-UHFFFAOYSA-N -> CSID: [21537275, 668, 21230588]

Three metabolites have a single ChemSpider identifier, whereas one has three ChemSpider identifiers.
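The output lines above are regular enough to post-process. As a small sketch (the function name and the regular expression are my own, not part of the original script), the following parses each line and separates the labels with exactly one ChemSpider identifier from those with several:

```javascript
// Parse one result line of the form
//   <file>: node <id> -> <label> -> <InChIKey> -> CSID: [<id>, <id>, ...]
function parseResultLine(line) {
  const m = /^(\S+): node (\S+) -> (.+) -> (\S+) -> CSID: \[(.*)\]$/.exec(line.trim());
  if (!m) return null;
  const csids = m[5].split(",").map((s) => s.trim()).filter((s) => s.length > 0);
  return { file: m[1], node: m[2], label: m[3], inchiKey: m[4], csids };
}

// The four result lines reported above.
const resultLines = [
  "Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node b93 -> Serine -> MTCFGRXMJLQNBG-UHFFFAOYSA-N -> CSID: [597]",
  "Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node ff7 -> Glycine -> DHMQDGOQFOQNFH-UHFFFAOYSA-N -> CSID: [730]",
  "Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node c8c -> Deoxythymidine monophosphate -> WVNRRNJFRREKAR-UHFFFAOYSA-N -> CSID: [315142]",
  "Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node a47 -> Deoxyuridine monophosphate -> JSRLJPSBLDHEIO-UHFFFAOYSA-N -> CSID: [21537275, 668, 21230588]",
];

const parsedResults = resultLines.map(parseResultLine);
const unambiguous = parsedResults.filter((r) => r.csids.length === 1);
const ambiguous = parsedResults.filter((r) => r.csids.length > 1);

console.log(unambiguous.map((r) => r.label));
console.log(ambiguous.map((r) => r.label));
```

The ambiguous labels are the ones that need manual inspection before a single database identifier can be assigned.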

Visual inspection of WP1230 (revision 68447) confirms our hypothesis:

anophelesMetabolites1

Conclusion: Anopheles gambiae pathways indeed also include metabolites encoded in GPML <Label> elements.

EW3: Exposing more Jena functionality in Bioclipse

Hypothesis: Jena functionality for triple set comparison can be exposed via Bioclipse script

Start date: 2014-08-20 End date: 2014-08-20

Description: Bioclipse in the development branch mostly uses Jena for handling RDF data. The “rdf” manager already exposes various features of Jena. Here we wish to expose the functionality of Jena to make unions, intersections, and differences of two triple stores (“Model”s in Jena terms) and to use the experimental shortest path method from the OntTools class.

Methods

  • start with an Eclipse development environment including the bioclipse.cheminformatics repository
  • define additional methods in the IRDFManager interface with the proper code
  • write implementations of these methods in the RDFManager class
  • publish the patches

Report

While I still had a development environment from this step, Bioclipse no longer booted properly. Between that experiment and this one, various things happened:

  1. Bioclipse had a new target platform
  2. I moved to a 64bit operating system
  3. I only reinstalled Java8

For resetting the target platform the normal protocol was used, though I had to repeat it a few times to get it fully working. As usual, I first had to ask Arvid in Uppsala before it really started working (#overlyhonestmethods). There may have been a confounding issue with not having the proper javax.xml.soap version in my installation; Arvid suggested manually removing the java.xml plugin from the target platform, via the Content tab.

A further issue was found in using Java8, which has a different provider for the JavaScript extension. As a result, the Bioclipse JavaScript console did not start. Apparently, my Java 8_11 installation in Eclipse does not provide any scripting environment (tested by asking the ScriptEngineManager for all engines; none were reported). Because the nashorn.jar that contains the implementation is provided as a separate jar, containing the open source Nashorn JavaScript engine (the successor to Mozilla's Rhino) that is now provided via OpenJDK, I could include this jar in the Bioclipse plugin, solving these issues. Along with a few other patches, these tweaks are available in this branch on GitHub. These patches are not pushed for inclusion in the Bioclipse development branch.

The test suite was not extended and not run as “JUnit Plug-in Test” using Eclipse, because my development environment is not able to properly run these at this moment. Instead, the functionality was tested using the rdf manager from the JavaScript console with this script:

// first store: two object properties on the same subject
store = rdf.createInMemoryStore();

rdf.addObjectProperty(store,
  "http://example.com/#subject",
  "http://example.com/#predicate",
  "http://example.com/#object"
);
rdf.addObjectProperty(store,
  "http://example.com/#subject",
  "http://example.com/#predicate",
  "http://example.com/#object2"
);
// second store: shares one triple with the first and adds a data property
secondStore = rdf.createInMemoryStore();
rdf.addObjectProperty(secondStore,
  "http://example.com/#subject",
  "http://example.com/#predicate",
  "http://example.com/#object"
);
rdf.addDataProperty(secondStore,
  "http://example.com/#subject",
  "http://example.com/#predicate",
  "someDataObject"
);

// the three newly exposed set operations
unionStore = rdf.union(store, secondStore);
diffStore = rdf.difference(store, secondStore);
intersectStore = rdf.intersection(store, secondStore);
// serialize one of the results for inspection
rdf.asTurtle(diffStore);

This showed the expected results, with the exception that the Jena code makes default triples more visible. That is, converting the store to Turtle shows two triples, even though it has some 39 additional triples from the RDF and RDF Schema specifications. Weirdly, when making a union of store and secondStore, the number of triples increases to about 150, and converting this to Turtle does serialize all those RDF and RDF Schema triples. I have been unable to work around this feature.
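For reference, the results one would expect from the test script can be worked out with plain set operations. The sketch below is not the Jena or Bioclipse implementation: it represents a store as an array of [subject, predicate, object] strings (literals written with their quotes), ignores the default RDF and RDF Schema triples discussed above, and uses function names of my own.

```javascript
// Use the JSON form of a triple as its identity.
const tripleKey = (triple) => JSON.stringify(triple);
const keySet = (store) => new Set(store.map(tripleKey));

// Triples in either store, without duplicates.
function unionOf(a, b) {
  const seen = new Set();
  return [...a, ...b].filter((t) => {
    const k = tripleKey(t);
    if (seen.has(k)) return false;
    seen.add(k);
    return true;
  });
}

// Triples in a that are not in b.
function differenceOf(a, b) {
  const inB = keySet(b);
  return a.filter((t) => !inB.has(tripleKey(t)));
}

// Triples present in both stores.
function intersectionOf(a, b) {
  const inB = keySet(b);
  return a.filter((t) => inB.has(tripleKey(t)));
}

const S = "http://example.com/#subject";
const P = "http://example.com/#predicate";
const firstStore = [
  [S, P, "http://example.com/#object"],
  [S, P, "http://example.com/#object2"],
];
const otherStore = [
  [S, P, "http://example.com/#object"],
  [S, P, '"someDataObject"'],
];

console.log(unionOf(firstStore, otherStore).length);
console.log(differenceOf(firstStore, otherStore));
console.log(intersectionOf(firstStore, otherStore));
```

With these inputs the union has three triples, the difference keeps only the #object2 triple, and the intersection keeps only the shared #object triple.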

The above three methods are easily wrapped, but the shortest path functionality requires an additional step: the OntTools return value is a toolkit-specific type (Path) and the rdf manager was designed to convert this to a java.util.List of Strings. This functionality too was tested via the JavaScript console:

// a chain of two triples: subject -> subject2 -> subject3
store = rdf.createInMemoryStore();

rdf.addObjectProperty(store,
  "http://example.com/#subject",
  "http://example.com/#predicate",
  "http://example.com/#subject2"
);
rdf.addObjectProperty(store,
  "http://example.com/#subject2",
  "http://example.com/#predicate",
  "http://example.com/#subject3"
);

// ask for the shortest path between the two ends of the chain
rdf.shortestPath(store,
  "http://example.com/#subject",
  "http://example.com/#subject3"
);

At this moment the manager provides two variants of this shortestPath() method: the version exemplified above and one that takes a fourth parameter, a String representation of a URI matching the only predicate that can be part of the path. Both methods were found to work as expected based on the above code. No application to larger data sets has been tried.
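To illustrate what the two variants compute, here is a plain breadth-first search over an array of [subject, predicate, object] triples. It is a sketch under assumptions, not the Jena algorithm or the manager code: returning the list of visited node URIs is my own choice (the text above only says the result is a list of Strings), and the function name is hypothetical.

```javascript
// Shortest directed path from start to end, following triples subject -> object.
// If predicate is given, only triples with that predicate may be part of the path.
// Returns the node URIs along the path, or an empty array if there is none.
function shortestPathSketch(triples, start, end, predicate) {
  const usable = predicate ? triples.filter(([, p]) => p === predicate) : triples;
  const previous = new Map([[start, null]]);
  const queue = [start];
  while (queue.length > 0) {
    const node = queue.shift();
    if (node === end) {
      // Walk back from the end node to the start node.
      const path = [];
      for (let n = end; n !== null; n = previous.get(n)) path.unshift(n);
      return path;
    }
    for (const [s, , o] of usable) {
      if (s === node && !previous.has(o)) {
        previous.set(o, node);
        queue.push(o);
      }
    }
  }
  return [];
}

const pred = "http://example.com/#predicate";
const chain = [
  ["http://example.com/#subject", pred, "http://example.com/#subject2"],
  ["http://example.com/#subject2", pred, "http://example.com/#subject3"],
];

console.log(shortestPathSketch(chain, "http://example.com/#subject", "http://example.com/#subject3"));
console.log(shortestPathSketch(chain, "http://example.com/#subject", "http://example.com/#subject3", "http://example.com/#other"));
```

The second call restricts the path to a predicate that does not occur in the data, so no path is found.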

Resulting patches have been provided as a pull request.

Conclusion: Exposing the additional functionality yielded a more functional rdf manager with interesting new features.

EW1: Updating Bioclipse with OPSIN 1.6.0

Hypothesis: Bioclipse works just as well with OPSIN 1.6.0 as it does with 1.5.0.

Start date: 2014-07-20 End date: 2014-07-21

Description: Bioclipse in the development branch has OPSIN 1.5.0 exposed with the opsin manager. The intention of this experiment is to update Bioclipse with OPSIN 1.6.0, keeping the opsin manager methods working.

Methods

  • start with an Eclipse development environment including the bioclipse.cheminformatics repository
  • update the OPSIN version
  • test with the test suite
  • publish the patches

Report

I still had a working development environment around. As I installed Eclipse 4.4 a few days earlier, I opened the Eclipse workspace with this version, which triggered an irreversible upgrade of the workspace so that I cannot return to Eclipse 4.3. The test suite was run as “JUnit Plug-in Test” using Eclipse, defined by the AllOpsinManagerPluginTests class. This showed two failures in the APITest test class, related to the @TestClass and @TestMethod annotations. The annotations were added and committed as a patch, so that no failures were reported.

Then the opsin-1.6.0-excludingInChI-jar-with-dependencies.jar was downloaded from the OPSIN download page. This version was selected because the 1.5.0 version excluded the InChI bits too and these are already available from other Bioclipse plugins. The new jar was copied into the net.bioclipse.opsin’s jar/ folder and .classpath, MANIFEST.MF, and build.properties were updated accordingly.

The result was successfully tested using the aforementioned AllOpsinManagerPluginTests class and by running Bioclipse itself and using the opsin manager from the JavaScript console with the command ui.open(opsin.parseIUPACName("benzene")).

The two patches were made available as pull request 46.

Bioclipse with OPSIN 1.6.0

Conclusion:

No special updates were needed and Bioclipse works with OPSIN 1.6.0 just as it did with OPSIN 1.5.0.

Pathway analysis for Malaria research

A recurrent theme in my blog is that an easy way to support Open Science is to just join the show. You do not have to contribute a lot to have some impact. Of course, sometimes what you do has more impact than other times. Sometimes something with initially little impact gets high impact later. This is about as hard to predict as the stock exchange. In the past I have contributed effort to many Open projects, often small bits; some things never get noticed (like my Ant man page in Debian, which is more than 10 years old :).

One project I have long wanted to contribute to is the Open Source Malaria project, which is brilliantly led by Matt Todd. I had three principal ideas:

  1. use Bioclipse to run the Decision Support against the OSM compounds
  2. do pathway analysis on malaria data
  3. use the AMBIT-JS to put all the OSM compounds online as a HTML page
The first and third I still have not gotten around to finishing. The first is a very simple way for you to contribute. The key question here is just to see how the compounds can be made less toxic or given fewer side effects. Bioclipse can visualize this easily, based on various toxicity models, among them all those from OpenTox. Really, a four-hour job.

PCA results from arrayanalysis.org for the four sample groups.
The other task is more difficult, and I am really happy that Patricia Zaandam started a ten-week internship with me to work on this task. She has been blogging her progress, and I strongly invite you to read her blog and comment (ask questions, post ideas, give criticism), as Open projects are driven by Open communication. Because WikiPathways has most pathways for human, Patricia is looking at human expression data. In five weeks' time, she did the preprocessing of the raw data using arrayanalysis.org and did the pathway analysis using PathVisio, resulting in this shortlist of pathways. And now the hard part starts: biological and methodological validation of her approach.

There is plenty of room for feedback. I am not at all a malaria expert, and I am learning a lot from her study. Some questions on which we welcome expert input (as independent test set validation, so to say):
  • what key pathways and genes do we expect to see for treated-versus-ill malaria patients
  • what transcriptomics/proteomics/metabolomics data would you like us to consider too
Etc, etc...

Revisited: Handling SD files with JavaScript in Bioclipse

After asking on the Bioclipse users list, it turns out there was an unpublished manager method to trigger parsing of the SDF properties (Arvid++). This simplifies creation of the index and removes the need to parse the chemical structures into a CDK molecule model.

That simplifies my earlier code to:

  // index the HMDB SD file without parsing the structures
  hmdbIndex =
    molTable.createSDFIndex(
      "/WikiPathways/hmdb.sdf"
    );
  // only parse the SD property we need
  props = new java.util.HashSet();
  props.add("HMDB_ID");
  molTable.parseProperties(hmdbIndex, props);
  
  // map each HMDB identifier to its record number in the SD file
  idIndex = new java.util.HashMap();
  molCount = hmdbIndex.getNumberOfMolecules();
  for (i=0; i<molCount; i++) {
    hmdbID = hmdbIndex.getPropertyFor(i, "HMDB_ID");
    idIndex.put(hmdbID, i);
  }

The next step in my use case is to process some input (WikiPathways GPML files, to be precise), detect which HMDB identifier is used, extract the SD file entry for that identifier, and append it to a new SD file (using a new ui.append() method):

  // look up the record number for this HMDB identifier and get the raw SD entry
  hmdbCounter = idIndex.get(idStr);
  sdEntry = hmdbIndex.getRecord(hmdbCounter);
  // keep only the molfile part, dropping the original SD fields
  sdEntry = sdEntry.substring(0, sdEntry.indexOf("M  END"));
  ui.append("/WikiPathways/db.sdf", sdEntry);
  ui.append("/WikiPathways/db.sdf", "M  END\n");
  // add the new internal identifier as the only SD field
  ui.append("/WikiPathways/db.sdf", "> <WPM>\n");
  ui.append("/WikiPathways/db.sdf", "WPM" + (Integer.toString(wpmId)).substring(1) + "\n");
  ui.append("/WikiPathways/db.sdf", "\n");
  ui.append("/WikiPathways/db.sdf", "\$\$\$\$\n");

This code actually does a bit more than copying the SD file entry: it also removes all previous SD fields and replaces them with a new, internal identifier. Using that identifier, I track some metadata on this metabolite.
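The record rewriting can also be expressed as one pure function that returns the new SD entry as a string, which makes it easy to check outside Bioclipse. This is a sketch with a hypothetical function name. It mirrors the substring(1) trick from the code above, which I assume relies on the counter starting at a value such as 100001, so that dropping the first digit leaves a zero-padded number. The sample entry is made up.

```javascript
// Build a new SD record: the molfile part of sdEntry followed by a single WPM field.
// Assumes wpmId is a counter like 100001, so substring(1) yields "00001".
function toWikiPathwaysRecord(sdEntry, wpmId) {
  const end = sdEntry.indexOf("M  END");
  if (end < 0) throw new Error("no 'M  END' line found in SD entry");
  const molfile = sdEntry.substring(0, end); // drops all original SD fields
  const id = "WPM" + String(wpmId).substring(1);
  return molfile + "M  END\n> <WPM>\n" + id + "\n\n$$$$\n";
}

// Made-up, structure-less SD entry for illustration.
const sampleEntry =
  "demo\n  sketch\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n" +
  "> <HMDB_ID>\nHMDB00001\n\n$$$$\n";

console.log(toWikiPathwaysRecord(sampleEntry, 100001));
```

Returning a string instead of appending to a file directly keeps the file handling (ui.append() in Bioclipse) separate from the text manipulation.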

Now, there are a million ways of implementing this workflow. If you really want to know, I chose this one because HMDB identifiers are among the more prominent IDs used in WikiPathways, and for these, as well as for ChEBI, I can use an SD file. For ChemSpider and PubChem identifiers, however, I plan to use the matching Bioclipse client code to pull in MDL molfiles. Bioclipse has functionality for all these needs available as extensions.

Handling SD files with JavaScript in Bioclipse

I finally got around to continuing with a task to create an SD file for WikiPathways. The problem is more finding the time than doing it, and the tasks are basically:
  1. iterate over all metabolites in the GPML files
  2. extract the Xref's database and database identifier (see previous link)
  3. extract the molfile from the database SD file
  4. give the WikiPathways metabolite a unique identifier
  5. record that the WikiPathways metabolite has a molfile
  6. append that molfile along with the new WikiPathways metabolite ID to a new SD file
It turns out that Uppsala's excellent SD functionality in Bioclipse (using indexing, it opens 2 GB SD files for me) is also available from the JavaScript command line:

  hmdbIndex = molTable.createSDFIndex(
    "/WikiPathways/hmdb.sdf"
  );
  
  idIndex = new java.util.HashMap();
  molCount = hmdbIndex.getNumberOfMolecules();
  for (i=0; i<molCount; i++) {
    mol = hmdbIndex.getMoleculeAt(i);
    if (mol != null) {
      hmdbID = mol.getAtomContainer().getProperty(
        "HMDB_ID"
      );
      idIndex.put(hmdbID, i);
    }
  }

Using this approach, I can create an index by HMDB identifier of molfiles in the HMDB SD file, extract just those molfiles which are found in WikiPathways, and create a new WikiPathways-dedicated SD file. When I have the HMDB identifiers done, ChEBI, PubChem, and ChemSpider will follow.
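The idea of such an identifier index can be sketched in plain JavaScript, outside Bioclipse. This is an illustration only: it reads the whole SD text into memory, which would not scale to the 2 GB files that the Bioclipse indexing handles, and the function name and the two-record sample are made up. It assumes the property name contains no characters that are special in a regular expression.

```javascript
// Map the value of one SD property to the position of its record in the SD text.
function indexByProperty(sdText, property) {
  const index = new Map();
  // Records are separated by a line consisting of "$$$$".
  const records = sdText
    .split(/^\$\$\$\$\r?\n?/m)
    .filter((record) => record.trim().length > 0);
  // A data header line such as "> <HMDB_ID>" is followed by the value line.
  const field = new RegExp("^> +<" + property + ">.*\\r?\\n(.*)$", "m");
  records.forEach((record, i) => {
    const m = field.exec(record);
    if (m) index.set(m[1].trim(), i);
  });
  return index;
}

// Made-up, structure-less sample with two records.
const emptyMol = ["  sketch", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END"];
const sampleSd = [
  "mol1", ...emptyMol, "> <HMDB_ID>", "HMDB00001", "", "$$$$",
  "mol2", ...emptyMol, "> <HMDB_ID>", "HMDB00002", "", "$$$$",
].join("\n") + "\n";

const sampleIndex = indexByProperty(sampleSd, "HMDB_ID");
console.log(sampleIndex.get("HMDB00002"));
```

Looking up an identifier then gives the record position, which plays the same role as the counter stored in idIndex above.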