Bioclipse 2.6.2 with recent hacks #2: reading content from online Google Spreadsheets

Update 2015-06-04: the authentication with Google Drive has changed; I need to update the code and am afraid I am missing something, so the code below is not working right now :(

Similar to the previous post in this new series, this post will outline how to make use of the Google Spreadsheet functionality in Bioclipse 2.6.2. But before I provide the steps needed to install the functionality, first consider this Bioclipse JavaScript:

  google.setUserCredentials(
    "your.account", "16charpassword"
  )
  google.listSpreadsheets()
  google.listWorksheets(
    "ORCID @ Maastricht University"
  )
  data = google.loadWorksheet(
    "ORCID @ Maastricht University",
    "with works"
  )

Because that's what this functionality does: read data from Google Spreadsheets. That opens up an integration of Google Spreadsheets with your regular data analysis workflows. I am not sure if Bioclipse is the only tool that embeds the Google client code to access these services, and can imagine similar functionality is available from R, Taverna, and KNIME.

Getting your credentials
The first call to the google manager requires your login details. But don't use your regular password: you need an application password. This specific, sixteen-character password needs to be manually created in your web browser, following this link. Create a new App password (”Other (Customized name)”) and use this password in Bioclipse.

Installing Bioclipse 2.6.2 and the Google Spreadsheet functionality
The first thing you need to do (unless you already did that, of course) is install Bioclipse 2.6.2 (the beta) and enable the advanced mode. This is outlined in my previous post, up to Step 1. The update site, obviously, is different, and in Step 2 of that post you should use:

  1. Name: Open Notebook Science Update Site
  2. Location: http://pele.farmbio.uu.se/jenkins/job/Bioclipse.ons/lastSuccessfulBuild/artifact/buckminster.output/net.bioclipse.ons_site_1.0.0-eclipse.feature/site.p2/
Yes, the links only seem to get longer and longer. Just continue to the next step and install the Google Feature:


That's it, have fun!

Oh, and this hack is not so recent. The first version of the net.bioclipse.google plugin and matching manager, as used in the above code, dates back to January 2011, when I had just started at the Karolinska Institutet. But the code to download data from spreadsheets is even older, and goes back to 2008, when I worked with Cameron Neylon and Pierre Lindenbaum on creating RDF for data being collected by Jean-Claude Bradley. If you're interested, check the repository history and this book chapter.

Dr. J. Alvarsson: Bioclipse 2, signature fingerprints, and chemometrics

Last Friday I attended the PhD defense of, now, Dr. Jonathan Alvarsson (Dept. Pharmaceutical Biosciences, Uppsala University), who defended his thesis Ligand-based Methods for Data Management and Modelling (PDF). Key papers resulting from his work (see the list below) include one about Bioclipse 2, particularly covering his work on pluggable managers that enrich scripting languages (JavaScript, Python, Groovy) with domain-specific functionality, which I make frequent use of (doi:10.1186/1471-2105-10-397); a paper about Brunn, a LIMS system for microplates, which is based on Bioclipse 2 (doi:10.1186/1471-2105-12-179); and a set of chemometrics papers looking at scaling up pattern recognition via QSAR model building (e.g. doi:10.1021/ci500361u). He is also an author on several other papers, and we collaborated on several of them, so you will find his name in several more. Check his Google Scholar profile.

In Sweden there is one key opponent, though further questions can be asked by a few other committee members. John Overington (formerly of ChEMBL) was the opponent and he asked Jonathan questions for at least an hour, going through the thesis. Of course, I don't remember most of it, but there were a few that I remember and want to bring up. One issue was about the uptake of Bioclipse by the community, and, for example, how large the community is. The answer is that this is hard to answer; there are download statistics and there is actual use.

Download statistics of the Bioclipse 2.6.1 release.
Like citation statistics (the Bioclipse 1 paper was cited close to 100 times, Bioclipse 2 is approaching 40 citations), download statistics reflect this uptake, but are hardly direct measurements. When I first learned about Bioclipse, I realized that it could be a game changer. But it did not become one. I still don't quite understand why not. It looks good, is very powerful, very easy to extend (which I still make a lot of use of), it is fairly easy to install (download 2.6.1 or the 2.6.2 beta), etc. And it resulted in a large set of applications; just check the list of papers.

One argument could be that it is yet another tool to install, and developers are turning to web-based solutions. Moreover, the cheminformatics community has many alternatives, and users seem to prefer smaller, more dedicated tools: a file format converter, like Open Babel, or a dedicated descriptor calculator, like PaDEL. Simpler messages seem more successful; this is expected in politics, but I guess science is more like politics than we like to believe.

A second question I remember was about what Jonathan would like to see changed in ChEMBL, the database Overington has worked on for so long. As a data analyst you are in a different mind set: rather than thinking about single molecules, you think about classes of compounds, and rather than thinking about the specific interaction of a drug with a protein, you think about the general underlying chemical phenomenon. A question like this one requires a different kind of thinking: it needs one to think like an analytical chemist, who worries about the underlying experiments. Obvious, but easy to lose sight of once thinking at a higher (different) level. That experimental error information in ChEMBL can actually support modelling is something we showed using Bayesian statistics (well, Martin Eklund particularly) in Linking the Resource Description Framework to cheminformatics and proteochemometrics (doi:10.1186/2041-1480-2-S1-S6), by including the assay confidence assigned by the ChEMBL curation team. If John had asked me, I would have said I wanted ChEMBL to capture as much of the experimental details as possible.

Integration of RDF technologies in Bioclipse. Alvarsson worked on the integration of the RDF editor in Bioclipse.
The screenshot shows that if you click an RDF resource reflecting a molecule, it will show the structure (if there is a
predicate providing the SMILES) and information from predicates in general.
The last question I want to discuss was about the number of rotatable bonds in paracetamol. If you look at this structure, you would identify four purely σ bonds (BTW, can you have π bonds without σ bonds?). So, four could be the expected answer. You can argue that the peptide bond should not be considered rotatable, and should be excluded, and thus the answer would be two. Now, the CDK answers two, as shown in an example of descriptor calculation in the thesis. I raised my eyebrows, and thought: "I surely hope this is not a bug!". (Well, my thoughts used some more words, which I will not repeat here.)

But thinking about that, I valued the idea of Open Source: I could just check. So I took my tablet from my bag, opened up a browser, went to GitHub, and looked up the source code. It turned out it was not a bug! Phew. No, in fact, it turned out that the default parameters of this descriptor exclude the terminal rotatable bonds:


So, problem solved. Two follow-up questions, though: 1. can you look up source code during a thesis defense? Jonathan had his laptop right in front of him. I only thought of that yesterday, when I was back home, having dinner with the family. 2. I wonder if I should discuss the idea of parameterizable descriptors more; what do you think? There is a lot of confusion about this design decision in the CDK. For example, it is not uncommon that the CDK only calculates some two hundred descriptor values, whereas tool X calculates more than a thousand. Mmmm, that always makes me question the quality of that paper in general, but, oh well...
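To illustrate what a parameterizable descriptor means in practice, here is a toy sketch in plain JavaScript (this is not CDK code; the bond representation and all names are made up for illustration): the same rotatable-bond-count descriptor returns different values depending on an includeTerminals-style parameter.

```javascript
// Toy illustration (not actual CDK code) of a parameterizable descriptor:
// the same rotatable-bond-count descriptor gives a different value
// depending on its includeTerminals parameter, which is one reason
// descriptor counts are hard to compare across toolkits.
function rotatableBondCount(bonds, includeTerminals) {
  return bonds.filter(function (b) {
    return b.single && !b.inRing && !b.amide &&
      (includeTerminals || !b.terminal);
  }).length;
}

// Toy bond list: one terminal single bond, one amide bond,
// one plain rotatable bond, and one ring bond.
var bonds = [
  { single: true, inRing: false, amide: false, terminal: true },
  { single: true, inRing: false, amide: true,  terminal: false },
  { single: true, inRing: false, amide: false, terminal: false },
  { single: true, inRing: true,  amide: false, terminal: false }
];
```

With this toy bond list, rotatableBondCount(bonds, false) gives 1 and rotatableBondCount(bonds, true) gives 2: same molecule representation, different descriptor values, purely from the parameter default.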

There also was a nice discussion about chemometrics. Jonathan argues in his thesis that a fast modeling method may, at this moment, be a better way forward than more powerful statistical methods. He presented results with LIBLINEAR and signature fingerprints, comparing them to other approaches. The fingerprints were compared with industry standards, like ECFP (which Clark and Ekins implemented for the CDK and have been using in Bayesian statistics on the mobile phone), and for LIBLINEAR Jonathan showed that it can handle more data than regular SVM libraries, and that using more training data still improves the model more than a "better" statistical method does (which quite matches my own experiences). And with SVMs, finding the right parameters typically is an issue; using an RBF kernel only adds one more. Since Jonathan also indicated that the Tanimoto distance measure for fingerprints is still a more than sufficient approach, I wonder if the chemometrics models should not be using a Tanimoto kernel instead of an RBF kernel (though doi:10.1021/ci800441c suggests RBF may really do better for some tasks, at the expense of more parameter optimization).
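As a reminder of what such a kernel would compute, here is a minimal sketch (plain JavaScript, not Bioclipse or CDK API) of the Tanimoto coefficient on binary fingerprints, represented here as arrays of unique set-bit positions; a Tanimoto kernel would simply use this value as k(A, B) in place of the RBF kernel.

```javascript
// Minimal sketch of the Tanimoto coefficient on binary fingerprints,
// each fingerprint given as an array of unique set-bit positions:
// Tanimoto(A, B) = |A ∩ B| / |A ∪ B|.
function tanimoto(a, b) {
  var setA = {}, common = 0;
  for (var i = 0; i < a.length; i++) setA[a[i]] = true;
  for (var j = 0; j < b.length; j++) {
    if (setA[b[j]]) common++;             // bits set in both fingerprints
  }
  var union = a.length + b.length - common;
  return union === 0 ? 1 : common / union; // two empty fingerprints: identical
}
```

For example, tanimoto([1, 4, 7, 9], [1, 4, 8]) gives 2 shared bits out of 5 in the union, i.e. 0.4.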

To wrap up, I really enjoyed working with Jonathan a lot and I think he did excellent multidisciplinary work. I am also happy that I was able to attend his defense and the events around it. In no way does this post do justice to or fully reflect the defense; it merely reflects how relevant his research is in my opinion, and highlights some of my thoughts during (and after) the defense.

Jonathan, congratulations!

Spjuth, O., Alvarsson, J., Berg, A., Eklund, M., Kuhn, S., Mäsak, C., Torrance, G., Wagener, J., Willighagen, E. L., Steinbeck, C., Wikberg, J. E., Dec. 2009. Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 10 (1), 397+.
Alvarsson, J., Andersson, C., Spjuth, O., Larsson, R., Wikberg, J. E. S., May 2011. Brunn: An open source laboratory information system for microplates with a graphical plate layout design process. BMC Bioinformatics 12 (1), 179+.
Alvarsson, J., Eklund, M., Engkvist, O., Spjuth, O., Carlsson, L., Wikberg, J. E. S., Noeske, T., Oct. 2014. Ligand-Based target prediction with signature fingerprints. J. Chem. Inf. Model. 54 (10), 2647-2653.

Bioclipse 2.6.2 with recent hacks #3: using functionality from the OWLAPI

Update: if you had problems installing this feature, please try again. Two annoying issues have been fixed now.

Third in this series is this post about the Bioclipse plugin I wrote for the OWLAPI library. I wrote this manager to learn how the OWLAPI works. The OWLAPI feature is available from the same update site as the Linked Data Fragments feature, so you can just follow the steps outlined here (if you had not already).

Using the manager
The manager has various methods, for example, for loading an OWL ontology:

ontology = owlapi.load(
  "/eNanoMapper/enanomapper.owl", null
);

If your ontology imports other ontologies, you may need to tell the OWLAPI first where to find those, by defining mappings. For example, I could do this before making the above call:

mapper = null; // initially no mapper
mapper = owlapi.addMapping(mapper,
  "http://purl.bioontology.org/ontology/npo",
  "/eNanoMapper/npo-asserted.owl"
);
mapper = owlapi.addMapping(mapper,
  "http://www.enanomapper.net/ontologies/" + 
  "external/ontology-metadata-slim.owl",
  "/eNanoMapper/ontology-metadata-slim.owl"
)

I can list which ontologies have been imported with:

imported = owlapi.getImportedOntologies(ontology)
for (var i = 0; i < imported.size(); i++) {
  js.say(
    imported.get(i).getOntologyID().getOntologyIRI()
  )
}

When the ontology is successfully loaded, I can list the classes and various types of properties:

classes = owlapi.getClasses(ontology)
annotProps = owlapi.getAnnotationProperties(ontology)
declaredProps = owlapi.getPropertyDeclarationAxioms(ontology)

Some further functionality likely needs adding, and I would love to hear which methods you would like to see added.

CDK Literature #6

Originally a series I started in CDK News, later for some issues part of this blog, and then for some time on Google+, CDK Literature is now returning to my blog. BTW, I created a poll about whether CDK News should be picked up again. The reason we stopped was that we were not getting enough submissions anymore.

For those who are not familiar with the CDK Literature series, the posts discuss recent literature that cites one of the two CDK papers (the first one is now Open Access). A short description explains what the paper is about and why the CDK is cited. For that I am using the CiTO, of which the data is available from CiteULike. That allows me to keep track of how people are using the CDK, resulting, for example, in these wordles.

I will try to pick up this series again, but may be a bit more selective. The number of CDK-citing papers has grown extensively, resulting in at least one new paper each week (indeed, not even close to the citation rate of DAVID). I aim at covering ~5 papers each week.

Ring perception
Ring perception has evolved in the CDK. Originally, there was the Figueras algorithm (doi:10.1021/ci960013p) implementation, which was improved by Berger et al. (doi:10.1007/s00453-004-1098-x). Now, John May (the CDK release manager) has reworked the ring perception in the CDK, also introducing a new API, which I covered recently. Also check John's blog.

May, J. W., Steinbeck, C., Jan. 2014. Efficient ring perception for the chemistry development kit. Journal of Cheminformatics 6 (1), 3+. URL http://dx.doi.org/10.1186/1758-2946-6-3

Screening Assistant 2
A bit longer ago, Vincent Le Guilloux published the second version of their Screening Assistant tool for mining large sets of compounds. The CDK is used for various purposes. The paper is already from 2012 (I am that much behind with this series) and the source code on SourceForge does not seem to have changed much recently.

Figure 2 of the paper (CC-BY) shows an overview of the Screening Assistant GUI.
Guilloux, V. L., Arrault, A., Colliandre, L., Bourg, S., Vayer, P., Morin-Allory, L., Aug. 2012. Mining collections of compounds with screening assistant 2. Journal of Cheminformatics 4 (1), 20+. URL http://dx.doi.org/10.1186/1758-2946-4-20

Similarity and enrichment
Using fingerprints for compound enrichment, i.e. finding the actives in a set of compounds, is a common cheminformatics application. This paper by Avram et al. introduces a new metric (eROCE). I will not go into details, which are best explained by the paper, but note that the CDK is used via PaDEL and that various descriptors and fingerprints are used. The data set they used to show the performance is one of close to 50 thousand inhibitors of ALDH1A1.

Avram, S. I., Crisan, L., Bora, A., Pacureanu, L. M., Avram, S., Kurunczi, L., Mar. 2013. Retrospective group fusion similarity search based on eROCE evaluation metric. Bioorganic & Medicinal Chemistry 21 (5), 1268-1278. URL http://dx.doi.org/10.1016/j.bmc.2012.12.041

The International Chemical Identifier
It is only because Antony Williams advocated the importance of the InChI in these excellent slides that I list this paper again: I covered it here in more detail already. The paper describes work by Sam Adams to wrap the InChI library into a Java library, how it is integrated in the CDK, and how Bioclipse uses it. It does not formally cite the CDK, which now feels silly. Perhaps I did not add it for fear of self-citation? Who knows. Anyway, you find this paper cited on slide 30 of the aforementioned presentation from Tony.

Spjuth, O., Berg, A., Adams, S., Willighagen, E., 2013. Applications of the InChI in cheminformatics with the CDK and bioclipse. Journal of Cheminformatics 5 (1), 14+. URL http://dx.doi.org/10.1186/1758-2946-5-14

Predictive toxicology
Cheminformatics is a key tool in predictive toxicology. It starts with the assumption that compounds of similar structure behave similarly when coming in contact with biological systems. This is a long-standing paradigm which turns out to be quite hard to use, but it has not been shown to be incorrect either. This paper proposes a new approach using Pareto points and used the CDK to calculate logP values for compounds. However, I cannot find which algorithm it uses to do so.

Palczewska, A., Neagu, D., Ridley, M., Mar. 2013. Using pareto points for model identification in predictive toxicology. Journal of Cheminformatics 5 (1), 16+. URL http://dx.doi.org/10.1186/1758-2946-5-16

Cheminformatics in Python
ChemoPy is a tool to do cheminformatics in Python. This paper cites the CDK just as one of the tools available for cheminformatics. The tool is available from Google Code. It has not been migrated yet, but they still have about half a year to do so. Then again, given that there does not seem to have been any activity since 2013, I recommend looking at Cinfony instead (doi:10.1186/1752-153X-2-24): it exposes the CDK and is still maintained.

Cao, D.-S., Xu, Q.-S., Hu, Q.-N., Liang, Y.-Z., Apr. 2013. ChemoPy: freely available python package for computational biology and chemoinformatics. Bioinformatics 29 (8), 1092-1094. URL http://dx.doi.org/10.1093/bioinformatics/btt105

Bioclipse 2.6.2 with recent hacks #1: Wikidata & Linked Data Fragments

Bioclipse dialog to upload chemical
structures to an OpenTox repository.
We chem- and bioinformaticians have it easy when it comes to Open Science. Sure, writing documentation, doing unit testing, etc., takes a lot of time, but testing some new idea is done easily. Yes, people got used to that, so trying to explain that doing it properly actually takes long (documentation, unit testing) can be rather hard.

Important for this is a platform that allows you to experiment easily. For many biologists this environment is R or Python. To me, with most of the libraries important to me written in Java, this is Groovy (e.g. see my Groovy Cheminformatics book) and Bioclipse (doi:10.1186/1471-2105-8-59). Sometimes these hacks grow into full papers, like what started as OpenTox support (doi:10.1186/1756-0500-4-487), which even paved (for me) the way to the eNanoMapper project!

But often these hacks are just for me personally, or at least initially. However, I have no excuse to not make them available to a wider audience too. Of course, the source code is easy, and I normally have even the smallest Bioclipse hack available somewhere on GitHub (look for bioclipse.* repositories). But it is getting even better, now that Arvid Berg (Bioclipse team) gave me the pointers to ensure you can install those hacks, taking advantage of Uppsala's build system.

So, from now on, I will blog how to install Bioclipse hacks I deem useful for a wider audience, starting with this post on my Wikidata/Linked Data Fragments hack I used to get more CAS registry number mappings to other identifiers.

Install Bioclipse 2.6.2
The first thing you need is Bioclipse 2.6.2. That's the beta release of Bioclipse, and required for my hacks. From this link you can download binary nightly builds for GNU/Linux, MS-Windows, and OS/X. For the first two, 32 and 64 bit builds are available. You may need to install Java; version 1.7 should do fine. Unpack the archive, and then start the Bioclipse executable. For example, on GNU/Linux:

  $ tar zxvf Bioclipse.2.6.2-beta.linux.gtk.x86.tar.gz
  $ cd Bioclipse.2.6.2-beta/
  $ ./bioclipse

Install the Linked Data Fragments manager
The default update site already has a lot of goodies you can play with. Just go to Install → New Feature.... That will give you a nice dialog like this one (which allows you to install the aforementioned Bioclipse-OpenTox feature):



But that update site doesn't normally have my quick hacks. This is where Arvid's pointers come in, which I hope to carefully reproduce here so that my readers can install other Bioclipse extensions too.

Step 1: enable the 'Advanced Mode'
The first step is to enable the 'Advanced Mode'; that is, unless you are advanced, forget about this. Fortunately, the fact that you haven't given up on reading my blog yet is a good indication that you are advanced. Go to the Window → Preferences menu and enable the 'Advanced Mode' in the dialog, as shown here:


When done, click Apply and close the dialog with OK.

Step 2: add an update site from the Uppsala build system
The first step enables you to add arbitrary new update sites, like those available from the Uppsala build system, by adding a new menu option. To add a new update site, use this new menu option: Install → Software from update site...:


By clicking the Add button, you get to this dialog, where you should enter the update site information:


This dialog will become a recurrent thing in this series, though the content may change from time to time. The information you need to enter is (the name is not too important and can be something else that makes sense to you):

  1. Name: Bioclipse RDF Update Site
  2. Location: http://pele.farmbio.uu.se/jenkins/job/Bioclipse.rdf/lastSuccessfulBuild/artifact/site.p2/

After clicking OK in the above dialog, you will return to the Available Software dialog (shown earlier).

Step 3: installing the Linked Data Fragments Feature
The Available Software dialog will now show a list of features available from the just-added update site:


You can see the Linked Data Fragments Feature is now listed, which you can select with the checkbox in front of the name (as shown above). The Next button will walk you through a few more pages in this dialog, providing information about dependencies and a page that requires you to accept the Open Source licenses involved. At the end of these steps, it may require you to restart Bioclipse.

Step 4: opening the JavaScript Console and verify the new extension is installed
Because the Linked Data Fragments Feature extends Bioclipse with a new, so-called manager (see doi:10.1186/1471-2105-10-397), we need to use the JavaScript Console (or the Groovy Console or Python Console, if you prefer those languages). Make sure the JavaScript Console is open, or open it via the menu Windows → Show View → JavaScript Console, and type man ldf in the console view, which should result in something like this:


You can also type man ldf.createStore to get a brief description of the method I used to get a Linked Data Fragments wrapper for Wikidata in my previous post, which is what you should reread next.

Have fun, and I am looking forward to hearing how you use Linked Data Fragments with Bioclipse!

Getting CAS registry numbers out of WikiData

doi 10.15200/winn.142867.72538

I have promised my Twitter followers the SPARQL query you have all been waiting for. Sadly, you had to wait for it for more than two months. I'm sorry about that. But, here it is:
    PREFIX wd: <http://www.wikidata.org/entity/>

    SELECT ?compound ?id WHERE {
      ?compound wd:P231s [ wd:P231v ?id ] .
    }
What this query does is ask for all things (let's call whatever is behind the identifier a "compound"; of course, it can be mixtures, ill-defined chemicals, nanomaterials, etc.) that have a CAS registry identifier. This query results in a nice table of Wikidata identifiers (e.g. Q47512 is acetic acid) and matching CAS numbers, 16298 of them.
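When collecting this many CAS numbers from a crowdsourced resource, a quick sanity check is possible, because CAS registry numbers carry a check digit: the weighted sum of the digits (the rightmost digit before the final hyphen has weight 1, increasing leftward) modulo 10 must equal the check digit. A small sketch in plain JavaScript (a hypothetical helper for illustration, not part of any Bioclipse manager):

```javascript
// Hypothetical helper: validate a CAS registry number via its check
// digit. A CAS number has the shape 2-7 digits, hyphen, 2 digits,
// hyphen, 1 check digit; the check digit is the weighted digit sum
// (rightmost digit weight 1, increasing leftward) modulo 10.
function isValidCas(cas) {
  var match = /^(\d{2,7})-(\d{2})-(\d)$/.exec(cas);
  if (!match) return false;                    // not even the right shape
  var digits = (match[1] + match[2]).split("").reverse();
  var sum = 0;
  for (var i = 0; i < digits.length; i++) {
    sum += (i + 1) * parseInt(digits[i], 10);  // weight grows leftward
  }
  return sum % 10 === parseInt(match[3], 10);
}
```

For example, isValidCas("64-19-7") returns true for the acetic acid entry above, while a mistyped "64-19-8" is rejected.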

Because Wikidata is not specific to the English Wikipedia, CAS numbers from other origins will show up too. For example, the CAS number for N-benzylacrylamide (Q10334928) is provided by the Portuguese Wikipedia:


I used Peter Ertl's cheminfo.org (doi:10.1186/s13321-015-0061-y) to confirm this compound indeed does not have an English page, which is somewhat surprising.

The SPARQL query uses a predicate specific to the CAS registry number (P231). Other identifiers have similar predicates, like those for PubChem compound (P662) and ChemSpider (P661). That means Wikidata can become a community crowdsource of identifier mappings, which is one of the things Daniel Mietchen, I, and a few others proposed in this H2020 grant application (doi:10.5281/zenodo.13906). The SPARQL query is run by the Linked Data Fragments platform, which you should really check out too, using the Bioclipse manager I wrote around it.

The full Bioclipse script looks like:
    wikidataldf = ldf.createStore(
      "http://data.wikidataldf.com/wikidata"
    )

    // P231 CAS
    identifier = "P231"
    type = "cas"

    sparql = """
    PREFIX wd: <http://www.wikidata.org/entity/>

    SELECT ?compound ?id WHERE {
      ?compound wd:${identifier}s [ wd:${identifier}v ?id ] .
    }
    """
    mappings = rdf.sparql(wikidataldf, sparql)

    // recreate an empty output file
    outFilename = "/Wikidata/${type}2wikidata.csv"
    if (ui.fileExists(outFilename)) {
      ui.remove(outFilename)
      ui.newFile(outFilename)
    }

    // save to a file
    for (i=1; i<=mappings.rowCount; i++) {
      wdID = mappings.get(i, "compound").substring(3)
      ui.append(
        outFilename,
        wdID + "," + mappings.get(i, "id") + "\n"
      )
    }
BTW, of course, all this depends on work by many others including the core RDF generation with the Wikidata Toolkit. See also the paper by Erxleben et al. (PDF).


Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., Vrandečić, D., 2014. Introducing wikidata to the linked data web. In: Mika, P., Tudorache, T., Bernstein, A., Welty, C., Knoblock, C., Vrandečić, D., Groth, P., Noy, N., Janowicz, K., Goble, C. (Eds.), The Semantic Web – ISWC 2014. Vol. 8796 of Lecture Notes in Computer Science. Springer International Publishing, pp. 50-65. URL http://dx.doi.org/10.1007/978-3-319-11964-9_4

Mietchen, D., Others, M., Anonymous, Hagedorn, G., Jan. 2015. Enabling open science: Wikidata for research. URL http://dx.doi.org/10.5281/zenodo.13906

Ertl, P., Patiny, L., Sander, T., Rufener, C., Zasso, M., Mar. 2015. Wikipedia chemical structure explorer: substructure and similarity searching of molecules from Wikipedia. Journal of Cheminformatics 7 (1), 10+. URL http://dx.doi.org/10.1186/s13321-015-0061-y

Open Notebook Science ONSSP #1: http://onsnetwork.org/

As promised, I am slowly setting out to explore ONSSPs (Open Notebook Science Service Providers). I do not have a full overview of solutions yet, but found LabTrove and Open Notebook Science Network. The latter is more clearly an ONSSP, while the former seems to be the software.

So, my first experiment is with Open Notebook Science Network (ONSN). The platform uses WordPress, a proven technology. I am not a huge fan of the setup, which has a lot of features, making it sometimes hard to find what you need. Indeed, my first write-up ended up as a Page rather than a Post. On the upside, there is a huge community around it, with experts in every city (literally!). But my ONS is now online and you can monitor my Open research with this RSS feed.

One of the downsides is that the editor is not oriented at structured data, though there is a feature for Forms, which I may need to explore later. My first experiment was a quick, small hack: upgrading Bioclipse with OPSIN 1.6. As discussed in my #jcbms talk, I think it may be good for cheminformatics if we really start writing up step-by-step descriptions of common tasks.

My first observations are that it is an easy platform to work with. Embedding images is easy, and there should be options for chemistry extensions. For example, there is a Jmol plugin for WordPress, there are plugins for Semantic Web support (no clue which one I would recommend), and extensions for bibliographies are available too, if I am not mistaken. And, we also already see my ORCID prominently listed; I am not sure if I did this, or whether the ONSN people added this as a default feature.

Even better is the GitHub support @ONScience made me aware of, by @benbalter. The instructions were not crystal clear to me (see issues #25 and #26), but after some suggested fixes (pull request #27) it started working, and I now have a backup of my ONS at GitHub!

So, it looks like I am going to play with this ONSSP a lot more.

First steps in Open Notebook Science

Scheme 2 from this Beilstein Journal of Organic
Chemistry paper
by Frank Hahn et al.
A few weeks back I blogged about my first Open Notebook Science entry. That post suggests I will look at a few ONS service providers, but, honestly, Open Notebook Science Network serves my needs well.

What I have in mind, and will soon advocate, is that the total synthesis approach from organic chemistry fits chem- and bioinformatics research. It may not be perfect, and perhaps somewhat artificial (no pun intended), but I like the idea.

Compound to Compound
Basically, a lab notebook entry should be a step of something larger. You don't write Bioclipse from scratch. You don't do a metabolomics pathway enrichment analysis in one step, either. It's steps, each one taking you from one state to another. Ah, another nice analogy (see automata theory)! In terms of organic chemistry: from one compound to another. The importance here is that the analogy shows that there is no step you should not report. The same applies to cheminformatics: you cannot report a QSAR model without explaining how you cleaned up that SDF file you got from paper X (which is still common practice).

Methods Sections
Organic chemistry literature has well-defined templates on how to report the method for a reaction, including minimal reporting standards for the experimental results. For example, you must report chemical shifts and an elemental composition. In cheminformatics we do not have such templates, but there is no reason not to. Another feature that must be reported is the yield.

Reaction yield
The analogy with organic chemistry continues: each step has a yield. We must report this. I am not sure how, and this is one of the things I am exploring and that will be part of my argument. In fact, the point of keeping track of the variance introduced is something I have been advocating for longer. I think it really matters. We, as a research field, now publish a lot of cheminformatics and chemometrics work without taking into account the yield of methods (though, for obvious reasons, much more so in chemometrics than in cheminformatics). I won't go into that now, and there is indeed a good amount of benchmarking work, but the point is: any cheminformatics "reaction" step should be benchmarked.
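What such a "yield" could look like for a cheminformatics step can be sketched in a few lines of plain JavaScript (a made-up illustration, not an existing API): the yield is simply the fraction of inputs that survive the step, for example, labels that OPSIN could parse into structures.

```javascript
// Made-up sketch of a "yield" for a cheminformatics step: the fraction
// of inputs the step successfully converted. The step function is
// expected to return null for inputs it could not handle.
function stepYield(inputs, step) {
  var ok = 0;
  for (var i = 0; i < inputs.length; i++) {
    if (step(inputs[i]) !== null) ok++;   // null signals a failed conversion
  }
  return inputs.length === 0 ? 0 : ok / inputs.length;
}
```

So a hypothetical name-to-structure step that converts 2 of 3 input labels would report a yield of 2/3, and that number belongs in the notebook entry just like a reaction yield does.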

Total synthesis
The final aspect is that, by taking this analogy, there is a clear protocol for how cheminformatics, or bioinformatics, work must be reported: as a sequence of detailed small steps. It also means that intermediate "products" can be continued with in multiple ways: you get a directed graph of methods you applied and results you got.

You get something like this:

Created with Graphviz Workspace.

The EWx codes refer to entries in my lab notebook:
  1. EW4: Finding nodes in Anopheles gambiae pathways with IUPAC names
  2. EW5: Finding nodes in Homo sapiens pathways with IUPAC names
  3. EW6: Finding nodes in Rattus norvegicus pathways with IUPAC names
  4. EW7: converting metabolite Labels into DataNodes in WikiPathways GPML

Open Notebook Science
Of course, the above applies also if you do not do Open Notebook Science (ONS). In fact, the above outline is not really different from how I did my research before. However, I see value in using the ONS approach here. By having it Open, it

  1. requires me to be as detailed as possible
  2. allows others to repeat it
Combine this with the advantage of the total synthesis analogy:
  1. "reactions" can be performed in reasonable time
  2. easy branching of the synthesis
  3. clear methodology that can be repeated for other "compounds"
  4. step towards minimal reporting standards for cheminformatics methods
  5. clear reporting structure that is compatible with journal requirements
OK, that is more or less the paper I want to write up and submit to the Jean-Claude Bradley Memorial Issue in the Journal of Cheminformatics and Chemistry Central. It is an idea, something that helps me, and I hope more people find useful bits in this approach.

EW6: Finding nodes in Rattus norvegicus pathways with IUPAC names

Hypothesis: Rattus norvegicus pathways in WikiPathways have Label’s containing IUPAC names which can be tagged as DataNode’s of type Metabolite.

Start date: 2014-09-05 End date: 2014-09-05

Description:

WikiPathways entries in GPML have DataNode objects and Label objects. It was found before [here, here] that metabolites can be encoded in pathways as Label objects and are therefore not machine-readable as Metabolite-type DataNode’s and cannot carry a database identifier. As such, these metabolites are unusable for pathway analysis of metabolomics data.

By processing these GPML files (they are XML-based) and iterating over all Label’s, we can attempt to convert each label into a chemical structure with OPSIN. This goes under the assumption that if OPSIN can parse the label into a structure, it indeed is one. This label will be recorded along with the pathway identifier for manual inspection, and for each structure a ChemSpider identifier will also be looked up.

Methods

Unchanged protocol.

  • Download the GPML files from WikiPathways
  • Get a working Bioclipse development version (hard) with the OPSIN, InChI, and ChemSpider extensions
  • A Groovy script to iterate over the GPML and find <Label> elements
  • Each <Label> is parsed with OPSIN and, if successful, an InChI is generated
  • Use the InChIs to find ChemSpider identifiers
  • Output all as a text file and open metabolites in a Structure table
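The Label-scanning step above can be sketched as follows. The actual work used a Groovy script inside Bioclipse; this Python sketch only illustrates extracting the <Label> text from a GPML document, assuming the GPML 2013a namespace and that the visible text sits in the TextLabel attribute. The OPSIN and ChemSpider steps are left out.

```python
# Illustrative sketch (Python, not the original Bioclipse/Groovy script) of the
# label-extraction step: iterate over <Label> elements in a GPML document and
# collect their text for later OPSIN parsing. Assumes the GPML 2013a namespace
# and that the label text is stored in the TextLabel attribute.
import xml.etree.ElementTree as ET

GPML_NS = "{http://pathvisio.org/GPML/2013a}"

def extract_labels(gpml_xml):
    """Return the TextLabel of every <Label> element in a GPML document."""
    root = ET.fromstring(gpml_xml)
    return [label.get("TextLabel")
            for label in root.iter(GPML_NS + "Label")]

# tiny example document
gpml = """<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="demo">
  <Label TextLabel="5b-Pregnane-3,20-dione" GraphId="a1"/>
  <Label TextLabel="(anti-viral compounds)" GraphId="a2"/>
</Pathway>"""

print(extract_labels(gpml))
# each returned label would next be passed to OPSIN; only labels that OPSIN
# can parse into a structure are kept as metabolite candidates
```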

Report 

Similar to the experiments for Anopheles gambiae and Homo sapiens, only curated pathways were analyzed, 143 in total, downloaded from WikiPathways.org on August 24. The Groovy script detailed in that experiment was used.

[Figure: ratWP]

The script found 47 Labels that are possibly metabolites in 8 different rat pathways. The full list was uploaded to Gist.

Conclusion: Rat pathways also include metabolites encoded in GPML <Label> elements.

EW5: Finding nodes in Homo sapiens pathways with IUPAC names

Hypothesis: Homo sapiens pathways in WikiPathways have Label’s containing IUPAC names which can be tagged as DataNode’s of type Metabolite.

Start date: 2014-09-01 End date: 2014-09-01

Description: WikiPathways entries in GPML have DataNode objects and Label objects. It was found before [here] that metabolites can be encoded in pathways as Label objects and are therefore not machine-readable as Metabolite-type DataNode’s and cannot carry a database identifier. As such, these metabolites are unusable for pathway analysis of metabolomics data.

By processing these GPML files (they are XML-based) and iterating over all Label’s, we can attempt to convert each label into a chemical structure with OPSIN. This goes under the assumption that if OPSIN can parse the label into a structure, it indeed is one. This label will be recorded along with the pathway identifier for manual inspection, and for each structure a ChemSpider identifier will also be looked up.

Methods

  • Download the GPML files from WikiPathways
  • Get a working Bioclipse development version (hard) with the OPSIN, InChI, and ChemSpider extensions
  • A Groovy script to iterate over the GPML, find <Label> elements
  • Each <Label> is parsed with OPSIN and if successful, generate an InChI
  • Use the InChIs to find ChemSpider identifiers
  • Output all as a text file and open metabolites in a Structure table

Report 

[Figure: metabolitesHuman]

Similar to the experiment for Anopheles gambiae, only curated pathways were analyzed, some 266 in total, downloaded from WikiPathways.org on August 24. The previous Groovy script was updated to point to the human pathways, and also to output the results to a file rather than to STDOUT. The new script was uploaded to myExperiment.org.

The script found 42 Labels that are possibly metabolites. The full list was uploaded to Gist. Again, labels were found that could not be linked to a single ChemSpider ID. For example, “5b-Pregnane-3,20-dione” results in these ChemSpider search hits: 21427590, 389575, 21232692, 21239075, 21237402. The result file also shows a few labels containing newline characters.
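Those newline-containing labels suggest a cleanup step before handing labels to OPSIN. This is a hypothetical normalization sketch, not part of the original Groovy script:

```python
# Hypothetical cleanup step (not in the original Groovy script): some labels in
# the result file contain newline characters, which would trip up a
# name-to-structure parser such as OPSIN.

def normalize_label(label):
    """Collapse whitespace in a GPML label before passing it to OPSIN."""
    # rejoin chemical names broken after a hyphen, without inserting a space
    label = label.replace("-\n", "-")
    # collapse any remaining whitespace runs (incl. newlines) to single spaces
    return " ".join(label.split())

print(normalize_label("5b-Pregnane-\n3,20-dione"))  # -> 5b-Pregnane-3,20-dione
```

Whether a hyphen before a line break should be kept or dropped is itself a judgment call; for IUPAC names the hyphen is usually part of the name, so it is kept here.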

One metabolite was manually confirmed in WP1449: Imidazoquinolin. Interestingly, the Label was visually “connected” with “(anti-viral compounds)”, which has a ChEBI identifier and could be converted to a DataNode of type Metabolite too:

[Figure: metabolitesHuman1]

Most work, however, needs to be done in the Tryptophan metabolism pathway (WP465); many metabolites are not properly made machine readable.

Conclusion: Human pathways also include metabolites encoded in GPML <Label> elements, even in the curated subset.