OpenTox Euro 2016: "Data integration with identifiers and ontologies"

Results from a project by MSP students.
J. Windsor et al. (2016). Volatile Organic Compounds: A Detailed Account of Identity, Origin, Activity and Pathways. Figshare.
A few weeks ago the OpenTox Euro 2016 meeting was held in Rheinfelden at the German/Swiss border (which allowed me a nice stroll across the Rhine into Switzerland, past a nice x-mas countdown clock). The meeting was co-located with eNanoMapper-hosted meetings, where we discussed, among other things, the nanoinformatics roadmaps that outline where research in this area should go.

There were many interesting talks around various data initiatives, adverse outcome pathways (AOPs) and their links to molecular initiating events (MIEs), and ontologies (like the AOP ontology talk). In fact, I quite enjoyed the discussion with Chris Grulke about ontologies during the panel discussion. A central question was where the border lies between data and ontological concepts. Some slides are available via Lanyrd.

During the Emerging Methods and Practice session hosted by Ola Spjuth, I presented the work at the BiGCaT department on identifier mapping and the use of ontologies for linking data sets.


Data integration with identifiers and ontologies from Egon Willighagen

The presentation integrates a lot of things I have been working on in the last few years; please note the second slide, which lists all the people I have worked with on the things presented in these slides.

New paper: "SPLASH, a hashed identifier for mass spectra"

I'm excited to have contributed to this important (IMHO) interoperability paper around metabolomics data: "SPLASH, a hashed identifier for mass spectra" (doi:10.1038/nbt.3689, readcube:msZj). A huge thanks to all involved in this great collaborative project! The source code project is fully open source and coordinated by Gert Wohlgemuth, the lead author on this paper. It provides an implementation of the algorithm in various programming languages, and I'm happy that the splash functionality is available in the just-released Bioclipse 2.6.2 (taking advantage of the Java library). An R package by Steffen Neumann is also available.

This new identifier greatly simplifies linking between spectral databases and will in the end contribute to a Linked Data network. Furthermore, journals can start adopting this identifier and list the 'splash' for mass spectra in the document, allowing for simplified dereplication and finding additional information about spectra.

There are several databases that have adopted the SPLASH already, such as MassBank, HMDB, MetaboLights, and the OSDB published in JCheminf recently (doi:10.1186/s13321-016-0170-2).


Screenshot snippet of a spectrum in the OSDB.

PS. I personally don't like the idea of ReadCubes (which I may blog about at some point) and how they have been pitched as a "legal" way of sharing papers, but this journal does not have a gold Open Access option, unfortunately.

Wohlgemuth, G., Mehta, S. S., Mejia, R. F., Neumann, S., Pedrosa, D., Pluskal, T., Schymanski, E. L., Willighagen, E. L., Wilson, M., Wishart, D. S., Arita, M., Dorrestein, P. C., Bandeira, N., Wang, M., Schulze, T., Salek, R. M., Steinbeck, C., Nainala, V. C., Mistrik, R., Nishioka, T., Fiehn, O., Nov. 2016. SPLASH, a hashed identifier for mass spectra. Nature Biotechnology 34 (11), 1099-1101.
http://dx.doi.org/10.1038/nbt.3689

Comparing sets of identifiers: the Bioclipse implementation

Source: Wikipedia
The problem
That sounds easy: take two collections of identifiers, put them in sets, determine the intersection, done. Sadly, each collection uses identifiers from different databases. Worse, a single set mixes identifiers from multiple databases. Mind you, I'm not going the full monty, though some chemistry will be involved at some point. Instead, this post really is about identifiers.

The example
Data set 1:

Data set 2: all metabolites from WikiPathways. This set has many different data sources, and seven provide more than 100 unique identifiers. The full list of metabolite identifiers is here.

The goal
Determine the intersection of two collections of identifiers from arbitrary databases, ultimately using scientific lenses. I will develop at least two solutions: one based on Bioclipse (this post) and one based on R (later).

Needs
First of all, we need something that actually links identifiers. Not surprisingly, I will be using BridgeDb (doi:10.1186/1471-2105-11-5) for that, but for small molecules alternatives exist, like the Open PHACTS IMS based on BridgeDb, the Chemical Translation Service (doi:10.1093/bioinformatics/btq476), or UniChem (doi:10.1186/s13321-014-0043-5, doi:10.1186/1758-2946-5-3).

The Bioclipse implementation
The first thing we need to do is read the files. I have them saved as CSV even though they are tab-separated files. Bioclipse will then open them in its matrix editor (yes, I think .tsv should be linked to that editor, which does not seem to be the case yet). Reading the human metabolites from WikiPathways is done with this code (using Groovy as the scripting language):

file1 = new File(
  bioclipse.fullPath(
    "/Compare Identifiers/human_metabolite_identifiers.csv"
  )
)
set1 = new java.util.HashSet();
file1.eachLine { line ->
  fields = line.split(/\t/)
  if (fields.size() >= 2) { // ignore empty or malformed lines
    syscode = fields[0]
    id = fields[1]
    if (syscode != "syscode") { // skip the header line
      set1.add(bridgedb.xref(id, syscode))
    }
  }
}

You can see that I am using the BridgeDb functionality already, to create Xref objects. The code skips the first line (or any line with "column headers"). The BridgeDb Xref object's equals() method ensures I only have unique cross references in the resulting set.
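As a quick check of that deduplication behavior, consider this minimal sketch (it reuses the bridgedb.xref() call from the script above; the ChEBI identifier is just an example):

a = bridgedb.xref("CHEBI:15904", "Ce")
b = bridgedb.xref("CHEBI:15904", "Ce")
check = new java.util.HashSet()
check.add(a)
check.add(b)
// Xref implements equals()/hashCode(), so both calls end up as one entry
println check.size() // prints 1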

Reading the other identifier set is a bit trickier. First, I manually changed the second column to use the BridgeDb system codes. The list is short, and this saves me from making mappings in the source code. One thing I did decide to do in the source code is normalize the ChEBI identifiers (something many of you will recognize):

file2 = new File(
  bioclipse.fullPath("/Compare Identifiers/set.csv")
)
set2 = new java.util.HashSet();
file2.eachLine { line ->
  fields = line.split(/\t/)
  if (fields.size() >= 3) { // ignore empty or malformed lines
    name = fields[0]
    syscode = fields[1]
    id = fields[2]
    if (syscode != "syscode") { // skip the header line
      // normalize ChEBI identifiers to the CHEBI: prefix form
      if (syscode == "Ce" && !id.startsWith("CHEBI:")) {
        id = "CHEBI:" + id
      }
      set2.add(bridgedb.xref(id, syscode))
    }
  }
}

Then, the naive approach that does not take into account identifier equivalence makes it easy to list the number of identifiers in both sets:

intersection = new java.util.HashSet();
intersection.addAll(set1);
intersection.retainAll(set2)

println "set1: " + set1.size()
println "set2: " + set2.size()
println "intersection: " + intersection.size()

This reports:

set1: 2584
set2: 6
intersection: 3

With the following identifiers in common:

[Ce:CHEBI:30089, Ce:CHEBI:15904, Ca:25513-46-6]

Of course, we want to use the identifier mapping itself. So, we first compare identifiers directly, and if they do not match, use BridgeDb and a metabolite identifier mapping database (get one here):

mbMapper = bridgedb.loadRelationalDatabase(
  bioclipse.fullPath(
    "/VOC/hmdb_chebi_wikidata_metabolites.bridge"
  )
)

intersection = new java.util.HashSet();
for (id2 in set2) {
  if (set1.contains(id2)) {
    // OK, direct match
    intersection.add(id2)
  } else {
    // no direct match: see if any mapped identifier is in set1
    mappings = bridgedb.map(mbMapper, id2)
    for (mapped in mappings) {
      if (set1.contains(mapped)) {
        // OK, match via identifier mapping
        intersection.add(id2)
      }
    }
  }
}

This gives five matches:

[Ch:HMDB00042, Cs:5775, Ce:CHEBI:15904, Ca:25513-46-6, Ce:CHEBI:30089]

The only metabolite it did not find in any pathway is the KEGG-identified metabolite, homocystine. I just added this compound to Wikidata, which means that the next release of the metabolite mapping database will recognize this compound too.

The R and JavaScript implementations
I will soon write up the R version in a follow-up post (but I have to finish grading student reports first).

Splitting up Bioclipse Groovy scripts

Source: Wikipedia, CC-BY-SA 3.0
... without writing additional script managers (see doi:10.1186/1471-2105-10-397). That was what I was after. I found that by using evaluate() you can load additional code. The only requirements: you wrap the code in a class, and the filename needs to match the class name. So, you put the code in a class SomeClass and save it in a Bioclipse project (e.g. SomeProject/) with the name SomeClass.groovy.

That is, I have this set up:

  SomeProject/
    SomeClass.groovy
    aScript.groovy

Then, in this aScript.groovy you can include the following code to load that class and make use of the content:

  someClass = evaluate(
    new File(
      bioclipse.fullPath("/SomeProject/SomeClass.groovy")
    )
  )
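For completeness, a minimal sketch of what such a class file could contain (the class content below is made up for illustration, and it assumes, as the loading snippet above implies, that evaluate() hands you back an object you can call methods on):

  // SomeProject/SomeClass.groovy
  class SomeClass {
    def greet(name) {
      return "Hello, " + name
    }
  }

In aScript.groovy, after the evaluate() call above, this would then be used as, for example, println someClass.greet("Bioclipse").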

Maybe there are even better ways, but this works for me. I tried the regular Groovy way of instantiating a class defined like this, but because the Bioclipse Groovy environment does not have a working directory, I could not get that to work.

Migrating pKa data from DrugMet to Wikidata

In 2010 Samuel Lampa and I started a pet project: collecting pKa data. He was working on an RDF extension of MediaWiki and I like consuming RDF data, so we started DrugMet. By the time you read this post, that MediaWiki installation may already be down, which is why I am migrating the data to Wikidata. Why? Because data curation takes effort, I like to play with Wikidata (see this H2020 proposal by Daniel Mietchen et al.), I like Open Data, and it is still much needed.

We opted for pages with a minimal amount of information, to maximize the speed at which we could add data. However, when it came to semantics, we tried to be as explicit as possible and, for example, used the CHEMINF ontology. So, each page collected:
  1. InChIKey (used to show images)
  2. the paper it was collected from (identified by a DOI)
  3. the value, and where possible, the experimental error
A page typically looks something like this:


While not used on all pages, at some point I even started using templates, and I used these two, for molecules and papers:

    {{Molecule
      |Name=
      |InChIKey=
      |DOI=
      |Wikidata=
    }}

    {{Paper
      |DOI=
      |Year=
      |Wikidata=
    }}
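As an illustration, a filled-in molecule entry could have looked something like this (illustrative only: acetic acid, with one of the pKa data sources cited later in this post as the DOI; the real DrugMet pages may have looked slightly different):

    {{Molecule
      |Name=Acetic acid
      |InChIKey=QTBSBXVTEAMEQO-UHFFFAOYSA-N
      |DOI=10.1021/ed050p510
      |Wikidata=Q47512
    }}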

These templates, as well as the above screenshot, already contain a spoiler, but more about that later. Using MediaWiki functionality it was now easy to make lists, e.g. for all pKa data (more spoilers):


I find a database like this very important. It does not capture all the information it should, though, as is clear from the proposal some of us worked on a while back. However, this project was put on hold; I don't have time for it anymore, and it is not core enough to our department to spend time writing grant proposals for it.

But I still do not want this data to get lost. Wikidata is something I have started using, as it is a machine-readable CCZero database with an increasing amount of scientific knowledge. More and more people are working on it, and you must absolutely read this paper about this very topic (by a great team you should track anyway). I use it myself as a source of identifier mappings and more. So, migrating the previously collected data to Wikidata makes perfect sense to me:

  1. if a compound is missing, I can easily create a new one using Bioclipse
  2. if a paper is missing, I can easily create a new one using Magnus Manske's QuickStatements
  3. Wikidata has a pretty decent provenance model
I can annotate data with the data source (paper) it came from and also experimental conditions:



In fact, you'll note that the book is a separate Wikidata entry in itself. Better even, it's an 'edition' of the book. This is the whole point we make in the above-linked H2020 proposal: Wikidata is not a database specific to one domain; it works for any (scholarly) domain and seamlessly links all those domains.

Now, to keep track of what data I have migrated, I am annotating DrugMet entries with links to Wikidata: everything with a Wikidata Q-code is already migrated. The above pKa table already shows Q-identifiers, but I also created them for all data sources I have used (three of them, two books and one old paper, do not have a DOI):


I still have quite a number of entries to do, but all the protocols are set up now.

On the downstream side, Wikidata is also great because of its SPARQL endpoint. Something that I did not get working some weeks ago, I did manage yesterday (after some encouragement from @arthursmith): listing all pKa statements, including the literature source where available:
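The query itself was embedded in the original post; a sketch of what it can look like is below, wrapped in a Bioclipse script as elsewhere on this blog (assumptions: P1117 is the pKa property and P248 'stated in' identifies the reference; the p:/ps:/pr:/prov: prefixes follow the standard Wikidata statement/reference model):

sparql = """
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pr: <http://www.wikidata.org/prop/reference/>
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?compound ?pKa ?source WHERE {
  ?compound p:P1117 ?statement .     # the pKa statement
  ?statement ps:P1117 ?pKa .         # the pKa value itself
  OPTIONAL {                         # literature source, if given
    ?statement prov:wasDerivedFrom/pr:P248 ?source .
  }
}
"""
results = rdf.sparqlRemote(
  "https://query.wikidata.org/sparql", sparql
)
println "pKa statements found: " + results.rowCount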

If you run that query on the Wikidata endpoint, you get a table like this:


Here we see experimental data from two papers: 10.1021/ja01489a008 and 10.1021/ed050p510. This can all be displayed in much fancier ways, like histograms or tables with 2D drawings of the chemical structures, but I leave that to the reader.

Re: How should we add citations inside software?

Current practice is that many cite webpages for software, or sometimes even just list the name. I do not understand why scholars do not en masse look up the research papers that are associated with the software. As a reviewer of research papers I often have to advise authors to revise their manuscript accordingly, but I think this is something that should be caught by the journal itself. Fact is, not all reviewers seem to check this.

At some point in the future, if publishers also take this seriously, we will have citation metrics for software like we have for research papers and increasingly for data (see also this brief idea). You can support this by assigning DOIs to software releases, e.g. using ZENODO. This list on our research group's webpage shows some of the software releases:


My advice for citing software thus goes a bit beyond what is traditionally requested of authors:

  1. cite the journal article(s) for the software that you use
  2. cite the specific software release version using ZENODO (or compatible) DOIs

This tweet gives some advice about citing software, and it triggered this blog post:
Citations inside software
Daniel Katz takes it a step further and asks how we should add citations inside software. After all, software reuses knowledge too, stands on algorithmic shoulders, and that can be a lot. This is something I can relate to: if you write a cheminformatics software library, you use a ton of algorithms, all of which are written up somewhere. Joerg Wegner did this too in his JOELib, and we adopted this idea for the Chemistry Development Kit.

So, the output looks something like:


(Yes, I spot the missing page information. But rather than information being missing, it's more that this was an online-only journal, and the renderer cannot handle that well. BTW, here you can find this paper; it was my first first-author paper.)

However, at a Java source code level it looks quite different:


The build process takes advantage of the JavaDoc taglet API and uses a BibTeXML file with the literature details. The taglet renders it to full HTML as we saw above.
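The screenshot does not copy into text well, so here is a rough sketch of what this looks like at the source level: a cdk.cite tag in the JavaDoc, whose key refers to an entry in the BibTeXML file (the class and the citation key below are made up for illustration):

    /**
     * Example descriptor, calculated with the algorithm described
     * in {@cdk.cite Willighagen2001}.
     */
    public class ExampleDescriptor {
        // implementation omitted
    }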

Bioclipse does not use this in the source code, but it does have the equivalent of a CITATION file: the managers that extend the Python, JavaScript, and Groovy scripting environments with domain-specific functionality (well, read the paper!). In any of these scripting languages you can ask for citation information:

    > doi bridgedb

This will open the webpage of the cited article (which sometimes opens in Bioclipse, sometimes in an external browser, depending on how it is configured).

At a source code level, this looks like:


So, here are my few cents. Software citation is important!

Adding disclosures to Wikidata with Bioclipse

Last week the huge, biannual ACS meeting took place (#ACSSanDiego), during which new drugs (and leads) are commonly disclosed. This time too, like this one tweeted by Bethany Halford:

Because getting this information out in the open is important, I think it's a good idea to add these compounds to Wikidata (see doi:10.3897/rio.1.e7573). So, with Bioclipse (doi:10.1186/1471-2105-8-59) I redrew the structure:


I previously blogged about how to add chemicals to Wikidata, but I realized that I also wanted to use Bioclipse to automate this process a bit. So, I wrote this script to generate the SMILES, InChI, and InChIKey, double check that the compound is not already in Wikidata (using the Wikidata SPARQL endpoint), and look up the PubChem compound identifier (the script below uses an example SMILES).

smiles = "CCCC"

mol = cdk.fromSMILES(smiles)
ui.open(mol)

inchiObj = inchi.generate(mol)
inchiShort = inchiObj.value.substring(6)
key = inchiObj.key // key = "GDGXJFJBRMKYDL-FYWRMAATSA-N"

sparql = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?compound WHERE {
  ?compound wdt:P235 "$key" .
}
"""

if (bioclipse.isOnline()) {
  results = rdf.sparqlRemote(
    "https://query.wikidata.org/sparql", sparql
  )
  missing = results.rowCount == 0
} else {
  missing = true
}

formula = cdk.molecularFormula(mol)

// Create the Wikidata QuickStatement,
// see https://tools.wmflabs.org/wikidata-todo/quick_statements.php

item = "LAST" // set to Qxxxx if you need to append info,
              // e.g. item = "Q22579236"

pubchemLine = ""
if (bioclipse.isOnline()) {
  pcResults = pubchem.search(key)
  if (pcResults.size == 1) {
    cid = pcResults[0]
    pubchemLine = "$item\tP662\t\"$cid\""
  }
}

if (!missing) {
  println "===================="
  println "Already in Wikidata as " + results.get(1,"compound")
  println "===================="
} else {
  statement = """
    CREATE
    
    $item\tDen\t\"chemical compound\"
    $item\tP233\t\"$smiles\"
    $item\tP274\t\"$formula\"
    $item\tP234\t\"$inchiShort\"
    $item\tP235\t\"$key\"
    $pubchemLine
  """

  println "===================="
  println statement
  println "===================="
}

The output of this script is a QuickStatement for Magnus Manske's tool, which is a list of commands to create and edit entries in Wikidata. (IMPORTANT: this is not meant to automate editing Wikidata! I only automate creating the input, which I carefully check, e.g. making sure all stereochemistry is defined. Note how Bioclipse opens the structure in a viewer with ui.open().) You need to enable the tool first, but if you have an account, this is not too hard. Of course, the advantage is that it is a lot quicker. I have a similar script to create QuickStatements starting from only a ChEMBL identifier.

The QuickStatement for GDC-0853 looks like:

    CREATE
    
    LAST Den "chemical compound"
    LAST P233 "O=C1C(=CC(=CN1C)c2ccnc(c2CO)N4C(=O)c3cc5c(n3CC4)CC(C)(C)C5)Nc6ncc(cc6)N7CCN(C[C@@H]7C)C8COC8"
    LAST P274 "C37H44N8O4"
    LAST P234 "1S/C37H44N8O4/c1-23-18-42(27-21-49-22-27)9-10-43(23)26-5-6-33(39-17-26)40-30-13-25(19-41(4)35(30)47)28-7-8-38-34(29(28)20-46)45-12-11-44-31(36(45)48)14-24-15-37(2,3)16-32(24)44/h5-8,13-14,17,19,23,27,46H,9-12,15-16,18,20-22H2,1-4H3,(H,39,40)/t23-/m0/s1"
    LAST P235 "WNEODWDFDXWOLU-QHCPKHFHSA-N"
    LAST P662 "86567195"


The first line creates a new Wikidata item, while the next ones add information about this compound. GDC-0853 is now also Q23304817. I added the label manually afterwards. Note how the Bioclipse script found the PubChem identifier using the InChIKey. I also use this approach to add compounds to Wikidata that we have in WikiPathways.

Adding chemical compounds to Wikidata

Adding chemical compounds to Wikidata is not difficult. You can store the chemical formula (P274), (canonical) SMILES (P233), and InChIKey (P235) (and InChI (P234), of course), as well as various database identifiers (see what I wrote about that here). It also allows storing the provenance, and has predicates for that too.

So, to enter a new structure for a compound, you add the compound information to Wikidata. Of course, make sure to create the needed accounts, particularly one for Wikidata (create account) (I am not sure if the next steps need a more general Wikimedia account too).

Entering the research paper
Magnus Manske pointed me to this helper tool. If you have the DOI of the paper, it is easy to add a new paper. This is what the tool shows for doi:10.1128/AAC.01148-08 (but no longer when you try!):


You need permission to run this script; the tool will alert you about that and give instructions on how to get permission. After I clicked Open in QuickStatements, I got this output, showing me that an entry in Wikidata was created for this paper:


Later, I can use the new Q-code (Q22309806) as the source for statements about the compound (formula, etc.).
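For example, a (tab-separated) QuickStatements line that adds a chemical formula to a compound item and backs it with this paper as the source could look like the sketch below; Qxxxx is a placeholder for the compound item, and I assume the usual QuickStatements convention of prefixing source properties with S (here S248, 'stated in'):

    Qxxxx P274 "C2H4O2" S248 Q22309806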

Draw your compound and get an InChIKey
The next step is to draw a compound and get an InChIKey. This can be done with many tools, including Bioclipse. Rajarshi opted for alternatives:

Then check that the compound is not already in Wikidata. You can use this SPARQL query for that, using the InChIKey of the compound (it's for acetic acid here, so it will be found):


For convenience, here is the copy/pastable SPARQL:
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?compound WHERE {
      ?compound wdt:P235 "QTBSBXVTEAMEQO-UHFFFAOYSA-N" .
    }
Entering the compound
So, the compound is not already in Wikidata; time to add it. The minimal information you should provide is the following:
  • mark the new entry as 'instance of' (P31) 'chemical compound' (Q11173)
  • the chemical formula and SMILES (use the paper as the reference)
    • add the reference to the paper you entered above
  • add the InChIKey and/or InChI
The first step is to create a new Wikidata entry. The Create new item menu in the left side panel can be used, showing a page like this:


As a label you can use the name used in the paper for the compound, even if it is just a code, and as the description 'chemical compound' will do for now; it can be changed later.
Feel free to add as much information about the compound as you can find. There are some chemically rich entries in Wikidata, such as that for acetic acid (Q47512).

The quality of SMILES strings in Wikidata

Russian Wikipedia on tungsten hexacarbonyl.
One thing that machine readability adds is all sorts of machine processing. Validation of data consistency is one. For SMILES strings, one of the things you can do is test whether the string parses at all. Wikidata is machine readable and, in fact, easier to parse than Wikipedia, for which the SMILES strings were validated recently in a J. Cheminformatics paper by Ertl et al. (doi:10.1186/s13321-015-0061-y).

Because I was wondering about the quality of the SMILES strings (and because people ask me about these things), I made some time today to run a test:
  1. SPARQL for all SMILES strings
  2. process each one of them with the CDK SMILES parser
I can do both easily in Bioclipse with an integrated script:

    identifier = "P233" // SMILES
    type = "smiles"

    sparql = """
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?compound ?smiles WHERE {
      ?compound wdt:P233 ?smiles .
    }
    """
    mappings = rdf.sparqlRemote("https://query.wikidata.org/sparql", sparql)

    outFilename = "/Wikidata/badWikidataSMILES.txt"
    if (ui.fileExists(outFilename)) ui.remove(outFilename)
    fileContent = ""
    for (i=1; i<=mappings.rowCount; i++) {
      try {
        wdID = mappings.get(i, "compound")
        smiles = mappings.get(i, "smiles")
        mol = cdk.fromSMILES(smiles)
      } catch (Throwable exception) {
        fileContent += (wdID + "," + smiles + ": " +
                        exception.message + "\n")
      }
      if (i % 1000 == 0) js.say("" + i)
    }
    ui.append(outFilename, fileContent)
    ui.open(outFilename)

It turns out that of the more than 16 thousand SMILES strings in Wikidata, only 42 could not be parsed. That does not mean the others are correct, but it does mean these 42 are wrong. Many of them turned out to be imported from the Russian Wikipedia, which is nice, as it gives me the opportunity to work in that Wikipedia instance too :)

At this moment, some 19 SMILES still need fixing (the list will change over time, so by the time you read this...):

SWAT4LS in Cambridge

Wordle of the #swat4ls tweets.
Last week the BiGCaT team was present with three people (Linda, Ryan, and me) at the Semantic Web Applications and Tools 4 Life Sciences meeting in Cambridge (#swat4ls). It's a great meeting, particularly because of the workshops and the hackathon. Previously, I attended the meetings in Amsterdam (where I gave this presentation) and Paris (which I apparently did not blog about).

Workshops
I have mixed feelings about missing half of the workshops on Monday for a visit to one of our Open PHACTS partners, but I do not regret that meeting at all; I just wish I could have done both. During the visit we spoke particularly about WikiPathways and our collaboration in this area.

The Monday morning workshops were cool. First, Evan Bolton and Gang Fu gave an overview of their PubChemRDF work. I have been involved in that in the past, and I greatly enjoyed seeing the progress they have made, and the rich overview of the 250GB of data they make available on their FTP site (sorry, the rights info has not gotten any clearer over the years, but it is generally considered "open"). The RDF now covers, for example, the biosystems module too, so that I can query PubChem for all compounds in WikiPathways (and compare that against internal efforts).

The second workshop I attended was by Andra and others about Wikidata. The room, about 50 people, all started editing Wikidata, in exchange for a chocolate letter:

The editing was about the prevalence of two diseases. Both topics continued during the hackathon, see below. Slides of this presentation are online. But I missed the DisGeNET workshop, unfortunately :(

Conference
The conference itself (in the new part of Clare College, even the conference dinner) started on the second day, and all presentations are backed by a paper, linked from the program. Not having attended a semantic web conference in the past ~2 years, it was nice to see the progress in the field. Some papers I found interesting:
But the rest is well worth checking out too! Webulous I was able to get going for eNanoMapper with some help (I was not paying enough attention to the GUI):

A Google Spreadsheet where I restricted the content of a set of cells to only subclasses of the "nanomaterial" class in the eNanoMapper ontology (see doi:10.1186/s13326-015-0005-5).
The conference ended with a panel discussion, and despite the efforts of me and the other panel members (Frank Gibson – Royal Society of Chemistry, Harold Solbrig – Mayo Clinic, Jun Zhao – University of Oxford), it took a while before the conference audience really started joining in. Partly this was because the conference organization had asked the community for questions, and those questions clearly did not resonate with the audience. It was not until we started discussing publishing that it became more lively. My point there was that I believe semantic web applications and tools are not really the rate-limiting factor anymore; if we really want to make a difference, we must start changing the publishing industry. This has been said by me and others for many years already, but the pace at which things change is too slow. Someone mentioned a chicken-and-egg situation, but I really believe it is all just a choice we make, with an easy solution: pick up a knife, kill the chicken, and have a nice dinner. It is annoying to see all the great efforts at this conference, but much of it is limited because our writing style makes nice stories and yields few machine-readable facts.

Hackathon
The hackathon was held at the EBI in Hinxton (south/elixir building), and during the meeting I had a hard time deciding what to hack on: there were just too many interesting technologies to work on, but I ended up working on PubChem/HDT (long) and Wikidata (short). The timings are based on the amount of help I needed to bootstrap things and how much I can figure out at home (which is a lot for Wikidata).

HDT (header, dictionary, triple) is a not-so-new but under-the-radar technology for storing triples in a binary, file-based store. The specification outlines this binary format as well as the index. That means that you can share triple data compressed and indexed, which opens up new possibilities. One thing I am interested in is using this approach for sharing link sets (doi:10.1007/978-3-319-11964-9_7) for BridgeDb, our identifier mapping platform. But there is much more, of course: share life science databases on your laptop.

This hack was triggered by a beer with Evan Bolton and Arto Bendiken. Yes, there is a Java library, hdt-java, and for me the easiest way to work out how to use a Java API is to write a Bioclipse plugin. Writing the plugin is trivial, though setting up a Bioclipse development environment is less so: the New Wizard does the hard work in seconds. But then the dependency hacking started. The Jena version it depends on is incompatible with the version in Bioclipse right now, but that is not a big deal for Eclipse, and the outcome is that we have both versions on the classpath :) That, however, did require me to introduce a new plugin, net.bioclipse.rdf.core with the IRDFStore, something I wanted to do for a long time, because that is also needed if one wants to use Sesame/OpenRDF instead of Jena.

So, after lunch I was done with the code cleanup, and I got to the HDT manager again. Soon, I could open an HDT file. I first had the API method to read it into memory, but that is not what I wanted, because I want to open large HDT files. Because it uses Jena, it conveniently provides a Jena Model object, so adding SPARQL support was easy; I could not reuse the old SPARQL code, because then I would start mixing Jena versions, but since it is all Open Source, I just copied/pasted the code (which was written by me in the first place, doi:10.1186/2041-1480-2-s1-s6; interestingly, work that originates from my previous SWAT4LS talk :). Then, I could do this:
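The screenshot with the actual Bioclipse call did not survive here, but underneath it boils down to the hdt-java API; the minimal sketch below is written from memory of that API (class and method names are assumptions and may differ between hdt-java versions):

    import org.rdfhdt.hdt.hdt.HDTManager

    // memory-map an (indexed) HDT file rather than reading it fully into memory
    hdt = HDTManager.mapIndexedHDT("/path/to/wiktionary.hdt", null)

    // empty strings act as wildcards: iterate over all triples
    iterator = hdt.search("", "", "")
    println "estimated number of triples: " + iterator.estimatedNumResults()

    hdt.close()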
It is file based, which is different from a full triple store server. So, questions arise about performance. Creating an index takes time and memory (1GB of heap space, for example). However, the index file can be shared (downloaded), and then an HDT file "opens" in a second in Bioclipse. Of course, the opening does not do anything special, like loading into memory, and should be compared to connecting to a relational database. The querying is what takes the time. Here are some numbers for the Wiktionary data that the RDFHDT team provides as an example data set:
However, I am not entirely sure what to compare this against. I will have to experiment with, for example, ChEMBL-RDF (maybe update the Uppsala version, see doi:10.1186/1758-2946-5-23). The advantage would be that ChEMBL data could easily be distributed along with Bioclipse to service the decision support features. After all, the typical query asks for data about a particular compound, not all compounds. If that works in less than 0.1 seconds, then this may give a nice user experience.

But before I reach that, it needs a bit more hacking:
  1. take the approach I took with BridgeDb mapping databases for sharing HDT files (which has the advantage that you get a decent automatic updating system, etc)
  2. ensure I can query over more than one HDT file
And probably a bit more.

Wikidata and WikiPathways
After the coffee break I joined the Wikidata people and sat down to learn about the bots. However, Andra wanted to finish something else first, where I could help out. Considering I will probably manage to hack up a bot anyway, we worked on the following. Multiple databases about genes, proteins, and metabolites like to link these biological entities to pathways in WikiPathways (doi:10.1093/nar/gkv1024). Of course, we love to collaborate with all the projects that integrate WikiPathways into their systems, but I personally would rather use a solution that serves all needs, if only because then people can do this integration without needing our time. Of course, this is an idea we pitched about a year ago in the Enabling Open Science: WikiData for Research proposal (doi:10.5281/zenodo.13906).

That is, would it not be nice if people could just pull the links between the biological entities and WikiPathways from Wikidata, using one of the many APIs it has (SPARQL, REST), supporting multiple formats (XML, JSON, RDF)? I think so, as you might have guessed. So does Andra, and he asked me if I could start the discussions in the Wikidata community, which I happily did. I'm not sure about the outcome, because having links like these is not of prime interest to them - they did not like the idea of links to the Crystallography Open Database much yet, with the argument that it is a one-to-many relation - though this is exactly what the PDB identifier is too, and that is accepted. So, it's a matter of notability again. But this is what the current proposal looks like:


Let's see how the discussion unfolds. Please feel free to chime in and show your support, comments, questions, or opposition, so that together we can get this right.

    Chemistry Development Kit
    There is undoubtedly a lot more, but I have been summarizing the meeting for about three hours now, getting notes together etc. A last thing I want to mention now, is the CDK. Cheminformatics is, afterall, a critical feature of life science data, and spoke with a few about the CDK. And I visited NextMove Software on Friday where John May works nowadays, who did a lot of work on the CDK recently (we also spoke about WikiPathways and eNanoMapper). NextMove is doing great stuff (thanks for the invitation), and so did John during his PhD in Chris Steinbeck's group at the EBI. But during the conference I also spoke with others about the CDK and following up on these conversations.