The quality of SMILES strings in Wikidata

Russian Wikipedia on tungsten hexacarbonyl.
One thing that machine readability adds is all sorts of machine processing. Validation of data consistency is one example. For SMILES strings, one of the things you can test is whether the string parses at all. Wikidata is machine readable, and, in fact, easier to parse than Wikipedia, for which the SMILES strings were recently validated in a J. Cheminformatics paper by Ertl et al. (doi:10.1186/s13321-015-0061-y).

Because I was wondering about the quality of the SMILES strings (and because people ask me about these things), I made some time today to run a test:
  1. SPARQL for all SMILES strings
  2. process each one of them with the CDK SMILES parser
I can do both easily in Bioclipse with an integrated script:

identifier = "P233" // SMILES
type = "smiles"

sparql = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?compound ?smiles WHERE {
  ?compound wdt:P233 ?smiles .
}
"""
mappings = rdf.sparqlRemote("https://query.wikidata.org/sparql", sparql)

outFilename = "/Wikidata/badWikidataSMILES.txt"
if (ui.fileExists(outFilename)) ui.remove(outFilename)
fileContent = ""
for (i=1; i<=mappings.rowCount; i++) {
  try {
    wdID = mappings.get(i, "compound")
    smiles = mappings.get(i, "smiles")
    mol = cdk.fromSMILES(smiles)
  } catch (Throwable exception) {
    fileContent += (wdID + "," + smiles + ": " +
                    exception.message + "\n")
  }
  if (i % 1000 == 0) js.say("" + i)
}
ui.append(outFilename, fileContent)
ui.open(outFilename)

It turns out that out of the more than 16 thousand SMILES strings in Wikidata, only 42 could not be parsed. That does not mean the others are correct, but it does mean these 42 are wrong. Many of them turned out to be imported from the Russian Wikipedia, which is nice, as it gives me the opportunity to work in that Wikipedia instance too :)

At this moment, some 19 SMILES still need fixing (the list will change over time, so by the time you read this...):

SWAT4LS in Cambridge

Wordle of the #swat4ls tweets.
Last week the BiGCaT team was present with three people (Linda, Ryan, and me) at the Semantic Web Applications and Tools 4 Life Sciences meeting in Cambridge (#swat4ls). It's a great meeting, particularly because of the workshops and hackathon. Previously, I attended the meetings in Amsterdam (gave this presentation) and Paris (which I apparently did not blog about).

Workshops
I have mixed feelings about missing half of the workshops on Monday for a visit to one of our Open PHACTS partners, but I do not regret that meeting at all; I just wish I could have done both. During the visit we spoke particularly about WikiPathways and our collaboration in this area.

The Monday morning workshops were cool. First, Evan Bolton and Gang Fu gave an overview of their PubChemRDF work. I have been involved in that in the past, and I greatly enjoyed seeing the progress they have made, and a rich overview of the 250 GB of data they make available on their FTP site (sorry, the rights info has not gotten any clearer over the years, but it is generally considered "open"). The RDF now covers, for example, the biosystems module too, so that I can query PubChem for all compounds in WikiPathways (and compare that against internal efforts).

The second workshop I attended was by Andra and others about Wikidata. The room, about 50 people, all started editing Wikidata, in exchange for a chocolate letter:

The editing was about the prevalence of two diseases. Both topics continued during the hackathon, see below. Slides of this presentation are online. But I missed the DisGeNET workshop, unfortunately :(

Conference
The conference itself (in the new part of Clare College, even the conference dinner) started on the second day, and all presentations are backed by a paper, linked from the program. Not having attended a semantic web conference in the past two-ish years, it was nice to see the progress in the field. Some papers I found interesting:
But the rest is most worthwhile checking out too! With some help (I was not paying enough attention to the GUI) I was able to get Webulous going for eNanoMapper:

A Google Spreadsheet where I restricted the content of a set of cells to only subclasses of the "nanomaterial" class in the eNanoMapper ontology (see doi:10.1186/s13326-015-0005-5).
The conference ended with a panel discussion, and despite the efforts of me and the other panel members (Frank Gibson – Royal Society of Chemistry, Harold Solbrig – Mayo Clinic, Jun Zhao – University of Oxford), it took a long time before the conference audience really started joining in. Partly this was because the conference organization had asked the community for questions, and those questions clearly did not resonate with the audience. It was not until we started discussing publishing that things became more lively. My point there was that I believe semantic web applications and tools are no longer the rate-limiting factor; if we really want to make a difference, we must start changing the publishing industry. This has been said by me and others for many years already, but the pace at which things change is too low. Someone mentioned a chicken-and-egg situation, but I really believe it is all just a choice we make, with an easy solution: pick up a knife, kill the chicken, and have a nice dinner. It is annoying to see all the great efforts at this conference, with much of it limited because our writing style makes nice stories but yields few machine-readable facts.

Hackathon
The hackathon was held at the EBI in Hinxton (south/elixir building) and during the meeting I had a hard time deciding what to hack on: there were just too many interesting technologies to work on, but I ended up working on PubChem/HDT (long) and Wikidata (short). The timings are based on the amount of help I needed to bootstrap things and how much I can figure out at home (which is a lot for Wikidata).

HDT (header, dictionary, triple) is a not-so-new-but-under-the-radar technology for storing triples in a binary, file-based store. The specification outlines this binary format as well as the index. That means that you can share triple data compressed and indexed, which opens up new possibilities. One thing I am interested in is using this approach for sharing link sets (doi:10.1007/978-3-319-11964-9_7) for BridgeDb, our identifier mapping platform. But there is much more, of course: share life science databases on your laptop.

This hack was triggered by a beer with Evan Bolton and Arto Bendiken. Yes, there is a Java library, hdt-java, and for me the easiest way to work out how to use a Java API is to write a Bioclipse plugin. Writing the plugin is trivial (the New Wizard does the hard work in seconds), though setting up a Bioclipse development environment is less so. But then started the dependency hacking. The Jena version it depends on is incompatible with the version in Bioclipse right now, but that is not a big deal for Eclipse, and the outcome is that we have both versions on the classpath :) That, however, did require me to introduce a new plugin, net.bioclipse.rdf.core with the IRDFStore, something I had wanted to do for a long time, because it is also needed if one wants to use Sesame/OpenRDF instead of Jena.

So, after lunch I was done with the code cleanup, and I got to the HDT manager again. Soon, I could open an HDT file. At first I used the API method that reads it into memory, but that is not what I wanted, because I want to open large HDT files. Because it uses Jena, it conveniently provides a Jena Model object, so adding SPARQL support was easy; I could not reuse the old SPARQL code, because then I would start mixing Jena versions, but since all is Open Source, I just copied/pasted the code (which was written by me in the first place, doi:10.1186/2041-1480-2-s1-s6; interestingly, work that originates from my previous SWAT4LS talk :). Then, I could do this:
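(The original shows this as a Bioclipse script; as a rough plain-Java equivalent, here is a minimal sketch assuming a recent hdt-java/hdt-jena with Apache Jena, and a hypothetical file name:)

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.rdfhdt.hdt.hdt.HDT;
import org.rdfhdt.hdt.hdt.HDTManager;
import org.rdfhdt.hdtjena.HDTGraph;

// memory-map the (indexed) HDT file; triples are not loaded into memory
HDT hdt = HDTManager.mapIndexedHDT("wiktionary.hdt", null);

// wrap it as a Jena Model so ordinary SPARQL works on top of it
Model model = ModelFactory.createModelForGraph(new HDTGraph(hdt));

String sparql = "SELECT * WHERE { ?s ?p ?o } LIMIT 5";
try (QueryExecution qe = QueryExecutionFactory.create(sparql, model)) {
    ResultSet results = qe.execSelect();
    while (results.hasNext())
        System.out.println(results.next());
}
hdt.close();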
It is file based, which is different from a full triple store server. So, questions arise about performance. Creating an index takes time and memory (1 GB of heap space, for example). However, the index file can be shared (downloaded), and then an HDT file "opens" in a second in Bioclipse. Of course, the opening does not do anything special, like loading into memory, and should be compared to connecting to a relational database. The querying is what takes the time. Here are some numbers for the Wiktionary data that the RDFHDT team provides as an example data set:
However, I am not entirely sure what to compare this against. I will have to experiment with, for example, ChEMBL-RDF (maybe update the Uppsala version, see doi:10.1186/1758-2946-5-23). The advantage would be that ChEMBL data could easily be distributed along with Bioclipse to service the decision support features, because the typical query asks for data for a particular compound, not all compounds. If that works within less than 0.1 seconds, then this may give a nice user experience.

But before I reach that, it needs a bit more hacking:
  1. take the approach I took with BridgeDb mapping databases for sharing HDT files (which has the advantage that you get a decent automatic updating system, etc)
  2. ensure I can query over more than one HDT file
And probably a bit more.

Wikidata and WikiPathways
After the coffee break I joined the Wikidata people, and sat down to learn about the bots. However, Andra wanted to finish something else first, where I could help out. Considering I can probably manage to hack up a bot anyway, we worked on the following. Multiple databases about genes, proteins, and metabolites like to link these biological entities to pathways in WikiPathways (doi:10.1093/nar/gkv1024). Of course, we love to collaborate with all the projects that integrate WikiPathways into their systems, but I would personally rather use a solution that serves all needs, if only because then people can do this integration without needing our time. Of course, this is an idea we pitched about a year ago in the Enabling Open Science: WikiData for Research proposal (doi:10.5281/zenodo.13906).

That is, would it not be nice if people could just pull the links between the biological entities and WikiPathways from Wikidata, using one of the many APIs it has (SPARQL, REST), supporting multiple formats (XML, JSON, RDF)? I think so, as you might have guessed. So does Andra, and he asked me if I could start the discussion in the Wikidata community, which I happily did. I am not sure about the outcome, because having links like these is not of prime interest to them - they did not like the idea of links to the Crystallography Open Database much yet, with the argument that it is a one-to-many relation - though this is exactly what the PDB identifier is too, and that is accepted. So, it's a matter of notability again. But this is what the current proposal looks like:


Let's see how the discussion unfolds. Please feel free to chime in and show your support, comments, questions, or opposition, so that we can get this right together.

Chemistry Development Kit
There is undoubtedly a lot more, but I have been summarizing the meeting for about three hours now, getting notes together etc. A last thing I want to mention now is the CDK. Cheminformatics is, after all, a critical feature of life science data, and I spoke with a few people about the CDK. And I visited NextMove Software on Friday, where John May works nowadays, who did a lot of work on the CDK recently (we also spoke about WikiPathways and eNanoMapper). NextMove is doing great stuff (thanks for the invitation), and so did John during his PhD in Chris Steinbeck's group at the EBI. During the conference I also spoke with others about the CDK, and I am following up on those conversations.

Bringing Molfile Sgroups to the CDK - Rendering Tips

In the last but one post I gave a demonstration of S(ubstructure)group rendering in the CDK. Now I want to give some implementation insights.

Abbreviations (Superatoms)

Abbreviations contract part of a structure to a line formula or common abbreviation.


Full structure | Abbreviated structure

Abbreviating too much or using unfamiliar terms (e.g. maybe using CAR for carbazole) can make a depiction worse. However, some structures, like CHEMBL590010, can be dramatically improved.

CHEMBL590010

One way to implement abbreviations would be by modifying the molecule data structure with literal collapse/contract and expand operations. Whilst this approach is perfectly reasonable, deleting atoms/bonds is expensive (in most toolkits) and it somewhat subtracts from the "display shortcut" nature of this Sgroup.

For efficiency, abbreviations are implemented by hiding parts of the depiction and remapping symbols. Just before rendering we iterate over the Sgroups and set hint flags that these atoms/bonds should not be included in the final image. If there is one attachment (e.g. Phenyl) we remap the attach point symbol text to the abbreviation label ('C'->'Ph'). When there are no attachments (e.g. AlCl3) we add a new symbol to the centroid of the hidden atoms.

Hide atoms and bonds | Symbol remap | Abbreviated result

For two or more attachments (e.g. SO2) you also need coordinate remapping.
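As a rough sketch of the hiding step (assuming the CDK's Sgroup API via CDKConstants.CTAB_SGROUPS and a StandardGenerator.HIDDEN hint property; the real rendering code is more involved):

import java.util.List;
import org.openscience.cdk.CDKConstants;
import org.openscience.cdk.interfaces.IAtom;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.interfaces.IBond;
import org.openscience.cdk.renderer.generators.standard.StandardGenerator;
import org.openscience.cdk.sgroup.Sgroup;
import org.openscience.cdk.sgroup.SgroupType;

void hideAbbreviations(IAtomContainer mol) {
    List<Sgroup> sgroups = mol.getProperty(CDKConstants.CTAB_SGROUPS);
    if (sgroups == null)
        return;
    for (Sgroup sgroup : sgroups) {
        if (sgroup.getType() != SgroupType.CtabAbbreviation)
            continue;
        // hint flags: the covered atoms/bonds stay in the data structure
        // but are skipped when symbols and bonds are generated
        for (IAtom atom : sgroup.getAtoms())
            atom.setProperty(StandardGenerator.HIDDEN, true);
        for (IBond bond : sgroup.getBonds())
            bond.setProperty(StandardGenerator.HIDDEN, true);
        // sgroup.getSubscript() holds the label (e.g. 'Ph') that the
        // attach point symbol is remapped to
    }
}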

Multiple Group

Multiple groups allow contraction of a discrete number of repeating units. They are handled similarly to the abbreviations, except we don't need to remap parts.

CHEBI:1233

All atoms are present in the data structure but are laid out on top of each other (demonstrated below). We have a list of parent atoms that form the repeat unit. Therefore, to display multiple groups, we hide all atoms and bonds in the Sgroup except for the parent atoms and the crossing bonds.

It's worth mentioning that hidden symbols are still generated but simply excluded from the final result. This allows the bond back-off for heteroatoms to be calculated correctly, as seen in this somewhat tangled example:

Brackets

Polymer and Multiple group Sgroups require rendering of brackets. In the molfile (and when laid out), a bracket is encoded as two points: a line. It is therefore up to the renderer to decide which side of the line the tick marks should face.

I've seen some implementations use the order of the points to convey bracket direction. Another method would be to point the brackets at each other. As shown for CHEBI:59342 this is not always correct.


Poor bracket direction | Preferred bracket direction
CHEBI:59342

I originally thought the solution might involve a graph traversal/flood-fill, but it turns out there is a very easy way to get the direction correct (sketched after the list below). First we consider that brackets may or may not be placed on bonds; if a bracket is on a bond, this information is available (crossing bonds).

  • For a bracket on a crossing bond exactly one end atom will be contained in the Sgroup, the bracket should point towards this atom.
  • If a bracket doesn't cross a bond then the direction should point to the centroid of all atoms in the Sgroup.
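A minimal sketch of that rule as a vector calculation (names are mine; javax.vecmath as used elsewhere in the CDK):

import javax.vecmath.Point2d;
import javax.vecmath.Vector2d;

// 'target' is the contained end atom of a crossing bond, or the
// centroid of all atoms in the Sgroup when no bond is crossed
static Vector2d bracketDirection(Point2d p1, Point2d p2, Point2d target) {
    // a vector perpendicular to the bracket line p1-p2
    Vector2d perp = new Vector2d(-(p2.y - p1.y), p2.x - p1.x);
    perp.normalize();
    // flip it so the tick marks face the target
    Point2d mid = new Point2d((p1.x + p2.x) / 2, (p1.y + p2.y) / 2);
    Vector2d toTarget = new Vector2d(target.x - mid.x, target.y - mid.y);
    if (perp.dot(toTarget) < 0)
        perp.negate();
    return perp;
}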

Internship in Bioinformatics/Cheminformatics

Caffeine
We are looking for a candidate for an internship/trainee position in bioinformatics/cheminformatics at the European Bioinformatics Institute (EMBL-EBI) to work on a high-performance generator for chemical structures. This position requires strong programming skills in Java, a reasonable working knowledge of chemical structures and graph theory, as well as an interest in learning about Apache Hadoop and related technologies.
The initial contract will be for 6 months with a monthly internship salary of £800.

Please send your application to steinbeck@ebi.ac.uk

So, now you have SMILES that are faulty... visualize them?

So, you validated your list of SMILES in the paper you were planning to use (or about to submit), and you found a shortlist of SMILES strings that do not look right. Well, let's visualize them.

We all used to use the Daylight Depict tool, but this is no longer online. I previously blogged about using AMBIT for SMILES depiction (which uses various tools for depiction; doi:10.1186/1758-2946-3-18), but now John May has released a CDK-only tool, called CDK Depict. The download section offers a jar file and a war file for easy deployment in a Tomcat environment. For the impatient, there is also this online host where you can give it a try (it may go offline at some point?).


Just copy/paste your shortlist there, and visually see what is wrong with them :) Big HT to John for doing all these awesome things!

How to test SMILES strings in Supplementary Information

Source. License: CC-BY 2.0.
When you stumble upon a nice paper describing a new predictive or explanatory model for a property or a class of compounds that has your interest, the first thing you do is test the training data. For example, validating SMILES (or OpenSMILES) strings in such data files is now easy with the many Open Source tools that can parse SMILES strings: the Chemistry Toolkit Rosetta provides many pointers for parsing SMILES strings. I previously blogged about a CDK/Groovy approach.

Cheminformatics toolkits need to understand what the input is in order to correctly calculate descriptors. So, let's start there. It does not matter much which toolkit you use; I will use the Chemistry Development Kit (doi:10.1021/ci025584y) here to illustrate the approach.

Let's assume we have a tab-separated values file, with the compound identifier in the first column and the SMILES in the second column. That can easily be parsed in Groovy. For each SMILES we parse it and determine the CDK atom types. For validation of the supplementary information we only want to report the fails, but let's first show all atom types:

import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.atomtype.CDKAtomTypeMatcher;

parser = new SmilesParser(
  SilentChemObjectBuilder.getInstance()
);
matcher = CDKAtomTypeMatcher.getInstance(
  SilentChemObjectBuilder.getInstance()
);

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") { // header line
    mol = parser.parseSmiles(smiles)
    println "$id -> $smiles";

    // check CDK atom types
    types = matcher.findMatchingAtomTypes(mol);
    types.each { type ->
      if (type == null) {
        report += "  no CDK atom type\n"
      } else {
        println "  atom type: " + type.atomTypeName
      }
    }
  }
}

This gives output like:

mo1 -> COC
  atom type: C.sp3
  atom type: O.sp3
  atom type: C.sp3

If we rather only report the errors, we make some small modifications and do something like:

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") {
    mol = parser.parseSmiles(smiles)
    errors = 0
    report = ""

    // check CDK atom types
    types = matcher.findMatchingAtomTypes(mol);
    types.each { type ->
      if (type == null) {
        errors += 1;
        report += "  no CDK atom type\n"
      }
    }

    // report
    if (errors > 0) {
      println "$id -> $smiles";
      print report;
    }
  }
}

Alternatively, you can use the InChI library to do such checking. And here too, we will use the CDK and the CDK-InChI integration (doi:10.1186/1758-2946-5-14).

import org.openscience.cdk.inchi.InChIGeneratorFactory;
import net.sf.jniinchi.INCHI_RET;

factory = InChIGeneratorFactory.getInstance();

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") {
    mol = parser.parseSmiles(smiles)

    // check InChI warnings
    generator = factory.getInChIGenerator(mol);
    if (generator.returnStatus != INCHI_RET.OKAY) {
      println "$id -> $smiles";
      println generator.message;
    }
  }
}

The advantage of doing this is that it will also give warnings about stereochemistry, like:

mol2 -> BrC(I)(F)Cl
  Omitted undefined stereo

I hope this gives you some ideas on what to do with content in supplementary information of QSAR papers. Of course, this works just as well for MDL molfiles. What kind of validation do you normally do?

Java Serialization: Great power but at what cost?

The default Java serialization framework provides a convenient mechanism for streaming in-memory objects to another computer or storing them on disk. Beyond the obvious badness of being tied to the internal object layout (i.e. not stable through changes), serialization can be very inefficient. Externalization and libraries like Kryo are popular for improving performance.

SMILES: CO[C@@H]([C@H](OC(C)=O)[C@@H](OC(C)=O)[C@H](OC(C)=O)[C@H](OC(C)=O)COC(C)=O)SC

In the domain of chemistry we have a rich variety of formats (e.g. SMILES) with which we can store molecules and reactions (in memory these are labelled graphs). Although these formats do not completely fulfil the utility of Object serialization, they can be used as building blocks. Not only are these de facto standards, but they can be much faster and more compact than default serialization of the in-memory connection table (graph) representation.
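To make that concrete, here is a minimal sketch (not from the original post) of treating SMILES as the (de)serialization layer with the CDK, error handling omitted:

import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesGenerator;
import org.openscience.cdk.smiles.SmilesParser;

SmilesParser    smipar = new SmilesParser(SilentChemObjectBuilder.getInstance());
SmilesGenerator smigen = SmilesGenerator.generic();

IAtomContainer mol  = smipar.parseSmiles("COC(=O)C"); // "deserialize"
String         smi  = smigen.create(mol);             // "serialize"
IAtomContainer copy = smipar.parseSmiles(smi);        // round trip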

Recent History

Crafting efficient (de)serialization is beneficial and you can get great speed with a simple setup. A few years ago I ran some experiments on writing an externalization stream for Chemistry Development Kit (CDK) molecules (thread - High Performance Structure IO). Since the objects are huge, any improvement over the default would be useful. This partly fed into the needs of CDK-Knime (a workflow tool) where I think CML was being used originally. From testing on ChEBI (~20,000 molecules) we see the ObjectInputStream was actually about as fast as an SDfile and much faster than CML. SDfiles are now much faster, but that would be another post.

Read Performance

Method               Time      Size      Throughput
AtomContainerStream    346 ms  11.1 MiB  63739 s-1
SDfile                4159 ms  51.7 MiB   5302 s-1
CML                  18605 ms  91.5 MiB   1185 s-1
ObjectInputStream     5552 ms  93.9 MiB   3972 s-1

It was around that time that Andrew Dalke paid a visit to EMBL-EBI. In discussing what I was currently working on, he promptly showed me how fast OEChem could read/write SMILES. Needless to say – pretty quick, and as fast if not faster than my attempt at 'High Throughput' streaming.

The CDK now also has fast SMILES processing and I wanted to compare this to the serialization to see how much of a performance penalty there is.

Benchmark

For a benchmark I used 100,000 structures from ChEMBL 20.

$ shuf chembl_20.smi | head -n 100000 > chembl_20_subset.smi

Writing it to an ObjectOutputStream takes 28.78 seconds. The SMILES subset file takes up 6.8 MiB on disk whilst the serialized objects take up 295 MiB. Ouch, that's roughly 43x larger.

Code 1 - Writing to an ObjectOutputStream
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.ObjectOutputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;

import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.interfaces.IChemObjectBuilder;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;

IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
SmilesParser smipar = new SmilesParser(bldr);

String srcname = "/data/chembl_20_subset.smi";
String destname = "/data/chembl_20_subset.obj";

try (InputStream in = new FileInputStream(srcname);
     Reader rdr = new InputStreamReader(in, StandardCharsets.UTF_8);
     BufferedReader brdr = new BufferedReader(rdr);
     ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(destname))) {
    String line;
    long t0 = System.nanoTime();
    while ((line = brdr.readLine()) != null) {
        try {
            IAtomContainer mol = smipar.parseSmiles(line);

            // stereochemistry does not implement Serializable...
            // so we need to remove it before writing
            mol.setStereoElements(new ArrayList<>(0));

            oos.writeObject(mol);
        } catch (CDKException e) {
            System.err.println(e.getMessage());
        }
    }
    long t1 = System.nanoTime();
    System.err.printf("write time: %.2f s\n", (t1 - t0) / 1e9);
}

In the CDK we first read SMILES with Beam and then convert to the CDK objects, so we'll also look at that small overhead. Here I compare the time to read the 100,000 SMILES using Beam, the CDK, and the objects using an ObjectInputStream. Both the CDK and Beam take less than 1 second whilst the ObjectInputStream takes more than 50.
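The read side of that comparison is essentially a loop like the following sketch (assuming the objects were written as in Code 1; note that ObjectInputStream signals end of stream with an EOFException):

import java.io.EOFException;
import java.io.FileInputStream;
import java.io.ObjectInputStream;
import org.openscience.cdk.interfaces.IAtomContainer;

static void timeRead(String filename) throws Exception {
    try (ObjectInputStream ois = new ObjectInputStream(new FileInputStream(filename))) {
        long t0 = System.nanoTime();
        int count = 0;
        while (true) {
            try {
                IAtomContainer mol = (IAtomContainer) ois.readObject();
                count++; // the molecule would be used here
            } catch (EOFException end) {
                break; // no hasNext() on ObjectInputStream; EOF marks the end
            }
        }
        long t1 = System.nanoTime();
        System.err.printf("read %d molecules in %.2f s%n", count, (t1 - t0) / 1e9);
    }
}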

In terms of throughput (mol per sec) here is the kind of speed we hit. I also show the total elapsed time for all 15 repeats.

Method              Min          Max          Elapsed Time  Size
Deserialization       1961 s-1     2089 s-1   12 m 16 s     295 MiB
Kryo (Auto)          42401 s-1    44557 s-1   33.9 s        186 MiB
Kryo Unsafe (Auto)   44854 s-1    47331 s-1   31.9 s        231 MiB
CDK                 135286 s-1   142126 s-1   10.7 s        6.8 MiB
Beam                347534 s-1   489545 s-1    3.2 s        6.8 MiB

Auxiliary Data

With a performance difference that huge, why would anyone want to use serialization? One argument might be that a format doesn't store the parts we need. A common argument against SMILES is the lack of coordinates, but we can simply store these supplementary to the SMILES if we know what the input order will be (Code 2).

Code 2 - Writing Coordinates with SMILES
IAtomContainer  mol = ...;
// 'Generic' - avoid canon SMILES, we are not doing an identity check
SmilesGenerator sg = SmilesGenerator.generic();

int n = mol.getAtomCount();
int[] order = new int[n];

// the order array is filled up as the SMILES is generated
String smi = sg.create(mol, order);

// load the coordinates array such that they are in the order the atoms
// are read when parsing the SMILES
Point2d[] coords = new Point2d[n];
for (int i = 0; i < n; i++)
    coords[order[i]] = mol.getAtom(i).getPoint2d();

// SMILES string suffixed by the coordinates
String smi2d = smi + " " + Arrays.toString(coords);

Using that same technique, it's relatively simple to extend this to handle arbitrary data fields, and it even forms the basis of ChemAxon's extended SMILES. A more advanced method would be combining the SMILES with a DataOutputStream, since we know how many coordinates to expect.
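A sketch of that DataOutputStream idea (the method name is mine; the SMILES is written as UTF, followed by two doubles per atom in SMILES output order, so the reader knows exactly how much to read back):

import java.io.DataOutputStream;
import java.io.IOException;
import javax.vecmath.Point2d;
import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.smiles.SmilesGenerator;

static void writeSmiles2d(IAtomContainer mol, DataOutputStream out)
        throws CDKException, IOException {
    SmilesGenerator sg = SmilesGenerator.generic();
    int[] order = new int[mol.getAtomCount()];
    out.writeUTF(sg.create(mol, order));
    // pack the coordinates in SMILES output order: two doubles per atom
    double[] xy = new double[2 * order.length];
    for (int i = 0; i < order.length; i++) {
        Point2d p = mol.getAtom(i).getPoint2d();
        xy[2 * order[i]]     = p.x;
        xy[2 * order[i] + 1] = p.y;
    }
    for (double v : xy)
        out.writeDouble(v);
}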

Summary

I'm certainly not against a performant AtomContainerInputStream, but the default Java serialization should never be the first choice. Hopefully this post has put some numbers on why, and will discourage knee-jerk usage.

Update

Added Kryo performance

PhD position in computational metabolomics or cheminformatics

We have an open PhD position in computational metabolomics or cheminformatics in my group at the European Bioinformatics Institute in Cambridge. The application deadline is soon, and you need to register your interest even sooner :)
http://www.embl.de/training/eipp/application/index.html

Bringing Molfile Sgroups to the CDK - Demo

Despite its flaws, the molfile has been a de facto standard for chemical representation for several decades. The core format (atom and bond block) is well supported in many toolkits, but more advanced features (dark corners) of the property block may be skipped.

At this year's Fall ACS (Boston '15) I bumped into an old colleague from ChEBI who told me they (ChEBI) couldn't use CDK because they wanted to display repeating brackets on records and CDK didn't do that.

Polymer representation (more precisely Structural Repeat Unit) used by ChEBI falls under the category of a Ctab Sgroup. I'd wanted to add support for Sgroups for some time and now had motivation to do so.


Substructure (or Substance) Groups

Over the years there seems to have been a shift in definition. The original literature[1] uses the term "substructure groups" but more recent materials use "substance groups"[2,3]. Personally I prefer "substructure" since it concisely summarises what they really are about.

Essentially an Sgroup annotates some part of the connection table (a substructure) with meta-information (data). There are several types of Sgroup that formalise the types of annotation present:

  • Display Shortcuts
    • Abbreviations
    • Multiple Groups
  • Polymers
    • Structural Repeat Unit (SRU)
    • Monomer
    • Copolymer (alternating, block, or random)
    • Mer
    • Crosslink
    • Graft
    • Modified
    • Any
  • Mixtures
    • Unordered Mixture
    • Ordered Mixture (formulation)
    • Component
  • Generic
  • Data

Example ChEBI Depictions

Egon reviewed the first patch (pull/149) last week, which focussed on representation and molfile round-tripping. The second patch enhances the rendering code to handle more than basic SRUs (e.g. >2 brackets) and display shortcuts.

As of ChEBI release 131 there are 809 entries with at least one Sgroup. Generating the depictions of these from an SDfile took < 3 seconds, then a further 11 to actually write the files to disk. The rest of this post demonstrates some examples of those depictions.
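For reference, generating such depictions is only a few lines with the CDK's DepictionGenerator; a sketch, assuming the depict module's API and a hypothetical input file name:

import java.io.FileReader;
import org.openscience.cdk.depict.DepictionGenerator;
import org.openscience.cdk.io.iterator.IteratingSDFReader;
import org.openscience.cdk.silent.SilentChemObjectBuilder;

static void depictAll(String sdfile) throws Exception {
    DepictionGenerator dg = new DepictionGenerator();
    try (IteratingSDFReader mols = new IteratingSDFReader(
            new FileReader(sdfile), SilentChemObjectBuilder.getInstance())) {
        int i = 0;
        while (mols.hasNext()) {
            // Sgroups read from the property block are picked up by the renderer
            dg.depict(mols.next()).writeTo("sgroup-" + (i++) + ".png");
        }
    }
}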

Display Shortcut, Abbreviations

Previously referred to as "superatoms", parts of a structure can be abbreviated to a more concise name (e.g. Ph for a phenyl substituent). The full structure is present but is only displayed when the expansion flag is set.

CHEBI:29441 CHEBI:7725

Display Shortcut, Multiple Group

Multiple groups allow structures with fixed repeating parts to be drawn more concisely. Similar to abbreviations, all the atoms and bonds are present but are hidden from display. They're actually all overlaid on one another with duplicated coordinates, but for rendering you still want to omit them from display.


CHEBI:1233 CHEBI:79399

Polymer, SRUs

The most common Sgroup used in ChEBI is the Structural Repeat Unit (SRU); an SRU defines a repeat unit of variable length. The brackets do not necessarily come in pairs, are not necessarily parallel, and do not necessarily point towards each other.


CHEBI:16838 CHEBI:4294
CHEBI:53422 CHEBI:59342

Polymer, Others

A few entries encode copolymers and source-based representations (monomer).


CHEBI:59599 CHEBI:3814 (overlap in original)

Combinations

A structure can have more than one Sgroup and they can be nested. Here we see a multiple group within an SRU. There is also a data Sgroup attached to the Zn-N bond marking it as a coordination bond for Marvin. I've not decided whether to render those yet, but we have the information there.


CHEBI:81539

Additional Reading

  1. Gushurst et al. The substance module: the representation, storage, and searching of complex structures. J. Chem. Inf. Comput. Sci. (1991)
  2. Blanke G. Sgroups – Abbreviations, Mixtures, Formulations, Polymers, Structures with Statistical Distribution and Other Special Cases. Online - StructurePendium Technologies GmbH
  3. Accelrys Chemical Representation
  4. CTfile Formats Specification

Open Position for Bioinformatician/Ontologist

We are seeking to recruit an experienced Bioinformatician/Ontologist to work on the eNanoMapper EU project. This project is building a computational infrastructure for toxicological data management of engineered nanomaterials (ENMs) based on open standards, ontologies and an interoperable design to enable a more effective, integrated approach to European research in nanotechnology. You will join the Cheminformatics & Metabolism team at the European Bioinformatics Institute (EMBL-EBI) located at the Wellcome Genome Campus near Cambridge in the UK.

Our group is leading work package 2 of this project, which is developing an ontology for the full domain of nanosafety research, based on existing ontologies and using the standard Semantic Web ontology language OWL. You will work with the eNanoMapper partners on ontology software development and editing, addressing the requirements outlined by the consortium.

The EBI is part of the European Molecular Biology Laboratory (EMBL) and is a world-leading bioinformatics centre providing biological data to the scientific community, with expertise in data storage, analysis and representation. EMBL-EBI provides freely available data from life science experiments, performs basic research in computational biology and offers an extensive user training programme, supporting researchers in academia and industry.

Qualifications and Experience

You will be a capable software engineer, able to work comfortably in Java, and be familiar with the OWL language and associated APIs such as the OWL API. You furthermore need to be familiar with version control (GitHub) and continuous integration systems (Jenkins). You will also need to be able to use the Protégé ontology editing tool to create ontology content according to user requirements.

In addition to your technical expertise some familiarity with the biomedical domain is required; domain expertise in an area closely related to nanosafety would be a bonus.

You will also have experience with:
• Maven
• Batch/Script/Shell programming

You will have a strong interest in ontologies, solid technical skills, and an outgoing, collaborative personality. You must be able to work as part of a team but at the same time be self-driven, trustworthy and able to make progress towards objectives independently. You must also be meticulous and careful, with good communication skills, and willing to travel to European project meetings.

Benefits

EMBL is an inclusive, equal opportunity employer offering attractive conditions and benefits appropriate to an international research organisation. The remuneration package comprises a competitive salary, a comprehensive pension scheme and health insurance, educational and other family related benefits where applicable, as well as financial support for relocation and installation.

We provide a dynamic, international working environment and have close ties with both the University of Cambridge and the Wellcome Trust Sanger Institute.
EMBL-EBI staff also enjoy excellent sports facilities, a free shuttle bus to Cambridge and other nearby centres, an active sports and social club and an attractive working environment set in 55 acres of parkland.
The initial contract is for a period of 1 year and 4 months with the possibility of a fixed-term extension.

Application Instructions

We welcome applications irrespective of gender and appointment will be based on merit alone. Applications are welcome from all nationalities – visa information will be discussed in more depth with applicants selected for interview.
Please apply online through www.embl.org/jobs