The ACS Spring disclosures of 2017 #1

At the American Chemical Society meetings drug companies disclose recent new drugs to the world. Normally, the chemical structures are already out in the open, often as part of patents. But because these patents commonly discuss many compounds, the disclosures are a big thing.

Now, these disclosure meetings are weird. You will not get InChIKeys (see doi:10.1186/s13321-015-0068-4) or something similar. No, people sit down with paper and manually redraw the structure, like Carmen Drahl has done in the past. Bethany Halford has taken over that role at some point. Great work from both! Chemical & Engineering News has aggregated the tweets into this overview.

Of course, a drug structure disclosure is not complete if it does not lead to deposition in databases. The first thing is to convert the drawings into something machine readable. And thanks to the great work from John May on the Chemistry Development Kit and the OpenSMILES team, I'm happy with this being SMILES. So, we (Chris Southan and I) started a Google Spreadsheet with CCZero data:



I drew the structures in Bioclipse 2.6.2 (which has CDK 1.5.13) and copy-pasted the SMILES and InChIKey into the spreadsheet. Of course, it is essential to get the stereochemistry right. The stereochemistry of the compounds was discussed on Twitter, and we think we got it right. But we cannot be 100% sure. For that, it would have been hugely helpful if the disclosures included the InChIKeys!

As I wrote before, I see Wikidata as a central resource in a web of linked chemical data. So, using the same code I used previously to add disclosures to Wikidata, I created Wikidata items for these compounds, except for one that was already in the database (see the right image). The code also fetches PubChem compound IDs, which are also listed in this spreadsheet.

The Wikidata IDs link to the SQID interface, giving a friendly GUI, one that I actually brought up before too. That said, until people add more information, it may be a bit sparsely populated:


But others are working on this series of disclosures too, and keep an eye on this blog post, as others may follow up with further information!

CDK AtomContainers are Slow - Let's fix that

The core class for molecule representation in CDK is the AtomContainer. The AtomContainer uses an edge-list data structure for storing the underlying connection table (see The Right Representation for the Job).

Essentially, this edge-list representation is efficient in space, and atoms can be shared between and belong to multiple AtomContainers. The downside is that the connections live in the container's bond list rather than on the atoms, so querying connectivity (is this atom connected to this other atom?) is linear time in the number of bonds.
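To make that cost concrete, here is a minimal sketch of an edge-list lookup in plain Java. It is not the CDK implementation: the class EdgeListSketch and its array of index pairs are stand-ins made up for illustration, and only show why every connectivity query has to walk the full bond list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an edge-list connection table (not CDK code).
class EdgeListSketch {
    // Each bond is a pair {begin atom index, end atom index}.
    final int[][] bonds;

    EdgeListSketch(int[][] bonds) {
        this.bonds = bonds;
    }

    // Finding the neighbours of one atom scans every bond: O(numBonds).
    List<Integer> neighbours(int atom) {
        List<Integer> nbrs = new ArrayList<>();
        for (int[] bond : bonds) {
            if (bond[0] == atom) nbrs.add(bond[1]);
            else if (bond[1] == atom) nbrs.add(bond[0]);
        }
        return nbrs;
    }

    // Asking "are these two atoms bonded?" is the same linear scan.
    boolean isConnected(int a, int b) {
        for (int[] bond : bonds) {
            if ((bond[0] == a && bond[1] == b) || (bond[0] == b && bond[1] == a))
                return true;
        }
        return false;
    }
}
```

Calling neighbours() once per atom already costs atoms x bonds steps, which is where the extra factor of N in a nested algorithm comes from.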

The inefficiency of the AtomContainer can really sting. If someone were to describe Morgan's relaxation algorithm, you might implement it like Code 1. The algorithm looks reasonable, however it will run much slower than you expect. You may expect the runtime of this algorithm to be ~N^2 but it's actually ~N^3. I've annotated with XXX where the extra effort creeps in.
Code 1 - Naive Morgan-like Relaxation (AtomContainer/AtomIter)
// Step 1. Algorithm body
int[] prev = new int[mol.getAtomCount()];
int[] next = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
    next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) { // 0..numAtoms
    for (int j = 0; j < mol.getAtomCount(); j++) { // 0..numAtoms
        IAtom atom = mol.getAtom(j);
        // XXX: linear traversal! 0..numBonds
        for (IBond bond : mol.getConnectedBondsList(atom)) {
            IAtom nbr = bond.getConnectedAtom(atom);
            // XXX: linear traversal! 0..numAtoms avg=numAtoms/2
            next[j] += prev[mol.getAtomNumber(nbr)];
        }
    }
    System.arraycopy(next, 0, prev, 0, next.length);
}

A New Start: API Rewrite?


Ultimately, fixing this problem correctly would involve changing the core AtomContainer representation. Unfortunately this would require an API change; optimally, I think we would need to add the constraint that atoms/bonds can not be in multiple molecules**. This would be a monumental change and not one I can stomach right now.

Existing Trade Off: The GraphUtil class


In 2013 I added the GraphUtil class for converting an AtomContainer to a more optimal adjacency list (int[][]), which was subsequently used to speed up many algorithms including ring finding, canonicalisation, and substructure searching. Each time one of these algorithms is invoked with an IAtomContainer, the first step is to build the adjacency list 2D array.
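As a rough illustration of what such a conversion involves, the following sketch builds an int[][] adjacency list from a plain list of bonds in two passes. It is a simplified stand-in, not the GraphUtil source: the class name AdjListSketch and the use of index pairs instead of IBond objects are assumptions for the example.

```java
import java.util.Arrays;

// Simplified sketch of an edge-list to adjacency-list conversion (not CDK's GraphUtil).
class AdjListSketch {
    static int[][] toAdjList(int numAtoms, int[][] bonds) {
        // First pass: count the degree of each atom so rows can be sized exactly.
        int[] degree = new int[numAtoms];
        for (int[] bond : bonds) {
            degree[bond[0]]++;
            degree[bond[1]]++;
        }
        int[][] adj = new int[numAtoms][];
        for (int i = 0; i < numAtoms; i++)
            adj[i] = new int[degree[i]];
        // Second pass: fill in the neighbours, reusing degree[] as a write cursor.
        Arrays.fill(degree, 0);
        for (int[] bond : bonds) {
            adj[bond[0]][degree[bond[0]]++] = bond[1];
            adj[bond[1]][degree[bond[1]]++] = bond[0];
        }
        return adj;
    }
}
```

The conversion is linear in atoms plus bonds, after which the neighbours of atom i are simply adj[i].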

Code 2 - GraphUtil usage
IAtomContainer mol = ...;
int[][] adj = GraphUtil.toAdjList(mol);

// optional: also build a lookup map from edges to bonds
// (use this call instead of the one above)
EdgeToBondMap e2b = EdgeToBondMap.withSpaceFor(mol);
int[][] adj = GraphUtil.toAdjList(mol, e2b);

Although useful, GraphUtil is somewhat clunky to use: you need to pass around not just the adjacency list but also the original molecule and, if needed, the EdgeToBondMap.
Code 3 - GraphUtil Depth First Traversal

void visit(IAtomContainer mol, int[][] adj, EdgeToBondMap bondmap, int beg, int prev) {
    mol.getAtom(beg).setFlag(CDKConstants.VISITED, true);
    for (int end : adj[beg]) {
        if (end == prev)
            continue;
        if (!mol.getAtom(end).getFlag(CDKConstants.VISITED))
            visit(mol, adj, bondmap, end, beg);
        else
            bondmap.get(beg, end).setIsInRing(true); // back edge
    }
}

Using the GraphUtil approach has been successful but due to the clunky-ness I've not felt comfortable exposing the option of passing these through to public APIs. It was only ever meant as an internal optimisation to be hidden from the caller. Beyond causing unintentional poor performance (Code 1) what often happens in a workflow is GraphUtil is invoked multiple times. A typical use case would be matching multiple SMARTS against one AtomContainer.

A New Public API: Atom and Bond References


I wanted something nicer to work with and came up with the idea of using object composition to extend the existing Atom and Bond APIs with methods to improve performance and connectivity checks.

Essentially the idea is to provide two classes, an AtomRef and a BondRef, that reference a given atom or bond in a particular AtomContainer. An AtomRef knows about the original atom, its connected bonds, and its index; the BondRef knows about the original bond, its index, and the AtomRef for the connected atoms. The majority of methods (e.g. setSymbol, setImplicitHydrogenCount, setBondOrder) are passed straight through to the original atom or bond. Some methods (such as setAtom on IBond) are blocked as being unmodifiable.

Code 4 - AtomRef and BondRef structure
class AtomRef implements IAtom {
    IAtom atm;
    int idx;
    List<BondRef> bnds;
}

class BondRef implements IBond {
    IBond bnd;
    int idx;
    AtomRef beg, end;
}
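To show how the composition in Code 4 could be wired up, here is a self-contained sketch. Everything in it is a simplified stand-in: the tiny Atom interface replaces IAtom, bonds are given as index pairs, and the names AtomRefSketch, BondRefSketch, and RefBuilderSketch are hypothetical rather than part of the CDK.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for IAtom, only so the sketch compiles on its own.
interface Atom { String getSymbol(); void setSymbol(String s); }

class SimpleAtom implements Atom {
    private String symbol;
    SimpleAtom(String symbol) { this.symbol = symbol; }
    public String getSymbol() { return symbol; }
    public void setSymbol(String s) { symbol = s; }
}

// Wraps an atom: property calls pass through, connectivity is stored locally.
class AtomRefSketch implements Atom {
    final Atom atm;
    final int idx;
    final List<BondRefSketch> bnds = new ArrayList<>();
    AtomRefSketch(Atom atm, int idx) { this.atm = atm; this.idx = idx; }
    public String getSymbol() { return atm.getSymbol(); }
    public void setSymbol(String s) { atm.setSymbol(s); }
}

class BondRefSketch {
    final int idx;
    final AtomRefSketch beg, end;
    BondRefSketch(int idx, AtomRefSketch beg, AtomRefSketch end) {
        this.idx = idx; this.beg = beg; this.end = end;
    }
    // Constant-time lookup of the atom at the other end of the bond.
    AtomRefSketch getOther(AtomRefSketch atom) { return atom == beg ? end : beg; }
}

class RefBuilderSketch {
    // One pass over atoms and one over bonds: O(numAtoms + numBonds).
    static AtomRefSketch[] build(Atom[] atoms, int[][] bonds) {
        AtomRefSketch[] refs = new AtomRefSketch[atoms.length];
        for (int i = 0; i < atoms.length; i++)
            refs[i] = new AtomRefSketch(atoms[i], i);
        for (int i = 0; i < bonds.length; i++) {
            BondRefSketch bref = new BondRefSketch(i, refs[bonds[i][0]], refs[bonds[i][1]]);
            bref.beg.bnds.add(bref);
            bref.end.bnds.add(bref);
        }
        return refs;
    }
}
```

Because each reference delegates to the wrapped atom, a write through the reference (for example setSymbol) is visible on the original object, while the bond list and index are available without any search.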

We can now re-write the Morgan-like relaxation (Code 1) using AtomRef and BondRef. The scaling of this algorithm is now ~N^2 as you would expect.
Code 5 - Morgan-like Relaxation (AtomRef/AtomIter)
// Step 1. Initial up front conversion cost
AtomRef[] arefs = AtomRef.getAtomRefs(mol);

// Step 2. Algorithm body
int[] prev = new int[mol.getAtomCount()];
int[] next = new int[mol.getAtomCount()];
for (int i = 0; i < mol.getAtomCount(); i++) {
    next[i] = prev[i] = mol.getAtom(i).getAtomicNumber();
}
for (int rep = 0; rep < mol.getAtomCount(); rep++) {
    for (AtomRef aref : arefs) {
        int idx = aref.getIndex();
        for (BondRef bond : aref.getBonds()) {
            next[idx] += prev[bond.getConnectedAtom(aref).getIndex()];
        }
    }
    System.arraycopy(next, 0, prev, 0, next.length);
}

The depth first implementation also improves in readability and only requires two arguments.
Code 6 - AtomRef Depth First (AtomRef/AtomFlags)
// Algorithm body (the AtomRef conversion is done once by the caller)
void visit(AtomRef beg, BondRef prev) {
    beg.setFlag(CDKConstants.VISITED, true);
    for (BondRef bond : beg.getBonds()) {
        if (bond == prev)
            continue;
        AtomRef nbr = bond.getConnectedAtom(beg);
        if (!nbr.getFlag(CDKConstants.VISITED))
            visit(nbr, bond);
        else
            bond.setIsInRing(true); // back edge
    }
}


Benchmark


I like the idea of exposing the AtomRef and BondRef to public APIs, but I wanted to check the trade-off in calculating and using the AtomRef/BondRef vs the current internal GraphUtil. To test this I wrote a benchmark that implements some variants of Depth First Search and Morgan-like algorithms. I varied the algorithm implementations and whether I used IAtomContainer, GraphUtil, or AtomRef.

The performance was measured over ChEMBL 22, averaging the run time over 1/10th of the records (167,839 records). You can find the code on GitHub (Benchmark.java). Each algorithm computes a checksum to verify the same work is being done. Here are the raw results: depthfirst.tsv, and relaxation.tsv.


Depth First Traversal


A depth-first traversal is a linear time algorithm. I tested eight implementations that varied the graph data structure and whether I used an external visit array or atom flags to mark visited atoms. Looking just at initialisation time, the AtomRef creation is about the same as GraphUtil. There was some variability between the different variants but I couldn't isolate where the difference came from (maybe GC/JIT related). The traversal runtime of the AtomRef was marginally slower than GraphUtil. Both were significantly faster (18-20x) than the AtomContainer at doing the traversal. When we look at the total run-time (initialisation+traversal) we see that, even for a linear algorithm, the AtomRef (and GraphUtil) were ~3x faster. Including the EdgeToBondMap adds a significant penalty.




Graph Relaxation


A more interesting test is a Morgan-like relaxation: as a more expensive algorithm (N^2) it should emphasise any difference between the AtomRef and GraphUtil. The variability in this algorithm is whether we relax over atoms (AtomIter - see Code 1/5) or bonds (BondIter). We see a huge variability in the AtomContainer/AtomIter implementation. This is because the algorithm is more susceptible to differences in input (molecule) size.
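For readers who want to see what relaxing over bonds looks like, here is a minimal sketch in plain Java. It is not taken from the benchmark code: the class BondIterRelaxSketch is hypothetical and bonds are given as index pairs, but the accumulation follows the same scheme as Code 1 (next is never reset, only copied into prev after each round).

```java
// Sketch of a Morgan-like relaxation that iterates over bonds rather than atoms.
class BondIterRelaxSketch {
    // atomicNumbers[i] seeds atom i; bonds are {begin, end} index pairs.
    static int[] relax(int[] atomicNumbers, int[][] bonds) {
        int n = atomicNumbers.length;
        int[] prev = atomicNumbers.clone();
        int[] next = atomicNumbers.clone();
        for (int rep = 0; rep < n; rep++) {
            // Each bond contributes to both of its end atoms; no neighbour lookup needed.
            for (int[] bond : bonds) {
                next[bond[0]] += prev[bond[1]];
                next[bond[1]] += prev[bond[0]];
            }
            System.arraycopy(next, 0, prev, 0, n);
        }
        return prev;
    }
}
```

Since every bond is visited once per round, the cost is atoms x bonds regardless of how connectivity is stored, which is why the bond-iterating variants are less sensitive to the underlying data structure.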



Clearly the AtomContainer/AtomIter is really bad (~80x slower). Excluding this result shows that, as expected, the AtomRef/AtomIter is slower than the GraphUtil/AtomIter equivalent (~2x slower). However, because the AtomRef has a richer syntax, we can do a trick with XOR number storage to improve performance, or iterate over bonds (BondIter), giving like-for-like speeds.
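The post does not spell out the XOR trick, so the following is only my reading of it: a bond stores the XOR of its two atom indices, and the atom at the other end is recovered with a single XOR. The class XorBondSketch is a hypothetical illustration, not the benchmark's code.

```java
// Sketch of the XOR trick: a bond stores beg ^ end, so the "other" atom is one XOR away.
class XorBondSketch {
    final int xor; // begin index XOR end index

    XorBondSketch(int beg, int end) {
        this.xor = beg ^ end;
    }

    // Given one end of the bond, recover the other without a comparison or branch.
    int getOther(int idx) {
        return xor ^ idx;
    }
}
```

This works because (beg ^ end) ^ beg equals end, and likewise with the roles swapped; the caller must pass an index that really is one of the two ends.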



Conclusions


The proposed AtomRef and BondRef provide a convenience API to use the CDK in a natural way with efficient connectivity access. The conversion to an AtomRef is efficient and provides a speedup even for linear algorithms. The encapsulation facilitates passing it as a public API parameter: users will be able to compute it ahead of time and pass it along to multiple algorithms.

I'm somewhat tempted to provide an equivalent AtomContainerRef allowing a drop-in replacement for methods that take the IAtomContainer interface. It is technically possible to implement writes (e.g. delete bond) efficiently in which case it would no longer be a 'Ref'. Maybe I'll omit that functionality or use a better name?

Footnotes


  • ** My colleague Daniel Lowe notes that OPSIN allows atoms to be in multiple molecules and know about their neighbours but it's a bit of a fudge. It's certainly possible with some extra book keeping but prevents some other optimisations from being applied.

Postdoc and PhD positions in Cheminformatics at Jena University, Germany

One postdoc position and three PhD positions are available in my newly founded research group at Jena University, Germany.

I am currently moving from my previous position as Head of Cheminformatics and Metabolism at the European Bioinformatics Institute (EBI) to the Institute for Analytical Chemistry as Professor for Analytical Chemistry, Cheminformatics and Chemometrics at Jena University. The successful candidates will help form the nucleus of the new research group and work in an exciting network of local and international collaborations, such as the PhenoMeNal project funded by the European Commission in their Horizon2020 framework program.

Open Positions:

  1. Postdoc: We are looking for a talented cheminformatician, bioinformatician or someone with comparable skills to work on the development of cloud-based methods for computational metabolomics. The successful candidate will work closely with the H2020 e-infrastructure project PhenoMeNal, a European consortium of 14 partners. This position requires excellent skills in at least one modern, object-oriented programming language. A strong interest in metabolomics and cloud computing as well as the ability to work in a distributed team will be advantageous. The postdoc will also have the opportunity to participate in the day-to-day management of the group as well as in the organisation of seminars and practical courses for our students.
  2. PhD student, biomedical information mining: In this PhD project the candidate will combine methods of text mining, image mining and cheminformatics to extract information about metabolites and natural products from the published primary literature. This includes opportunities to work with the OpenMinTed consortium, where we have been leading the biomedical use case in the last 1.5 years, as well as with the ContentMine team.
  3. PhD student, cheminformatic prediction of natural product structures: Depending on the skills and interests of the successful candidate, this project can target the problem of structure prediction of natural products and metabolites either from the side of spectroscopic information which one might have about an unknown natural product, or starting from the genome of a natural product producing organism. Two positions are available in this area.

All PhD positions require a strong interest in molecular informatics and current IT technologies, programming skills in a modern object-oriented programming language, and the ability to work in geographically distributed teams.

Please send applications in PDF format by email to christoph.steinbeck@uni-jena.de. We will accept applications until the position is filled.

Background information:

The Friedrich Schiller University Jena (FSU Jena), founded in 1558, is one of the oldest universities in Europe and a member in the COIMBRA group, a network of prestigious, traditional European universities. The University of Jena has a distinguished record of innovations and resulting educational strengths in  major fields such as optics, photonics and optical technologies, innovative materials and related technologies, dynamics of complex biological systems and humans in changing social environments. It has more than 18,000 students. The university’s friendly and stimulating atmosphere and state-of-the-art facilities boost academic careers and enable excellence in learning, teaching and research. Assistance with proposing and inaugurating new research projects and with establishing public-private partnerships is considered a crucial point.

About Christoph Steinbeck

 

Alzheimer’s disease, PaDEL-Descriptor, CDK versions, and QSAR models

A new paper in PeerJ (doi:10.7717/peerj.2322) caught my eye for two reasons. First, it's nice to see a paper using the CDK in PeerJ, one of the journals of an innovative, gold Open Access publishing group. Second, that's what I call a graphical abstract (shown on the right)!

The paper describes a collection of Alzheimer-related QSAR models. It primarily uses fingerprints, calculated with the PaDEL-Descriptor software (doi:10.1002/jcc.21707). I just checked the (new) PaDEL-Descriptor website and it still seems to use CDK 1.4. The page has the note "Hence, there are some compatibility issues which will only be resolved when PaDEL-Descriptor updates to CDK 1.5.x, which will only happen when CDK 1.5.x becomes the new stable release." and I hope Yap Chun Wei will soon find time to make this update. I had a look at the source code, but with no NetBeans experience and no install instructions, I was unable to compile it. AMBIT is now up to speed with CDK 1.5, so the migration should not be too difficult.

Mind you, PaDEL is used quite a bit, so the impact of such an upgrade would be substantial. The Wiley webpage for the article mentions 184 citations, Google Scholar counts 369.

But there is another thing. The authors of the Alzheimer paper compare various fingerprints and the predictive powers of models based on them. I am really looking forward to a paper where the authors compare the same fingerprint (or set of descriptors) but with different CDK versions, particularly CDK 1.4 against 1.5. My guess is that the models based on 1.5 will be better, but I am not entirely convinced yet that the increased stability of 1.5 is actually going to make a significant impact on the QSAR performance... what do you think?


Simeon, S., Anuwongcharoen, N., Shoombuatong, W., Malik, A. A., Prachayasittikul, V., Wikberg, J. E. S., Nantasenamat, C., Aug. 2016. Probing the origins of human acetylcholinesterase inhibition via QSAR modeling and molecular docking. PeerJ 4, e2322. doi:10.7717/peerj.2322

Yap, C. W., May 2011. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. Journal of Computational Chemistry 32 (7), 1466-1474. doi:10.1002/jcc.21707

The Groovy Cheminformatics scripts are now online

My Groovy Cheminformatics with the Chemistry Development Kit book sold more than 100 times via Lulu.com now. An older release can be downloaded as CC-BY from Figshare and was "bought" 39 times. That does not really make a living, but does allow me to financially support CiteULike, for example, where you can find all the references I use in the book.

The content of the book is not unique. The book exists for convenience, it explains things around the APIs, gives tips and tricks. In the first place for myself, to help me quickly answer questions on the cdk-user mailing list. This list is a powerful source of answers, and the archive covers 14 years of user support:


One of the design goals of the book was to have many editions, allowing me to keep all scripts updated. In fact, all scripts in the book are run each time I make a new release of the book and, therefore, with each release of the CDK that I make a book release for. That also explains why a new release of the book currently takes quite a bit of time, because there are so many API updates at the moment, as you can read about in the draft CDK 3 paper.

Now, I had for a long time also the plan to make the scripts freely available. However, I never got around to making the website to go with that. I have given up on the idea of a website and now use GitHub. So, you can now, finally, find the scripts for the two active book releases on GitHub. Of course, without the explanations and context. For that you need the book.

Happy CDK hacking!

Generic Structure Depiction

Last week I attended the Seventh Joint Sheffield Conference on Chemoinformatics. It was a great meeting with some cool science and attendees. I had the pleasure of chatting briefly with John Barnard who's contributed a lot to the representation, storage, and retrieval of generic (aka Markush) structures (see Torus, Digital Chemistry - now owned by Lhasa).

At NextMove we've been doing a bit on processing sketches from patents (see Sketchy Sketches). I learnt a few things about how generic structures are typically depicted that I thought would be interesting to share.

Substituent Variation (R groups)


The most common type of generic feature is substituent variation, colloquially known as R groups. The variation allows concise representation with an invariant/fixed part of a compound and variable/optional part.


wherein R denotes

That is: anisole, toluene, or ethylbenzene.

Substituent Labels


Multiple substituent labels may be distinguished by a number: R1, R2, ... Rn. However in reality, any label can and will be used. This can be particularly confusing when they collide with elements; examples include Ra (Radium), Rg (Roentgenium), B (Boron), D (Deuterium), Y (Yttrium), W (Tungsten). The distinction between the label Ra and Radium may be semantically captured by a format but lost in depiction.

To distinguish such labels we can style them differently. By using superscripting and italicizing the label the distinction becomes clear and also somewhat improves the aesthetics of numbered R groups. We avoid subscript due to ambiguities with stoichiometry, for example: –NR2.

Attachment Points


For substituents there are different notation options. In writing, radical nomenclature is used, for the above example we'd say: methyl-oxyl (-OMe), ethyl (-Et), or methyl (-Me). However this doesn't translate well to depictions: .

The CTfile actually does store substituents this way and specifies the attachment point (APO) information separately.

$RGP
1
$CTAB
2 1 0 0 0 0 999 V2000
1.9048 -0.0893 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.6192 0.3232 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M APO 1 1 1
M END
$END CTAB
$CTAB
1 0 0 0 0 0 999 V2000
1.9940 -1.2869 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
M APO 1 1 1
M END
$END CTAB
$CTAB
2 1 0 0 0 0 999 V2000
1.8750 -2.3286 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.5895 -1.9161 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M APO 1 1 1
M END
$END CTAB
$END RGP
Alternatively we may use a virtual or 'null' atom. We can convert to/from CTfile format, although it's slightly easier to delete the null atom than to add it on, due to coordinate generation. A disadvantage of this is that the atom count isn't accurate; however, the labelled group is also a type of null atom and already distorts the atom count. There are unfortunately different ways of depicting this null atom.
Don't use a dative bond style! You have to fudge the valences and it just doesn't work: how would I show a double bond attachment?

The first time I'd encountered attachment points was in ChEBI, where an R group means 'something attaches here' (CHEBI:58314, CHEBI:52569), whilst a 'star' label means 'attaches to something' (CHEBI:37807, CHEBI:32861). This is actually a nice way of thinking about it: like two jigsaw pieces, the asymmetry allows the substituent to connect to the labelled atom.

The 'star' atom used by ChEBI is tempting to use as there is a star atom in SMILES.

*OC
*C
*CC

However, a '*' in SMILES actually means 'unspecified atomic number', and some toolkits impose additional semantics. ChemAxon reads a 'star' to mean 'any atom', whilst OEChem, Indigo, and OpenBabel actually read it more like an R Group, with [*:1] and [*:2] being R1 and R2 etc. ChemAxon Extended SMILES allows us to explicitly encode attachment points.

*OC |$_AP1$|
*C |$_AP1$|
*CC |$_AP1$|
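As a small illustration of how such labels can be read back, the sketch below pulls the atom label layer out of an extended SMILES string and reports which atom positions carry an _AP label. It is a deliberately simplified stand-in, not a real CXSMILES parser: the class CxSmilesLabelSketch is hypothetical, and it assumes the label layer is the first thing after the opening '|' and that labels are separated by ';' with one slot per atom in SMILES order.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch: pull the atom labels out of a CXSMILES "$...$" layer.
class CxSmilesLabelSketch {
    // Returns the 0-based atom positions whose label starts with "_AP".
    static List<Integer> attachmentPoints(String cxsmiles) {
        List<Integer> result = new ArrayList<>();
        int open = cxsmiles.indexOf("|$");
        if (open < 0) return result;            // no label layer present
        int close = cxsmiles.indexOf('$', open + 2);
        if (close < 0) return result;           // malformed layer, give up
        // Labels are separated by ';', one slot per atom in SMILES order.
        String[] labels = cxsmiles.substring(open + 2, close).split(";", -1);
        for (int i = 0; i < labels.length; i++) {
            if (labels[i].startsWith("_AP")) result.add(i);
        }
        return result;
    }
}
```

A production reader would need to handle the other extension layers and map positions onto parsed atoms; this only shows that the attachment point is carried as a label on the star atom.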
I opted to implement the wavy line notation in CDK which is preferred by IUPAC graphical representation guidelines.
A major disadvantage of this notation is mis-encoding by users mistaking it for a wavy up/down stereo bond. I talk more about this in the poster (Sketchy Sketches) but the number of times you see the following drawn:
The captured connection table for that sketch does not have null atoms but instead uses carbon:

CC-BY with the ACS Author Choice: CDK and Blue Obelisk papers liberated

Screenshot of an old CDK-based JChemPaint, from the first CDK paper. CC-BY :)
Already a while ago, the American Chemical Society (ACS) decided to allow the Creative Commons Attribution license (version 4.0) to be used on their papers, via their Author Choice program. ACS members pay $1500, which is low for a traditional publisher. While I would rather see them move to a gold Open Access journal, it is a very welcome option! For the ACS business model it means a guaranteed sale of some 40 copies of this paper (at about $35 each), because it will not immediately affect the sale of the full journal (much). Some papers may have sold more than that had they remained closed access, but for many papers that sounds like a smart move money-wise. Of course, they also buy themselves some goodwill, and green Open Access is just around the corner anyway.

Better, perhaps, is that you can also use this option to make a past paper Open Access under a CC-BY license! And that is exactly what Christoph Steinbeck did with five of his papers, including two on which I am co-author. And these are not the least papers either. The first is the first CDK paper from 2003 (doi:10.1021/ci025584y), which featured a screenshot of JChemPaint shown above. Note that in those days, the print journal was still the target, so the screenshot is in gray scale :) BTW, given that this paper is cited 329 times (according to ImpactStory), maybe the ACS could have sold more than 40 copies. But for me, it means that finally people can read this paper about Open Science in chemistry, even after so many years. BTW, there is little chance the second CDK paper will be freed in a similar way.

The second paper that was liberated this way, is the first Blue Obelisk paper (doi:10.1021/ci050400b), which was cited 276 times (see ImpactStory):


This screenshot nicely shows how readers can see the CC-BY license for this paper. Note that it also lists that the copyright is with the ACS, which is correct, because in those days you commonly gave away your copyright to the publisher (I have stopped doing this, bar some unfortunate recent exceptions).

So, head over to your email client and email support@services.acs.org and let them know you also want your JCICS/JCIM paper available under a CC-BY license! There is no excuse anymore for not making your seminal work in cheminformatics available as gold Open Access!

Of course, submitting your new work to the Journal of Cheminformatics is cheaper and has the advantage that all papers are Open Access!

Postdoc in Bioinformatics for Anti-Obesity Strategy

Courtesy of Frank Genten

The Steinbeck Group at European Bioinformatics Institute (EMBL-EBI, Cambridge, UK), together with the Lab of Tony Vidal-Puig (U. Cambridge/ WT Sanger), are excited to announce a joint opening for a Post-doc position to work on a multi-omics project to identify new players in the human brown and beige adipocyte recruitment as an anti-obesity strategy.  The project involves data analysis of multi-omics data sets (metabolomics, transcriptomics, proteomics, among others) and integration of that data into different mathematical modelling frameworks (discrete logical models, kinetic ODE-based, FBA). With these models and data, the fellow will identify novel pharmaceutical strategies to induce BAT generation/WAT browning. The models will be used to evaluate in silico the potential effect of drugs on adipocytes. Finally, the best candidate molecules will be applied to the human pluripotent stem cell models to confirm their capacity to induce brown/beige adipogenesis in-vitro. Experiments will be performed with the support of experts at WTSI.

Experience in applying mathematical modelling techniques is desirable, as well as previous exposure to large data sets. Of course, the candidate is not expected to have expertise in all of the areas listed above, as training will be given on parts where the applicant has less experience.

The EMBL-EBI is part of the European Molecular Biology Laboratory (EMBL) and it is a world-leading bioinformatics centre providing biological data to the scientific community with expertise in data storage, analysis and representation. EMBL-EBI provides freely available data from life science experiments, performs basic research in computational biology and offers an extensive user training programme, supporting researchers in the academic and industrial sectors.

EMBL-EBI and Wellcome Trust Sanger Institute share the Wellcome Genome Campus. This proximity fosters close collaborations and contributes to an international and vibrant campus environment. Researchers are supported by easy access to scientific expertise, well-equipped facilities and an active seminar programme.

The EMBL-EBI–Sanger Postdoctoral (ESPOD) Programme builds on the strong collaborative relationship between the two institutes, offering projects which combine experimental (wet lab) and computational approaches.

Please apply here: http://www.embl.de/jobs/searchjobs/index.php?ref=EBI_00732&newlang=1

I have absolutely no clue why this paper is citing the CDK...

Corrosion behaviors and effects of corrosion products of plasma electrolytic oxidation coated AZ31 magnesium alloy under the salt spray corrosion test

Re: How should we add citations inside software?

Practice is that many cite webpages for the software, and sometimes even just list the name. I do not understand why scholars do not en masse look up the research papers that are associated with the software. As a reviewer of research papers I often have to advise authors to revise their manuscript accordingly, but I think this is something that should be caught by the journal itself. Fact is, not all reviewers seem to check this.

In some future, if publishers would also take this seriously, we will have citation metrics for software like we have for research papers and increasingly for data (see also this brief idea). You can support this by assigning DOIs to software releases, e.g. using ZENODO. This list on our research group's webpage shows some of the software releases:


My advice for citing software thus goes a bit beyond what is traditionally requested from authors:

  1. cite the journal article(s) for the software that you use
  2. cite the specific software release version using ZENODO (or compatible) DOIs

 This tweet gives some advice about citing software, triggering this blog post:
Citations inside software
Daniel Katz takes it a step further and asks how we should add citations inside software. After all, software reuses knowledge too and stands on algorithmic shoulders, and this can be a lot. This is something I can relate to a lot: if you write a cheminformatics software library, you use a ton of algorithms, all of which are written up somewhere. Joerg Wegner did this too in his JOELib, and we adopted this idea for the Chemistry Development Kit.

So, the output looks something like:


(Yes, I spot the missing page information. But rather than missing information, it's more that this was an online only journal, and the renderer cannot handle it well. BTW, here you can find this paper; it was my first first author paper.)

However, at a Java source code level it looks quite different:


The build process is taking advantage of the JavaDoc taglet API and uses a BibTeXML file with the literature details. The taglet renders it to full HTML as we saw above.
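For readers who want to experiment with the general idea without setting up a taglet, here is a sketch of a different, simpler technique: a runtime annotation that carries a literature key and can be read back by reflection. This is explicitly not how the CDK does it; the @Cite annotation, the example class, and the lookup helper are all hypothetical, and the DOI is just the first CDK paper's identifier reused as sample data.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation carrying a literature key (the CDK itself uses a JavaDoc tag).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Cite {
    String value();
}

// Example class declaring which paper it builds on (sample data only).
@Cite("doi:10.1021/ci025584y")
class ExampleAlgorithmSketch {
}

class CitationLookupSketch {
    // Returns the citation key declared on a class, or null if it has none.
    static String citationOf(Class<?> type) {
        Cite cite = type.getAnnotation(Cite.class);
        return cite == null ? null : cite.value();
    }
}
```

The trade-off compared with a JavaDoc tag is that the citation is available to running code (for example to print a reference list), but it is not automatically rendered into the HTML documentation.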

Bioclipse does not use this in the source code, but does have the equivalent of a CITATION file: the managers, that extend the Python, JavaScript, and Groovy scripting environments with domain specific functionality (well, read the paper!). You can ask in any of these scripting languages about citation information:

    > doi bridgedb

This will open the webpage of the cited article (which sometimes opens in Bioclipse, sometimes in an external browser, depending on how it is configured).

At a source code level, this looks like:


So, here are my few cents. Software citation is important!