Hypothesis: Anopheles gambiae pathways in WikiPathways have DataNodes with labels containing IUPAC names that can be tagged as type Metabolite.
Start date: 2014-08-24 End date: 2014-08-24
Description: WikiPathways entries in GPML have DataNode objects and Label objects. It was found before [not published] that metabolites can be encoded in pathways as Label objects, which makes them not machine-readable as Metabolite-type DataNodes and unable to carry a database identifier. As such, these metabolites are unusable for pathway analysis of metabolomics data.
By processing these GPML files (they are XML-based) and iterating over all Labels, we can attempt to convert each label into a chemical structure with OPSIN. This assumes that if OPSIN can parse the label into a structure, the label is a chemical name. Each such label is recorded along with the pathway identifier for manual inspection. For each structure, the script also looks up a ChemSpider identifier.
- Download the GPML files from WikiPathways
- Get a working Bioclipse development version (hard) with the OPSIN, InChI, and ChemSpider extensions
- Write a Groovy script that iterates over the GPML and finds the <Label> elements
- Parse each <Label> with OPSIN and, if successful, generate an InChI
- Use the InChIs to find ChemSpider identifiers
- Output all as a text file and open metabolites in a Structure table
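The label-scanning part of these steps can be sketched outside Bioclipse. The following is a minimal Python illustration, not the Groovy script used in this experiment. It assumes that GPML <Label> elements carry TextLabel and GraphId attributes and that the file uses a default XML namespace. The sample GPML fragment is made up, and toy_lookup is a hypothetical stand-in for the OPSIN name-to-structure step: it only knows two names, with the InChIKeys taken from the results reported in this entry.

```python
import xml.etree.ElementTree as ET

# Made-up GPML fragment for illustration; real pathway files are much larger.
# The namespace and attribute names are assumptions about the GPML format.
SAMPLE_GPML = """<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="Example">
  <DataNode TextLabel="MTHFR" GraphId="a01" Type="GeneProduct"/>
  <Label TextLabel="Serine" GraphId="b93"/>
  <Label TextLabel="Mitochondrion" GraphId="c02"/>
</Pathway>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_labels(gpml_text):
    """Yield (GraphId, TextLabel) for every <Label> element in the GPML."""
    root = ET.fromstring(gpml_text)
    for element in root.iter():
        if local_name(element.tag) == "Label":
            yield element.get("GraphId", ""), element.get("TextLabel", "")


def toy_lookup(name):
    """Hypothetical stand-in for OPSIN: return an InChIKey or None."""
    known = {
        "serine": "MTCFGRXMJLQNBG-UHFFFAOYSA-N",
        "glycine": "DHMQDGOQFOQNFH-UHFFFAOYSA-N",
    }
    return known.get(name.strip().lower())


def candidate_metabolites(gpml_text, name_to_structure):
    """Keep only the labels that the name parser can turn into a structure."""
    hits = []
    for graph_id, text in iter_labels(gpml_text):
        structure = name_to_structure(text)
        if structure is not None:
            hits.append((graph_id, text, structure))
    return hits


if __name__ == "__main__":
    for graph_id, text, key in candidate_metabolites(SAMPLE_GPML, toy_lookup):
        print(f"node {graph_id} -> {text} -> {key}")
```

Passing the name parser in as a function keeps the XML handling separate from the chemistry, so a real name-to-structure tool could be substituted for toy_lookup without touching the label iteration.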
Twelve WikiPathways for Anopheles gambiae were downloaded as part of the analysis collection. In the future, uncurated pathways can also be included; these are expected to contain more metabolites that are not annotated as Metabolite type. A custom Groovy script for Bioclipse was used, based on a previous, similar script available from myExperiment.org. The updated script has been made available on myExperiment.org too. The results of running this script are visible in the above screenshot.
Key calls to Bioclipse managers used in this script, in addition to using the Groovy XmlParser, are:
Four metabolites were found, in one pathway (WP1230):
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node b93 -> Serine -> MTCFGRXMJLQNBG-UHFFFAOYSA-N -> CSID: 
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node ff7 -> Glycine -> DHMQDGOQFOQNFH-UHFFFAOYSA-N -> CSID: 
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node c8c -> Deoxythymidine monophosphate -> WVNRRNJFRREKAR-UHFFFAOYSA-N -> CSID: 
Ag_One_Carbon_Metabolism_WP1230_68447.gpml: node a47 -> Deoxyuridine monophosphate -> JSRLJPSBLDHEIO-UHFFFAOYSA-N -> CSID: [21537275, 668, 21230588]
Three metabolites have a single ChemSpider identifier, whereas one has three ChemSpider identifiers.
Conclusion: Anopheles gambiae pathways indeed also include metabolites encoded in GPML <Label> elements.