This is Part 2 of the tutorial Predictive modeling with Bioclipse-R
Predictive modeling with Bioclipse-R - Part 2

Ames mutagenicity modeling

Since it is not uncommon for anti-cancer drugs to be mutagenic, we want to be able to test compounds predicted to have anticancer activity against a mutagenicity prediction model. We will therefore construct a QSAR model predicting mutagenicity based on a set of 6504 compounds tested with the Ames assay (Hansen, K. et al., 2009).

1. Import and preprocess molecules and calculate descriptors

Alt. 1: Using the GUI
i. Start up Bioclipse and create a new project named AmesMutagenicity (menu alternative File > New... and in the wizard General > Project).

ii. Download the Hansen AMES dataset in SMILES format (cas_N6512.smi) and store the file on your local computer. Drag and drop the file into the previously created project in the Bioclipse Navigator (choose Copy File if asked to link or not).

iii. Right-click the SD-file and choose Convert > Convert SMILES to SDF. Again, to save time, one can download the pre-converted file cas_N6512.sdf and copy it into the same project in Bioclipse. Double-click the SD-file to inspect the contents. We see that this file contains 6504 molecules and has two properties for each, name and class. For class, +1 means the compound has shown to be positive in an AMES test, and 0 means it has shown a negative AMES result.

iv. For this dataset we will use the signature descriptor to describe atom environments. This has been shown to produce accurate models with interpretable results. Right-click the SD-file and choose Generate Dataset... > Signatures (Sparse). Choose height: 3 , Response property: class, Name property: name and save to a filename of your liking, e.g. cas_N6512.csr (time to complete the last two steps - ~2 minutes)*.

Alt. 1: Using Bioclipse Scripting Language
The following script does all the necessary steps (available in the complete script ames-sign.js) (time to complete - ~2 minutes)*.

//Create Bioclipse project and download molecules
ui.newProject("AmesMutagenicity");
bioclipse.downloadAsFile("http://pele.farmbio.uu.se/bcr/data/cas_N6512.smi", "/AmesMutagenicity/cas_N6512.smi")

//Load molecules from SMILES into memory
mols=cdk.loadSMILESFile("/AmesMutagenicity/cas_N6512.smi");

//Convert SMILES to SDF by generating 2D coordinates
mols=cdk.generate2dCoordinates(mols);
//It is a good practice to save a SDF file before continuing
cdk.saveSDFile("/AmesMutagenicity/cas_N6512.sdf", mols);

//Generate a sparse signature dataset
sdset=signatures.generateSparseDataset(mols, 3, "name", "class");

//Save files to sparse dataset and vector of class belongings
ui.newFile("/AmesMutagenicity/cas_N6512.csr", sdset.toSparseString(" "))
ui.newFile("/AmesMutagenicity/cas_N6512-classes.txt", sdset.getResponseValuesRaw())

//Save a list of all signatures to file
sb = java.lang.StringBuffer();
for (i=0; i< sdset.getColHeaders().size(); i++){
	sb.append(sdset.getColHeaders().get(i) + "\n");
}
ui.newFile("/AmesMutagenicity/cas_N6512-sign.txt", sb.toString())
Since the commands in lines 6 and 9 take a while, feel free to download the pre-converted file cas_N6512.sdf into the same project in Bioclipse and load it into memory, and then continue from line 14.
bioclipse.downloadAsFile("http://pele.farmbio.uu.se/bcr/data/cas_N6512.sdf", "/AmesMutagenicity/cas_N6512.sdf");
mols=cdk.loadMolecules("/AmesMutagenicity/cas_N6512.sdf");
We save several files: The data set in sparse representation (cas_N6512.csr), the response values with class belongings for the molecules (cas_N6512-classes.txt), and a list of the signature descriptors (cas_N6512-sign.txt).


2. Build and validate QSAR model

For training and testing the dataset we use R from within Bioclipse. A script containing the complete R code can be downloaded as cas.R (time to complete the R code - ~15 minutes)*. Also, download the file functions_cas.R, import it in your AmesMutagenicity project in Bioclipse, and run:
source("AmesMutagenicity/functions_cas.R")

Using signature descriptors means that a large number of independent variables are created for a large dataset like the CAS data. However, many entries in the data matrix are zero. We therefore use a sparse matrix representation in R to save computer memory and computational time. Moreover, the use of signature descriptors means that we have no NA values in the data matrix. Neither do we have any variables with zero standard deviation. Hence we do not need to worry about or handle these issues. Also - note that that SVM is sensitive to algorithm parameter choices (as opposed to random forest). We therefore tune SVM parameters using a grid search of the parameter spaces and cross-validated classification error as objective function.

Please note that the model building operation takes about 5-8 minutes to complete on a standard laptop. The data is rather big and many models are built during the cross-validation in the grid search for optimal SVM parameter values; although we are using a sparse matrix representation this requires some time.

# Read data
cas.x <- read.matrix.csr(file="AmesMutagenicity/cas_N6512.csr")
cas.y <- read.delim("AmesMutagenicity/cas_N6512-classes.txt", sep=",", header=F)
cas.y <- as.factor(as.matrix(cas.y))

# Create a training set and an independent validation set of 1000 observations
set.seed(456)
tmp <- sample.int(nrow(cas.x), size=1000)
cas.x.training <- cas.x[-tmp,]
cas.x.validation <- cas.x[tmp,]
cas.y.training <- cas.y[-tmp]
cas.y.validation <- cas.y[tmp]

# Tune SVM parameters and get best model
cas.training.svm <- best.svm(x=cas.x.training, y=cas.y.training, gamma=c(0.001,0.01,0.1), cost=c(4,8,16), probability=T, tunecontrol=tune.control(cross=3))

# Test performance of tuned model using the independent validation set
cas.prediction.validation <- predict(cas.training.svm, cas.x.validation, probability=T)
cas.classProbabilities.validation <- attr(cas.prediction.validation, "probabilities")[,1]

# Plot ROC curve for prediction of the independent validation set
roc.curve <- plot.roc(cas.y.validation, cas.classProbabilities.validation, percent=TRUE, ci=TRUE)
roc.curve.ci <- ci.se(roc.curve, specificities=seq(0, 100, 5))
plot(roc.curve.ci, type="shape", col="cornflowerblue")

# Compute area under the ROC curve (AUC) and confidence intervals for the AUC
auc(roc.curve)
ci.auc(roc.curve)

The ROC curve, produced using the predicted Ames mutagenicity on an external test set, shows that the model performs very well. The AUC is 87.85% with the following a 95% confidence interval: 89.98%-85.72%. Thus, we now know that the model exhibits good performance and we will therefore fit the model to the whole dataset (i.e. both the training data and the validation data). To ensure as good performance as possible of the model fitted to the whole data, we need to conduct a new grid search to optimize SVM parameters.

We now build and save the model based on all available data. Please note that this operation takes about 5-8 minutes to complete on a standard laptop. The data is rather big and many models are built during the cross-validation in the grid search for optimal SVM parameter values; although we are using a sparse matrix representation this requires some time.

# Fit an SVM model with tuned parameter choices to the whole dataset
cas.svm <- best.svm(x=cas.x, y=cas.y, gamma=c(0.001,0.01,0.1), cost=c(4,8,16), probability=T, tunecontrol=tune.control(cross=3))

# Save SVM model as an Rdata file
save(file="AmesMutagenicity/casSvmModel.Rdata", cas.svm)
We end by saving the trained model to an Rdata file, which we in the next step will use for predictions in Bioclipse.


*the timings were obtained using 2.7 GHz Intel Core i7 CPU, 8 GB 1333 MHz DDR3 RAM, Intel HD Graphics 3000 512 MB, Mac OS X Lion 10.7.5


4. Advanced: Package and deploy model as a Bioclipse Decision Support Plugin

Bioclipse decision-support (Bioclipse-DS) is a set of plugins for Bioclipse aimed at aiding scientists in primarily chemical liability assessment. Packaging the model as a plugin for Bioclipse-DS is described in Part 1 of the tutorial. What differs is the test definition in plugin.xml, which looks like below for this dataset:
    <test
            id="ames.hansen.r"
            name="Ames Mutagenicity"
            endpoint="net.bioclipse.ds.r.ames.endpoint"
            class="net.bioclipse.ds.r.SparseSignaturesRModelMatcher"
            override="false">
            
            <resource name="signaturesfile" path="models/cas_N6512-sign.txt" />
            <parameter name="signatures.min.height" value="0" />
            <parameter name="signatures.max.height" value="3" />

            <parameter name="rdata" value="models/casSvmModel.Rdata,models/functionsForVirtualScreeningApplication.Rdata" />
            <parameter name="trainedModel" value="cas.svm" />
            <parameter name="requiredPackages" value="SparseM,e1071" />

            <parameter name="Observations" value="6504" />
            <parameter name="Variables" value="26893" />
            <parameter name="URL" value="http://dx.doi.org/10.1021/ci900161g" />
            <parameter name="Dataset name" value="Hansen Mutagenicity Dataset" />
            
      </test>

After creating the plugin (you can call it net.bioclipse.ds.r.ames), you can simply add it to the Feature created in Part 1 of the tutorial, increment the version number in the Feature, and press "Build All" in the Update site editor. You can now upgrade the feature in Bioclipse using the Software Update function as in Part 1 of the tutorial.

After upgrading the feature in Bioclipse, your new model should be available in the Bioclipse DS View. Just open a chemical structure and press "Run" in the DS View toolbar, and all models will run simultaneously. Click on the result from a mutagenicity prediction (showing the most significant signature), and this signature is highlighted in the query molecule.


Back to the main tutorial.