Predictive modeling with Bioclipse-R - Part 2 |

*Alt. 1: Using the GUI*

*i*. Start up Bioclipse and create a new project named *AmesMutagenicity* (menu alternative **File > New...** and in the wizard **General > Project**).

*ii*. Download the Hansen AMES dataset in SMILES format (cas_N6512.smi) and store the file on your local computer. Drag and drop the file into the previously created project in the Bioclipse Navigator (choose Copy File if asked to link or not).

*iii*. Right-click the SD-file and choose **Convert > Convert SMILES to SDF**. Again, to save time, one can download the pre-converted file cas_N6512.sdf and copy it into the same project in Bioclipse. Double-click the SD-file to inspect the contents. We see that this file contains 6504 molecules and has two properties for each, name and class. For class, +1 means the compound has shown to be positive in an AMES test, and 0 means it has shown a negative AMES result.

*iv*. For this dataset we will use the signature descriptor to describe atom environments. This has been shown to produce accurate models with interpretable results. Right-click the SD-file and choose **Generate Dataset... > Signatures (Sparse)**. Choose height: **3** , Response property: **class**, Name property: **name** and save to a filename of your liking, e.g. *cas_N6512.csr* (time to complete the last two steps - ~2 minutes)^{*}.

*Alt. 1: Using Bioclipse Scripting Language*

The following script does all the necessary steps (available in the complete script ames-sign.js) (time to complete - ~2 minutes)^{*}.

//Create Bioclipse project and download molecules ui.newProject("AmesMutagenicity"); bioclipse.downloadAsFile("http://pele.farmbio.uu.se/bcr/data/cas_N6512.smi", "/AmesMutagenicity/cas_N6512.smi") //Load molecules from SMILES into memory mols=cdk.loadSMILESFile("/AmesMutagenicity/cas_N6512.smi"); //Convert SMILES to SDF by generating 2D coordinates mols=cdk.generate2dCoordinates(mols); //It is a good practice to save a SDF file before continuing cdk.saveSDFile("/AmesMutagenicity/cas_N6512.sdf", mols); //Generate a sparse signature dataset sdset=signatures.generateSparseDataset(mols, 3, "name", "class"); //Save files to sparse dataset and vector of class belongings ui.newFile("/AmesMutagenicity/cas_N6512.csr", sdset.toSparseString(" ")) ui.newFile("/AmesMutagenicity/cas_N6512-classes.txt", sdset.getResponseValuesRaw()) //Save a list of all signatures to file sb = java.lang.StringBuffer(); for (i=0; i< sdset.getColHeaders().size(); i++){ sb.append(sdset.getColHeaders().get(i) + "\n"); } ui.newFile("/AmesMutagenicity/cas_N6512-sign.txt", sb.toString())Since the commands in lines

bioclipse.downloadAsFile("http://pele.farmbio.uu.se/bcr/data/cas_N6512.sdf", "/AmesMutagenicity/cas_N6512.sdf"); mols=cdk.loadMolecules("/AmesMutagenicity/cas_N6512.sdf");We save several files: The data set in sparse representation (cas_N6512.csr), the response values with class belongings for the molecules (cas_N6512-classes.txt), and a list of the signature descriptors (cas_N6512-sign.txt).

source("AmesMutagenicity/functions_cas.R")

Using signature descriptors means that a large number of independent variables are created for a large dataset like the CAS data. However, many entries in the data matrix are zero. We therefore use a sparse matrix representation in R to save computer memory and computational time. Moreover, the use of signature descriptors means that we have no NA values in the data matrix. Neither do we have any variables with zero standard deviation. Hence we do not need to worry about or handle these issues. Also - note that that SVM is sensitive to algorithm parameter choices (as opposed to random forest). We therefore tune SVM parameters using a grid search of the parameter spaces and cross-validated classification error as objective function.

Please note that the model building operation takes about 5-8 minutes to complete on a standard laptop. The data is rather big and many models are built during the cross-validation in the grid search for optimal SVM parameter values; although we are using a sparse matrix representation this requires some time.

# Read data cas.x <- read.matrix.csr(file="AmesMutagenicity/cas_N6512.csr") cas.y <- read.delim("AmesMutagenicity/cas_N6512-classes.txt", sep=",", header=F) cas.y <- as.factor(as.matrix(cas.y)) # Create a training set and an independent validation set of 1000 observations set.seed(456) tmp <- sample.int(nrow(cas.x), size=1000) cas.x.training <- cas.x[-tmp,] cas.x.validation <- cas.x[tmp,] cas.y.training <- cas.y[-tmp] cas.y.validation <- cas.y[tmp] # Tune SVM parameters and get best model cas.training.svm <- best.svm(x=cas.x.training, y=cas.y.training, gamma=c(0.001,0.01,0.1), cost=c(4,8,16), probability=T, tunecontrol=tune.control(cross=3)) # Test performance of tuned model using the independent validation set cas.prediction.validation <- predict(cas.training.svm, cas.x.validation, probability=T) cas.classProbabilities.validation <- attr(cas.prediction.validation, "probabilities")[,1] # Plot ROC curve for prediction of the independent validation set roc.curve <- plot.roc(cas.y.validation, cas.classProbabilities.validation, percent=TRUE, ci=TRUE) roc.curve.ci <- ci.se(roc.curve, specificities=seq(0, 100, 5)) plot(roc.curve.ci, type="shape", col="cornflowerblue") # Compute area under the ROC curve (AUC) and confidence intervals for the AUC auc(roc.curve) ci.auc(roc.curve)The ROC curve, produced using the predicted Ames mutagenicity on an external test set, shows that the model performs very well. The AUC is 87.85% with the following a 95% confidence interval: 89.98%-85.72%. Thus, we now know that the model exhibits good performance and we will therefore fit the model to the whole dataset (i.e. both the training data and the validation data). To ensure as good performance as possible of the model fitted to the whole data, we need to conduct a new grid search to optimize SVM parameters.

We now build and save the model based on all available data. Please note that this operation takes about 5-8 minutes to complete on a standard laptop. The data is rather big and many models are built during the cross-validation in the grid search for optimal SVM parameter values; although we are using a sparse matrix representation this requires some time.

# Fit an SVM model with tuned parameter choices to the whole dataset cas.svm <- best.svm(x=cas.x, y=cas.y, gamma=c(0.001,0.01,0.1), cost=c(4,8,16), probability=T, tunecontrol=tune.control(cross=3)) # Save SVM model as an Rdata file save(file="AmesMutagenicity/casSvmModel.Rdata", cas.svm)We end by saving the trained model to an Rdata file, which we in the next step will use for predictions in Bioclipse.

<test id="ames.hansen.r" name="Ames Mutagenicity" endpoint="net.bioclipse.ds.r.ames.endpoint" class="net.bioclipse.ds.r.SparseSignaturesRModelMatcher" override="false"> <resource name="signaturesfile" path="models/cas_N6512-sign.txt" /> <parameter name="signatures.min.height" value="0" /> <parameter name="signatures.max.height" value="3" /> <parameter name="rdata" value="models/casSvmModel.Rdata,models/functionsForVirtualScreeningApplication.Rdata" /> <parameter name="trainedModel" value="cas.svm" /> <parameter name="requiredPackages" value="SparseM,e1071" /> <parameter name="Observations" value="6504" /> <parameter name="Variables" value="26893" /> <parameter name="URL" value="http://dx.doi.org/10.1021/ci900161g" /> <parameter name="Dataset name" value="Hansen Mutagenicity Dataset" /> </test>After creating the plugin (you can call it

After upgrading the feature in Bioclipse, your new model should be available in the Bioclipse DS View. Just open a chemical structure and press "Run" in the DS View toolbar, and all models will run simultaneously. Click on the result from a mutagenicity prediction (showing the most significant signature), and this signature is highlighted in the query molecule.