This is Part 1 of the tutorial Predictive modeling with Bioclipse-R
Predictive modeling with Bioclipse-R - Part 1

Cancer cell line growth inhibition modeling - NCI60 U251

The NCI60 is a panel of 59 cancer cell lines developed by the National Cancer Institute, which is used for screening for new anticancer drug candidates. Using a set of 3515 chemical compounds screened against the glioblastoma cell line U251, we build a QSAR model predicting whether a new compound has anticancer activity or not.

There are two ways to perform the analysis, one using the Graphical User Interface (GUI) of Bioclipse, and the other using the Bioclipse Scripting Language.

1. Import and preprocess molecules

Alt. 1: Using the GUI
i. Start up Bioclipse and create a new project named screenU251 (menu alternative File > New... and in the wizard General > Project).

ii. Download the U251 dataset in SMILES format (screen_U251.smi) and store the file and store it on your local computer. Drag and drop the file into the previously created project in the Bioclipse Navigator (choose Copy File if asked to link or not).

iii. Right-click the SD-file and choose Convert > Convert SMILES to SDF (time to complete - less than 1 minute)*. You can also download the pre-converted file screen_U251.sdf and copy it into the same project in Bioclipse. Double-click the SD-file to inspect the contents. We see that this file contains 3516 molecules and has two properties for each, name and class. For class, +1 means it inhibits cell growth, and -1 means it does not.

iv. Right-click the SD-file and choose Generate 3D... > Balloon (time to complete - ~25 minutes)*. Change the name of the generated file to "screen_U251_3D.sdf".

Alt. 2: Using Bioclipse Scripting Language
The following script does all the above steps (available in the complete script screen_U251_script.js) (time to complete - ~35 minutes)*.

   //Create a new project and download SMILES file
   ui.newProject("screenU251")
   bioclipse.downloadAsFile("http://pele.farmbio.uu.se/bcr-temp/data/screen_U251.smi", "/screenU251/screen_U251.smi")

   //Load mols from SMILES and save SDF
   mols=cdk.loadSMILESFile("/screenU251/screen_U251.smi")
   cdk.saveSDFile("/screenU251/screen_U251.sdf", mols)

   //Generate 3D and save SDF with 3D coords for convenience
   mols=balloon.generateMultiple3Dcoordinates(mols)
   cdk.saveSDFile("/screenU251/screen_U251_3D.sdf", mols)
Alt. 3: Using the pre-built SDF with 3D coordinates
This saves time.
ui.newProject("screenU251")
bioclipse.downloadAsFile("http://pele.farmbio.uu.se/bcr-temp/data/screen_U251_3D.sdf", "/screenU251/screen_U251_3D.sdf")


2. Compute descriptors and set up dataset

We have now the chemical structures in an SDF with 3D coordinates (to skip ahead here, download the file screen_U251_3D.sdf. We now proceed to describe these molecules numerically by calculating descriptors.

Alt. 1: Using the GUI
i. Create a new QSAR project and name it U251-QSAR (menu alternative File > New... and in the wizard QSAR > New QSAR project). In the new project, double-click the file qsar.xml to open up the QSAR Editor.

ii. Import molecules into the QSAR project by switching to the editor tab "Molecules" (at the bottom of the editor window) and then licking "Add". In the dialog that opens, select the file "screen_U251_3D.sdf" and click Next. Right-click on the property "class" and select as response property. Then click Finish.

   


iii. Select descriptors to calculate on the editor tab "Descriptors". Make a selection in the left canvas and press the button ">>" to add the descriptors. Be sure that the "Display only with implementation" box is checked, and add all the descriptors that are available.

iv. Switch to the "Responses" editor tab and fill in Response label: class, and select Response label placement in CSV: last.

v. Calculate the descriptors by pressing the Run icon in the top right corner of the QSAR Editor (time to complete - ~35 minutes)*. After calculation finishes, double-click the generated file dataset.csv to inspect it.


Alt. 2: Using Bioclipse Scripting Language
The following script does all the above steps (available in the complete script screen_U251_script.js) (time to complete - ~20 minutes)*.

//Create a new project for the QSAR analysis
ui.newProject("U251_QSAR")

//Set up list of descriptors - use all available in CDK
dlist=qsar.getDescriptorIDs(true)
descriptors=dlist.toString().replace(", ","\n").replace("[","").replace("]","")
ui.newFile("/U251_QSAR/screen-cdk.descriptors", descriptors)

//Do descriptor calculation, append the response property as last column
dset=qsar.calculate(mols,dlist)
dset.setNameProperty("name")
dset.setResponseProperty("class")

//Save results to file
ui.newFile("/U251_QSAR/dataset.csv", dset.asCSV("\t"))


3. Build and validate QSAR model

For training and testing the dataset we use R from within Bioclipse. You can download the script containing all the commands as u251.R (time to complete the R code - ~2 minutes)*.

Before proceeding make sure the file functions_U251.R is downloaded and imported it in your U251-QSAR project. Load the required R libraries and functions by sourcing the file:

source("U251-QSAR/functions_U251.R")
you can also open the file in Bioclipse's R editor and source it by clicking on the "Source R script" button in the main toolbar.

Begin the analysis:

# Read data
u251 <- read.delim("U251-QSAR/dataset.csv", sep="\t", na.strings="NaN", row.names=1)

# Create a training set and an independent validation set of 1000 observations
set.seed(123)
tmp <- sample.int(nrow(u251), size=1000)
u251.validation <- u251[tmp,]
u251.train <- u251[-tmp,]
Due to the fact that all CDK descriptors are not possible to compute for all molecules in the U251 dataset, there are NA values in the dataset. It can also happen that some descriptors have the same value for all molecules in the dataset (i.e. the descriptor has zero variance across the dataset). Such descriptors does not carry any useful information for fitting a model to predict the growth inhibition. These two issues (NA-values and descriptors with zero variance) need to be taken care of, which we do by using the functions 'naAndStdvTreatment' and 'imputation'. 'naAndStdvTreatment' removes variables in the dataset that has a large proportion of NA values or has zero variance. 'imputation' does a simple imputation of the remaining NA values with the median of the entries in the variable. Note that these procedures are performed only on the training data in order to preserve the independence between the training and the validation datasets. We will later use the results from 'naAndStdvTreatment' and 'imputation' _applied on the training data_ to remove variables and do the imputation on the validation data (see below).
# Handle variables with lots of NA and with zero standard deviation
u251.train.naAndStdvTreatment <- naAndStdvTreatment(u251.train, 0.2)
u251.train <- u251.train.naAndStdvTreatment$dataset

# Impute NA in variables with not so many NA
u251.train.imputed <- imputation(u251.train, response="class")
u251.train <- u251.train.imputed$imputedData

# Fit random forest model to training set
u251.train.rf <- randomForest(as.factor(class) ~ ., data=u251.train)

# Test performance of fitted model using the independent validation set. We here use the results from 'naAndStdvTreatment' and 'imputation' _applied on the training data_ to remove variables and do the imputation on the validation data.
u251.validation <- predict(u251.train.naAndStdvTreatment, u251.validation)
u251.validation <- predict(u251.train.imputed, u251.validation)
u251.classProbabilities.validation <- predict(u251.train.rf, u251.validation, type="prob")[,2]

# Plot ROC curve for prediction of the independent validation set
roc.curve <- plot.roc(u251.validation[,"class"], u251.classProbabilities.validation, percent=TRUE, ci=TRUE)
roc.curve.ci <- ci.se(roc.curve, specificities=seq(0, 100, 5))
plot(roc.curve.ci, type="shape", col="cornflowerblue")

# Compute area under the ROC curve (AUC) and confidence intervals for the AUC
auc(roc.curve)
ci.auc(roc.curve)

The ROC curve, produced using the predicted growth inhibition on an external test set, shows that the model performs relatively well (although not excellently). The AUC is 77.8% with the following a 95% confidence interval: 74.9%-80.7%, indicating relatively good predictive performance.
# Fit random forest model to the whole dataset (after dealing with NA and variables with zero standard deviation - see explanation above)
u251.naAndStdvTreatment <- naAndStdvTreatment(u251, 0.2)
u251 <- u251.naAndStdvTreatment$dataset
u251.imputed <- imputation(u251, response="class")
u251 <- u251.imputed$imputedData
u251.rf <- randomForest(as.factor(class) ~ ., data=u251)

# This is the fitted model that we will use for future predictions. We thus save the random forest model as an Rdata file
save(file="U251-QSAR/u251randomForestModel.Rdata", u251.rf, u251.naAndStdvTreatment, u251.imputed)

The following screencast demonstrates how R can be invoked and used from Bioclipse:



*the timings were obtained using 2.7 GHz Intel Core i7 CPU, 8 GB 1333 MHz DDR3 RAM, Intel HD Graphics 3000 512 MB, Mac OS X Lion 10.7.5


4. Advanced: Package and deploy model as a Bioclipse Decision Support Plugin

Bioclipse decision-support (Bioclipse-DS) is a set of plugins for Bioclipse aimed at aiding scientists in primarily chemical liability assessment. Packaging the model as a plugin for Bioclipse-DS involves several steps. Below is a draft outline of the main steps. Feel free to contact us if you need more detailed instructions.
  1. Downbload and install a recent version of Eclipse
  2. Check out Bioclipse from the source git repos, look here for details.
  3. In Eclipse, create a new Plug-in Project, name it for example net.bioclipse.ds.r.u251.
  4. Copy the Rdata file that we saved above into the plugin, place it into the folder "models" (create the folder if missing).
  5. Copy the Rdata file functionsForVirtualScreeningApplication.Rdata that we saved above into the plugin, place it into the same folder "models".
  6. If it does not exist, create a file plugin.xml and make it look like below:
    
     <plugin>
       <extension
             point="net.bioclipse.decisionsupport">
          <endpoint
                name="Cancer growth inhibition"
                description="Ability to inhibit growth in cancer cells"
                id="net.bioclipse.ds.r.u251.endpoint">
          </endpoint>
    
          <test
                id="screen.u251"
                name="U251 growth inhibition"
                endpoint="net.bioclipse.ds.r.u251.endpoint"
                class="net.bioclipse.ds.r.U251RModelMatcher"
                override="false">
                
                <resource name="cdk.descriptors" path="models/u251.cdk.desc" />
                <parameter name="rdata" value="models/u251randomForestModel.Rdata,models/functionsForVirtualScreeningApplication.Rdata" />
                <parameter name="trainedModel" value="u251.rf" />
                <parameter name="requiredPackages" value="randomForest" />
                <parameter name="requires3d" value="true" />
    
                <parameter name="Observations" value="3515" />
                <parameter name="Variables" value="210" />
                <parameter name="URL" value="http://genome-www.stanford.edu/nci60/" />
                <parameter name="Dataset name" value="NCI60-U251" />
                
           </test>
    
         </extension>
       </plugin>
       
    
  7. For the screening model we apply some logic that needs to go in a Java class. Create the class net.bioclipse.ds.r.U251RModelMatcher and make it look like:
    
    public class U251RModelMatcher extends DenseCDKRModelMatcher{
    
    	
    	protected String getRowNames(String tempvar) {
    		String rnames="names("+ tempvar + ") <- names(attr(u251.rf$terms, \"dataClasses\"))[-1]";
    		return rnames;
    	}
    
    
    	
    	/**
    	 * Provide the R commands to deliver the prediction command to R
    	 * from the input String (dense numerical vector with signature frequency).
    	 */
    	protected List<String> getPredictionString(String input){
    		
    		List<String> ret = new ArrayList<String>();
    		
    		String tempvar="tmp.pred."+getId();
    		String predictvar="predicted."+getId();
    		
    		ret.add(tempvar + " <- predict(u251.naAndStdvTreatment, " + input + ")");
    		ret.add(tempvar + " <- predict(u251.imputed, " + tempvar + ")");
    		ret.add("names("+ tempvar + ") <- rownames(u251.rf$importance)");
    		ret.add(predictvar + " <- predict(u251.rf, t(" + tempvar + "), type=\"prob\")[2]");
    		return ret;
    	}
    	
    }
    
    
  8. You need to create two more Eclipse projets: i) An Eclipse Feature to hold the plugin, and ii) An Eclipse Update Site plugin. Both are available from the menu New > Project... Add the created plugin to the feature, and add the feature to the Update Site Project. You are now ready to build the update site, click "Build All" in the Update Site Editor.
  9. You are now ready to install the feature in Bioclipse. The easiest way is to, in Bioclipse, open the Preferences and add an update site with the prefix "file://" followed by the local path on your computer. An alternative is to put the complete update site directory on a web server and add the URL as an update site. After adding the update site, go to menu Help > Software Update and your feature should appear. Install it, and after restart your model should be available in the Bioclipse DS View. Just open a chemical structure and press "Run" in the DS View toolbar, and the model will complete.


Back to the main tutorial.