Jeffrey Evans
Quantitative Methods in Spatial Ecology
Random Forests Species Distribution Model
In this tutorial we will use the Random Forests model to estimate the probabilities of a species distribution. We will leverage the spatial classes of R to construct the data used in the model, run a model selection procedure to specify a final model, predict a probability surface (raster), validate the model using a back-prediction method and generate various plots.
First we need to download the tutorial data. Change the working directory (setwd) to a place that you would like to work. The tutorial data will be downloaded and uncompressed into a RFLAB subdirectory automatically.
setwd("C:/TEMP")
download.file("http://s3.amazonaws.com/nrcs-sgi/EvansModels/RFLAB.zip", destfile="RFLAB.zip", mode="wb")
unzip("RFLAB.zip")
Data Preparation
We will now add the required packages, set the working environment, and define a few parameters.
library(randomForest)
library(rfUtilities)
library(sp)
library(raster)
library(rgdal)
setwd(paste(getwd(),"RFLAB",sep="/"))
inshape="CactusPresAbs" # Name of input point shapefile representing species P/A observations
b=1001 # Number of Bootstrap replicates
We can use a wildcard statement in the “list.files” function to create a vector of on-disk rasters in the working directory and an associated raster stack object. Note that after creating the list of rasters we remove entrie 8 from the vector because we do not elevation in the model.
rpath=getwd()
rlist=list.files(rpath, pattern="img$", full.names=TRUE)
rlist <- rlist[-c(8)]
xvars <- stack(rlist)
You could nest functions and shorten the above code to a single line.
xvars <- stack(list.files(getwd(), pattern="img$", full.names=TRUE)[-8])
We will now read the species observation point featureclass shapefile using the readOGR function in the rgdal library. After we read the point data into an sp SpatialPointsDataFrame object we assign values from the raster stack and display the structure of the resulting attribute table in the @data slot.
sdata <- readOGR(dsn=getwd(), layer=inshape)
sdata@data <- data.frame(sdata@data, extract(xvars, sdata))
str(sdata@data)
Multicolinearity test
To improve model fit we will test for multicolinearity using the “multi.collinear” function in the rfUtilities library. If any variables are identified as multicolinear we will remove them from the model.
cl <- multi.collinear(sdata@data[,3:ncol(sdata@data)], p=0.05)
We observe that four variables: "mat30", "sar", "spost27", and "spost9" are multicollinear. There are however, variables that can act as "hingepin" variables and cause other variables to appear multicolnear whent there are in fact, not redundant. Here we can perform a "leave one out" test to evaluate if any of the varialbes are forcing other varibles to appear multicollinear.
for(l in cl) {
cl.test <- sdata@data[,-which(names(sdata@data)==l)]
print(paste("Remove variable", l, sep=": "))
multi.collinear(cl.test, p=0.05)
}
We observe that foe each "leave one out" the other varibles remain muticolinear. This indicates that all four varibles are in fact redundant in multivarite space and should be removed from the model.
sdata@data <- sdata@data[,-which(names(sdata@data) %in% cl )]
Random Forests Model
We will now coerce the dependent variable (y) into factorial vector and calculate the percent of the positive (1) class to check for the sample balance (zero inflation issues) in the model.
sdata@data$Present <- as.factor(sdata@data$Present)
( dim(sdata[sdata$Present == 1, ])[1] / dim(sdata)[1] ) * 100
We observe that the sample balance of presence locations is 33% thus, meeting the 1/3 rule for sample balance.
Now we apply a model selection approach to fine the most parsimonious that is not fitting noise. This allows us to test the model parameters and select the model with the best error component.
( rf.model <- rf.modelSel(x=sdata@data[,3:ncol(sdata@data)], y=sdata@data[,"Present"], imp.scale="mir", ntree=b) )
We can then select a specific model or use the default selected model and then run a final Random Forests model. This allows us to assess the model and select a competing model that may exhibit similar error but a more desirable set of parameters.
First we select the default parameters
sel.vars <- rf.model$selvars
Or, alternatively, choose a specific model (in this case, parameters associated with model 4).
sel.vars <- rf.model$parameters[[4]]
We can then subset the parameters and run a fit model.
( rf.fit <- randomForest(y=sdata@data[,"Present"], x=sdata@data[,sel.vars], ntree=b, importance=TRUE,
norm.votes=TRUE, proximity=TRUE) )
Model Estimation (probability raster creation)
Using the final model we can then predict the probabilities of the positive class. Note the “index” argument in the predict function allows you to select a specific class for the estimated probabilities.
xvars <- stack(paste(getwd(), paste(rownames(rf.fit$importance), "img", sep="."), sep="/"))
predict(xvars, rf.fit, "model.probs.img", type="prob", index=2, na.rm=TRUE, overwrite=TRUE, progress="window")
Now we can read in the probability raster and plot the results with the presence/absence points in red and blue.
r <- raster("model.probs.img")
plot(r)
plot( sdata[sdata$Present == "1" ,], add = TRUE, col="red", pch=20, cex=0.75)
plot( sdata[sdata$Present == "0" ,], add = TRUE, col="blue", pch=20, cex=0.75)
Model Fit
Now we will perform an evaluation of model fit. We can separate this step into model fit and model validation or quality. The fit is based on the internal validation conducted against the withheld data in the Bootstrap.
First, we back predict to the data and look at the model fit. Because the model is fit to the data, this should be almost perfect.
rf.pred <- predict(rf.fit, sdata@data[,sel.vars], type="response")
rf.prob <- as.data.frame(predict(rf.fit, sdata@data[,sel.vars], type="prob"))
obs.pred <- data.frame(cbind(Observed=as.numeric(as.character(sdata@data[,"Present"])),
PRED=as.numeric(as.character(rf.pred)), Prob1=rf.prob[,2],
Prob0=rf.prob[,1]) )
op <- (obs.pred$Observed == obs.pred$PRED)
Here we display the percent correctly classified and the ROC plot.
( pcc <- (length(op[op == "TRUE"]) / length(op))*100 )
library(verification)
roc.plot(obs.pred[,"Observed"], obs.pred[,"Prob1"])
Model Validation
For validation we need some type of independent assessment of how our model performed. Because we do not want to leave data out of our model we can take a Bootstrap approach and evaluate model performance by using a permutation against random and with a percent leave-out n times.
First we will test model significance against random by breaking the relationship between x and y. For brevities sake we will only run 99 permutations but one should run at least 999.
( rf.perm <- rf.significance(rf.fit, sdata@data[,sel.vars], nperm = 99, ntree = 1001) )
Second, we can perform a cross validation, where a percentage of the data is removed, the model run on the subset data and then validated against the withheld data. This allows us to evaluate model performance against independent data as well as look at performance against variation in the data structure. Keep in mind that you are evaluating an error distribution representing multiple realizations of the data.
( rf.cv <- rf.crossValidation(rf.fit, sdata@data[,sel.vars], p=0.10, n=99, ntree=1001) )
Random Forests Plots
We can generate a variety of plots represent model error convergence, variable importance, partial plots representing the relationship between the range of a given variable and the estimated probability distribution and some novel plots representing the multivariate dissimilarity of the proximity matrix.
We can plot the random forests object using randomForests default plot function to assess the models error convergence. This is one way we can refine the number of Bootstrap replicates required to stabilize the model error. One should keep in mind that variable interaction stabilizes at a much slower rate than error. As a rule of thumb I use twice the error convergence. However, there is a balance between error convergence, variable interaction convergence and overfit. It is a misnomer that one cannot overfit a Random Forests model. If the ensemble is overly correlated (redundant Bootstrap samples) then the model can overfit.
plot(rf.fit, main="Bootstrap Error Convergence")
Plot the scaled variable importance. This indicates the contribution that each retained variable has to the final model.
p <- as.matrix(rf.fit$importance[,3])
ord <- rev(order(p[,1], decreasing=TRUE)[1:dim(p)[1]])
dotchart(p[ord,1], main="Scaled Variable Importance", pch=19)
We can plot the range of a variable against the estimate probability using partial plots. This function applies a lowess polynomial regression to smooth stochasticity and plots both the smoothed and raw probability distribution. Here we will plot the probability partial plots for the first four variables in our model.
par(mfrow=c(2,2))
for(i in sel.vars[1:4]) {
rf.partial.prob(rf.fit, sdata@data[,sel.vars], i, "1", smooth="spline", raw.line=FALSE)
}
To explore the multivariate relationships of the data we can create scaled multivariate distances (MDS) of the Random Forests proximities. This can also be done with the raw (x) independent variables.
rf.cmd <- cmdscale(1 - rf.fit$proximity, eig=TRUE, k=4)
pa.col=ifelse(sdata@data[,"Present"] == "1", "red", ifelse(sdata@data[,"Present"] == "0", "black", NA))
rf.cmd <- data.frame(rf.cmd$points)
Plot the first 2 MDS dimensions.
plot(rf.cmd[,1:2], ylab="DIM 1", xlab="DIM 2", pch=16, col=pa.col, main= paste("Pres/Abs", "PROXIMITY MATRIX MDS d=2", sep=" - "))
legend("topright",pch=c(16,16), col=c("black", "red"),
legend=c("Absent","Present"))
3D plots of the first MDS dimensions. Note the added information portrayed by the addition of the third dimension. We will need the "rgl" package to generate the 3D plot.
library(rgl)
plot3d(rf.cmd[,1],rf.cmd[,2],rf.cmd[,3], col=pa.col, pch=18, size=1.25, type="s", xlab="MDS dim 1", ylab="MDS dim 2",
zlab="MDS dim 3")