Jeffrey Evans
Quantitative Methods in Spatial Ecology
Clustering (Aspatial and Spatial) using R
Cluster analysis is the process of using a statistical of mathematical model to find regions that are similar in multivariate space. This tutorial will cover basic clustering techniques. Clustering can be performed on spatial locations or attribute data.
First, let’s add the R libraries we will need
require(sp)
require(cluster)
require(e1071)
require(RColorBrewer)
require(spatialEco)
We will create some example data with clearly defined clusters.
x <- rbind(cbind(rnorm(100,0,0.5), rnorm(100,0,0.5)),
cbind(rnorm(150,5,0.5), rnorm(150,5,0.5)))
The optimal.k function in the spatialEco library finds the optimal supported k (number of clusters) based on the cluster silhouettes.
( clust <- optimal.k(x, 20, plot=TRUE, cluster=TRUE) )
Extract medoid cluster breaks and single variable medioid
clust$medoids
pam(x[,1], 1)$medoids
Join cluster and silhouette results to data
x <- data.frame(x, CLUSTER=clust$clustering)
x <- data.frame(x, SIL=clust$silinfo$widths[,"sil_width"])
head(x)
Plot cluster membership and fit
par(mfrow=c(2,2))
plot(x$X1, x$X2, pch=19, col=factor(x$CLUSTER), main="Cluster membership",
xlab="Cluster 1", ylab="Cluster 2")
legend("topleft", legend=c("Cluster 1", "Cluster 2"),
col=c("black","red"), pch=c(19,19), bg="white")
plot(silhouette(clust), col = c("red", "green"), main="Cluster mean silhouette's")
plot(clust, which.plots=1, main="K-MEDIOD FIT")
We can alternatively use withiness as a measure of optimal k and fit.
n=20
wss <- (ncol(x)-1)*sum(apply(x,2,var))
for (i in 2:n) wss[i] <- sum(kmeans(x,centers=i)$withinss)
plot(1:n, wss, type="b", main=paste("Optimal cluster", which(wss == max(wss)),
sep=" N="), xlab="Number of Clusters", ylab="Within groups sum of squares")
We find that 2 clusters are most supported. We will run a K-means using k=2
fit <- kmeans(x, 2)
Plot clusters against 1st 2 principal components and vary parameters for most readable graph
clusplot(x, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
We have been working with clear separability in our clusters and “hard boundary” results where discrete values are assigned to each cluster. What if we have “noisy” data or a gradient function or want a probability distribution associated with the clustered observations. An alternative is fuzzy-C means. Using scaled margin distance (distance of an observation in relation to the vector(s) separating clusters).
First we will create some data with a gradient between clusters and plot the data.
x <- rbind(cbind(rnorm(100,0,0.5), rnorm(100,0,0.5)),
cbind(rnorm(150,5,0.5), rnorm(150,5,0.5)))
x <- rbind(x, cbind(runif(100,min(x),max(x)),
runif(100,min(x),max(x))))
plot(x,pch=19)
We know from our previous analysis that we have strong support for k=2 so we will use this value in a fuzzy-C mean model and then assign resulting clusters and probabilities to our data.
( fcm <- cmeans(x, centers=2, dist="euclidean", method="cmeans") )
fcm$withinerror
x <- data.frame(x, CLUSTER=fcm$cluster, fcm$membership)
names(x) [4:5] <- c("ProbsC1","ProbsC2")
We then plot the results.
par(mfrow=c(1,2))
plot(x$X1, x$X2, pch=19, col=factor(x$CLUSTER), main="Cluster membership",
xlab="Cluster 1", ylab="Cluster 2")
legend("topleft", legend=c("Cluster 1", "Cluster 2"),
col=c("black","red"), pch=c(19,19), bg="white")
pRamp <- colorRampPalette(c("red","yellow","blue"))
x$Col <- pRamp(20)[as.numeric(cut(x$ProbsC1, breaks=20))]
plot(x$X1, x$X2, pch=19, col=x$Col, main="Cluster probabilities",
xlab="Cluster 1", ylab="Cluster 2")
We have been performing aspatial clustering using attributes. To cluster based on spatial location we can perform a spatially constrained clustering using an agglomerative hierarchical approach.
We will use the meuse dataset as an example of spatial clustering
data(meuse)
coordinates(meuse) <- ~x+y
cdat <- data.frame(x=coordinates(meuse)[,1],y=coordinates(meuse)[,2])
rownames(cdat) <- rownames(meuse@data)
We can use the R base functions hclust and dist to specify scaled distances and perform a spatial hierarchical clustering.
chc <- hclust(dist(cdat))
Here we can identify clusters using the k Nearest Neighbors (k=6)
chc.n <- cutree(chc, k=6)
We can also identify clusters based on a distance threshold (d=200).
chc.d200 <- cutree(chc, h=200)
Join kNN and distance results to meuse sp points
meuse@data <- data.frame(meuse@data, KNN=as.factor(chc.n), DClust=chc.d200)
Here we plot the kNN and distance clustering results
par(mfcol=c(1,2))
plot(meuse, col=factor(meuse@data$KNN), pch=19)
box(col="black")
title(main="KNN Clustering")
legend("topleft", legend=paste("Cluster", 1:6,sep=""),
col=palette()[1:6], pch=rep(19,6), bg="white")
pRamp <- colorRampPalette(c("blue","yellow","red"))
Col <- pRamp(20)[as.numeric(cut(meuse@data$DClust, breaks=20))]
plot(meuse, col=Col, pch=19)
box(col="black")
title(main="Distance Clustering k=70")
One final approach we will cover is using oblique decision trees (Gaudar et al., 2015) available in the SPODT library.
Gaudart, J., N. Graffeo, D. Coulibaly, G. Barbet, S. Rebaudet, N. Dessay, O.K. Doumbo, R. Giorgi (2015) SPODT: An R Package to Perform Spatial Partitioning. Journal of Statistical Software 63(16)
library(SPODT)
gr = 0.07 # graft parameter
rtw = 0.01 # rtwo.min
parm = 25 # min.parent
childm = 2 # min.child
lmx = 7
Spatial clustering with one parameter
meuse.clust <- spodt(meuse@data[,3] ~ 1, meuse, weight=TRUE, graft=gr, min.ch=childm,
min.parent=parm, level.max=lmx, rtwo.min=rtw)
meuse.clust@partition
length(unique(meuse.clust@partition))
Now we can plot clusters and the associated TREE partitions
plot( spodtSpatialLines(meuse.clust, meuse) )
plot(meuse, pch=20, col=factor(meuse.clust@partition), add=TRUE)
Now extend the model to multiple covariates on the right side of the equation.
meuse.clust <- spodt(meuse$cadmium ~ meuse$copper + meuse$lead + meuse$zinc,
meuse, weight=TRUE, graft=gr, min.ch=childm,
min.parent=parm, level.max=lmx, rtwo.min=rtw)
meuse.clust@partition
length(unique(meuse.clust@partition))
plot( spodtSpatialLines(meuse.clust, meuse) )
plot(meuse, pch=20, col=factor(meuse.clust@partition), add=TRUE)