Clustering using R | evansspatial

Clustering (Aspatial and Spatial) using R

Cluster analysis is the process of using a statistical of mathematical model to find regions that are similar in multivariate space. This tutorial will cover basic clustering techniques. Clustering can be performed on spatial locations or attribute data.

First, let’s add the R libraries we will need

require(sp)

require(cluster)

require(e1071)

require(RColorBrewer)

require(spatialEco)

We will create some example data with clearly defined clusters.

x <- rbind(cbind(rnorm(100,0,0.5), rnorm(100,0,0.5)),

cbind(rnorm(150,5,0.5), rnorm(150,5,0.5)))

The optimal.k function in the spatialEco library finds the optimal supported k (number of clusters) based on the cluster silhouettes.

( clust <- optimal.k(x, 20, plot=TRUE, cluster=TRUE) )

Extract medoid cluster breaks and single variable medioid

clust$medoids

pam(x[,1], 1)$medoids

Join cluster and silhouette results to data

x <- data.frame(x, CLUSTER=clust$clustering)

x <- data.frame(x, SIL=clust$silinfo$widths[,"sil_width"])

head(x)

Plot cluster membership and fit

par(mfrow=c(2,2))

plot(x$X1, x$X2, pch=19, col=factor(x$CLUSTER), main="Cluster membership",

xlab="Cluster 1", ylab="Cluster 2")

legend("topleft", legend=c("Cluster 1", "Cluster 2"),

col=c("black","red"), pch=c(19,19), bg="white")

plot(silhouette(clust), col = c("red", "green"), main="Cluster mean silhouette's")

plot(clust, which.plots=1, main="K-MEDIOD FIT")

We can alternatively use withiness as a measure of optimal k and fit.

n=20

wss <- (ncol(x)-1)*sum(apply(x,2,var))

for (i in 2:n) wss[i] <- sum(kmeans(x,centers=i)$withinss)

plot(1:n, wss, type="b", main=paste("Optimal cluster", which(wss == max(wss)),

sep=" N="), xlab="Number of Clusters", ylab="Within groups sum of squares")

We find that 2 clusters are most supported. We will run a K-means using k=2

fit <- kmeans(x, 2)

Plot clusters against 1st 2 principal components and vary parameters for most readable graph

clusplot(x, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

We have been working with clear separability in our clusters and “hard boundary” results where discrete values are assigned to each cluster. What if we have “noisy” data or a gradient function or want a probability distribution associated with the clustered observations. An alternative is fuzzy-C means. Using scaled margin distance (distance of an observation in relation to the vector(s) separating clusters).

First we will create some data with a gradient between clusters and plot the data.

x <- rbind(cbind(rnorm(100,0,0.5), rnorm(100,0,0.5)),

cbind(rnorm(150,5,0.5), rnorm(150,5,0.5)))

x <- rbind(x, cbind(runif(100,min(x),max(x)),

runif(100,min(x),max(x))))

plot(x,pch=19)

We know from our previous analysis that we have strong support for k=2 so we will use this value in a fuzzy-C mean model and then assign resulting clusters and probabilities to our data.

( fcm <- cmeans(x, centers=2, dist="euclidean", method="cmeans") )

fcm$withinerror

x <- data.frame(x, CLUSTER=fcm$cluster, fcm$membership)

names(x) [4:5] <- c("ProbsC1","ProbsC2")

We then plot the results.

par(mfrow=c(1,2))

plot(x$X1, x$X2, pch=19, col=factor(x$CLUSTER), main="Cluster membership",

xlab="Cluster 1", ylab="Cluster 2")

legend("topleft", legend=c("Cluster 1", "Cluster 2"),

col=c("black","red"), pch=c(19,19), bg="white")

pRamp <- colorRampPalette(c("red","yellow","blue"))

x$Col <- pRamp(20)[as.numeric(cut(x$ProbsC1, breaks=20))]

plot(x$X1, x$X2, pch=19, col=x$Col, main="Cluster probabilities",

xlab="Cluster 1", ylab="Cluster 2")

We have been performing aspatial clustering using attributes. To cluster based on spatial location we can perform a spatially constrained clustering using an agglomerative hierarchical approach.

We will use the meuse dataset as an example of spatial clustering

data(meuse)

coordinates(meuse) <- ~x+y

cdat <- data.frame(x=coordinates(meuse)[,1],y=coordinates(meuse)[,2])

rownames(cdat) <- rownames(meuse@data)

We can use the R base functions hclust and dist to specify scaled distances and perform a spatial hierarchical clustering.

chc <- hclust(dist(cdat))

Here we can identify clusters using the k Nearest Neighbors (k=6)

chc.n <- cutree(chc, k=6)

We can also identify clusters based on a distance threshold (d=200).

chc.d200 <- cutree(chc, h=200)

Join kNN and distance results to meuse sp points

meuse@data <- data.frame(meuse@data, KNN=as.factor(chc.n), DClust=chc.d200)

Here we plot the kNN and distance clustering results

par(mfcol=c(1,2))

plot(meuse, col=factor(meuse@data$KNN), pch=19)

box(col="black")

title(main="KNN Clustering")

legend("topleft", legend=paste("Cluster", 1:6,sep=""),

col=palette()[1:6], pch=rep(19,6), bg="white")

pRamp <- colorRampPalette(c("blue","yellow","red"))

Col <- pRamp(20)[as.numeric(cut(meuse@data$DClust, breaks=20))]

plot(meuse, col=Col, pch=19)

box(col="black")

title(main="Distance Clustering k=70")

One final approach we will cover is using oblique decision trees (Gaudar et al., 2015) available in the SPODT library.

Gaudart, J., N. Graffeo, D. Coulibaly, G. Barbet, S. Rebaudet, N. Dessay, O.K. Doumbo, R. Giorgi (2015) SPODT: An R Package to Perform Spatial Partitioning. Journal of Statistical Software 63(16)

library(SPODT)
gr = 0.07 # graft parameter
    rtw = 0.01 # rtwo.min
parm = 25 # min.parent
    childm = 2 # min.child
    lmx = 7

Spatial clustering with one parameter

meuse.clust <- spodt(meuse@data[,3] ~ 1, meuse, weight=TRUE, graft=gr, min.ch=childm,
min.parent=parm, level.max=lmx, rtwo.min=rtw)
meuse.clust@partition
length(unique(meuse.clust@partition))

Now we can plot clusters and the associated TREE partitions

plot( spodtSpatialLines(meuse.clust, meuse)   )
plot(meuse, pch=20, col=factor(meuse.clust@partition), add=TRUE)

Now extend the model to multiple covariates on the right side of the equation.

meuse.clust <- spodt(meuse$cadmium ~ meuse$copper + meuse$lead + meuse$zinc,
                     meuse, weight=TRUE, graft=gr, min.ch=childm,
                     min.parent=parm, level.max=lmx, rtwo.min=rtw)
meuse.clust@partition
length(unique(meuse.clust@partition))

plot( spodtSpatialLines(meuse.clust, meuse)   )

plot(meuse, pch=20, col=factor(meuse.clust@partition), add=TRUE)