Manipulating spatial vector data in R

Welcome to "Manipulating spatial vector data in R". In this installment I will focus on manipulating spatial data classes. I will also introduce some basic distance and proximity analysis and autocorrelation. We will be using “sp”, which you may recall is the library that supports spatial objects in R, and “spdep” to support distance and proximity analysis.

We will start with bracket indexing. This is a very handy way of creating data subsets based on a positional index or attribute query.

Add required libraries

    library(sp)
    library(spdep)
    library(rgdal)

We will use the meuse dataset that is distributed with sp. Here we add the meuse data.frame object from sp library and coerce to sp object. We also display row/col dimensions and plot data.

    data(meuse)
    coordinates(meuse) <- ~x+y
    dim(meuse)
    plot(meuse, pch=19)

Using a bracket index, we will subset the first 10 observations. Remember that a comma at the beginning of the bracket statment represents columns and after, rows.

    meuse.sub <- meuse[1:10,]
      dim(meuse.sub)
        plot(meuse.sub, pch=19)

We can extend this concept to create a random sample (n=10) based on a row index using “sample”.

meuse.rs <- meuse[sample(seq(1:nrow(meuse)),10),]
plot(meuse.rs, pch=19)

We can also use this type of index to query data. Let’s look at the attributes in in the data slot.

str(meuse@data)

We can quickly look at the statistical moments of a given variable using summary.

summary(meuse@data$cadmium)

Let’s extract the 50th percentile (median) for cadmium using an index. Since summary results in a vector we can use an index without a comma. Median is in the third position so we use:

( p50 <- summary(meuse@data$cadmium)[3] )

We can now use the p50 object to query the data and subset to cadmium >= median

cad.sub <- meuse[meuse$cadmium >= p50 ,]

We can be even more efficient and, rather than creating a new, temporary object, we can use the summary function directly.

    cad.sub <- meuse[meuse$cadmium >= summary(meuse@data$cadmium)[3] ,]
      dim(cad.sub)
        plot(cad.sub, pch=19)

Query type indexing can also be used in calculations without subsetting data to a new object. Here we will calculate the percent of observations above the 75th percentile using dim.

dim(meuse[meuse$cadmium >= summary(meuse@data$cadmium)[4] ,])[1] / dim(meuse)[1]

Now let’s revisit where we left off in the last blog entry. We will create a for loop that subsets data based on the “soil” attribute in meuse. We use “unique” to list unique values in “soil”. The “assign” command is used to name each object and “paste” creates the text string of the object name.

    str(meuse@data)
        for( i in unique(meuse@data$soil) ) {
          assign(paste("soil", i, sep=""), meuse[meuse$soil == i ,])
        }

Now if we list the objects in our R session you will see that there are now SpatialPointsDataFrame objects for: "soil1", "soil2" and "soil3" that represent the spatial subsets. We can plot the three classes in a single plot although, keep in mind that there are ways to plot the classes without separating meuse into 3 separate objects. This is just to illustrate creating new objects based on querying the data.

ls()

class(soil2)

    plot(soil1, pch=19, col="black")
    plot(soil2, pch=19, col="red", add=TRUE)
    plot(soil3, pch=19, col="blue", add=TRUE)

Let’s start in on some distance analysis. We can use internal functions in R to summarize and query distances in relation to each observation.

We will start with a nearest neighbor (KNN) analysis to identify the nearest 4 points.

meuse.knn <- knearneigh(coordinates(meuse), k=4)

Look at the first six observations of the neighbor matrix. The row number corresponds to the row number of meuse and each column is the row index of the nearest neighbor.

head(meuse.knn$nn)

We can plot the linkages of k=4 using a graph structure

    plot(meuse, pch=19)
     plot(knn2nb(meuse.knn), coordinates(meuse), add=TRUE)
       title(main="K nearest neighbours, k=4")

Now we can subset the k=4 nearest observations for the fifth observation in meuse.

nn1.ids <- as.vector(meuse.knn$nn[5,])
nn1 <- meuse[nn1.ids,]

And then plot original observation with 4 nearest neighbors.

plot(nn1, pch=19, col="red")
plot(meuse[5,], pch=19, col="black", add=TRUE)

Now let’s perform a similar analysis based on distance rather than proximity.

Create distance object using the dnearneigh function in the spdep library.

meuse.dist <- dnearneigh(coordinates(meuse), 0.0001, 1000)

Coerce distance object to a list object with distances for each observation in meuse

dist.list <- nbdists(meuse.dist, coordinates(meuse))

Create a new column with the distance to the nearest observation using lapply and unlist

meuse@data <- data.frame(meuse@data, NNDist=unlist(lapply(dist.list, FUN=function(x) min(x))))

Plot results using spplot

spplot(meuse, "NNDist", col.regions=colorRampPalette(c("blue","yellow","red"), interpolate="spline")(10) )

Now that we have some of the basics of data manipulations and distance summary down, let’s start looking at 1st and 2nd order autocorrelation statistics. Autocorrelation is a measure of similarity and configuration in spatial data. When I refer to 1st order autocorrelation it is in relation to the global spatial trend in the data. Alternatively, 2nd order autocorrelation quantifies local spatial structure. This can also be referred to as nonstationarity or spatial outliers. The spdep package is specifically designed to quantify spatial dependency. We will start with global autocorrelation that will results in a single coefficient that represents autocorrelation across all of the data. The Moran’s-I ranges -1 to 1, where 1 represents clustered, -1 dispersed and 0 random.

Spatial autocorrelation

To model spatial autocorrelation we first need to create a spatial weights matrix (Wij). There are many ways to construct Wij including binary contingency, but we will focus on a distance matrix.

    nm <- knn2nb(knearneigh(meuse))
      all.linked <- max(unlist(nbdists(nm, meuse)))
        nb <- dnearneigh(meuse, 0, all.linked)
          colW <- nb2listw(nb, style="W")

Here we calculate the Morna’s-I

moran(meuse@data[,"cadmium"], colW, length(nb), Szero(colW))

We would also like to see the variance and expectation. There is a test statistic for this.

moran.test(meuse@data[,"cadmium"], nb2listw(nb, style="W"))

Even though we can see that we can accept the expected value the Morna’s-I uses the mean. As such, If the variable being tested is non-normally distributed the statistic is invalid. A way to account for this as well as test significance is a Monte Carlo permutation test using “moran.mc”. We will also plot the simulation. Given the “greater” hypothesis, we want to see the Moran’s-I value in the upper tail of the distribution.

( Iperm <- moran.mc(meuse@data[,"cadmium"], listw=colW, nsim=999, alternative="greater") )

plot(Iperm)

Now we will look at local autocorrelation using the Local Indicators of Spatial Association (LISA) statistic, which is a local version of the Moran’s-I. The 2nd order statistics represent localized spatial structure and indicate spatial outliers where an observation may represent a value significantly different than its neighbors. This is often interpreted as low values surrounded by high, high surrounded by low, high intermixed with high, low intermixed with low and not-significant.

Create a spatial weights matrix (Wij) using a distance bandwidth.

nb <- dnearneigh(meuse, 0, 500)

Calculate the local Moran’s-I (LISA)

mI <- localmoran(meuse@data[,"cadmium"], nb2listw(nb, style="B"))

Create a new sp point object from meuse that hold the local Moran’s-I results.

    LocalI <- meuse
    LocalI@data <- data.frame(ID=rownames(LocalI@data), as.data.frame(mI))
      names(LocalI@data)[6] <- "Pr"

Return the number of observations that are significant at p=0.05

dim( LocalI[LocalI@data[,"Pr"] < 0.05 ,] )

Plot the frequency of the p-values

hist(LocalI@data[,"Pr"], col="blue", xlab="p-value")

Plot the critical values (z) using spplot.

spplot(LocalI, "Z.Ii", xlab="Local Morans-I", col.regions=topo.colors(50))

The high and low have no distinct I values so one must create a vector to distinguish significant and high hotspots. Red represents hot spots. This is a spatial points object that can be written out as a shape file or ASCII flat file with associated coordinates.

    LocalI@data <- data.frame(LocalI@data, HotSpots=ifelse( mI[,5] <= 0.05 & mI[,4] >= mean(mI[,4]), 1, 0) )
    LocalI@data$HotSpots <- as.factor(LocalI@data$HotSpots)
    spplot(LocalI, "HotSpots", xlab="Local Moran’s-I Hot Spots", col.regions=c("blue","red") )

In "Raster analysis in R" I will cover basic raster analysis and in "Plotting spatial objects in R" I will cover plotting using the base plotting engine.