Areal Data and Proximity

HES 505 Fall 2023: Session 20

Matt Williamson

Objectives

By the end of today you should be able to:

Describe and implement statistical approaches to interpolation
Describe the case for identifying neighbors with areal data
Implement contiguity-based neighborhood detection approaches
Implement graph-based neighborhood detection approaches

Statistical Interpolation

Previous methods predict \(z\) as a (weighted) function of distance
Treat the observations as perfect (no error)
If we imagine that \(z\) is the outcome of some spatial process such that:

Trend Surface Modeling

Basically a regression on the coordinates of your data points
Coefficients apply to the coordinates and their interaction
Relies on different functional forms

0th Order Trend Surface

Simplest form of trend surface
\(Z=a\) where \(a\) is the mean value of air quality
Result is a simple horizontal surface where all values are the same.

0th order trend surface

#set up interpolation grid
# Create an empty grid where n is the total number of cells
grd <- as.data.frame(spsample(as(id.cty, "Spatial"), "regular", n=20000))
names(grd)       <- c("X", "Y")
coordinates(grd) <- c("X", "Y")
gridded(grd)     <- TRUE  # Create SpatialPixel object
fullgrid(grd)    <- TRUE  # Create SpatialGrid object
proj4string(grd) <- proj4string(as(aq.sum, "Spatial"))
# Define the polynomial equation
f.0  <- as.formula(meanpm25 ~ 1)

# Run the regression model
lm.0 <- lm( f.0 , data=aq.sum)

# Use the regression model output to interpolate the surface
dat.0th <- SpatialGridDataFrame(grd, data.frame(var1.pred = predict(lm.0, newdata=grd)))

# Convert to raster object to take advantage of rasterVis' imaging
# environment
r   <- rast(dat.0th)
r.m <- mask(r, st_as_sf(id.cty))

tm_shape(r.m) + tm_raster( title="Predicted air quality") +tm_shape(aq.sum) + tm_dots(size=0.2) +
  tm_legend(legend.outside=TRUE)

1st Order Trend Surface

Creates a slanted surface
\(Z = a + bX + cY\)
X and Y are the coordinate pairs

1st Order Trend Surface

# Define the polynomial equation
f.1  <- as.formula(meanpm25 ~ X + Y)

aq.sum$X <- st_coordinates(aq.sum)[,1]
aq.sum$Y <- st_coordinates(aq.sum)[,2]

# Run the regression model
lm.1 <- lm( f.1 , data=aq.sum)

# Use the regression model output to interpolate the surface
dat.1st <- SpatialGridDataFrame(grd, data.frame(var1.pred = predict(lm.1, newdata=grd)))

# Convert to raster object to take advantage of rasterVis' imaging
# environment
r   <- rast(dat.1st)
r.m <- mask(r, st_as_sf(id.cty))

2nd Order Trend Surfaces

Produces a parabolic surface
\(Z = a + bX + cY + dX^2 + eY^2 + fXY\)
Highlights the interaction of both directions

2nd Order Trend Surfaces

# Define the 1st order polynomial equation
f.2 <- as.formula(meanpm25 ~ X + Y + I(X*X)+I(Y*Y) + I(X*Y))
 
# Run the regression model
lm.2 <- lm( f.2, data=aq.sum)

# Use the regression model output to interpolate the surface
dat.2nd <- SpatialGridDataFrame(grd, data.frame(var1.pred = predict(lm.2, newdata=grd))) 

r   <- rast(dat.2nd)
r.m <- mask(r, st_as_sf(id.cty))

tm_shape(r.m) + tm_raster(n=10, palette="RdBu", title="Predicted air quality") +tm_shape(aq.sum) + tm_dots(size=0.2) +
  tm_legend(legend.outside=TRUE)

Kriging

Previous methods predict \(z\) as a (weighted) function of distance
Treat the observations as perfect (no error)
If we imagine that \(z\) is the outcome of some spatial process such that:

\[ \begin{equation} z(\mathbf{x}) = \mu(\mathbf{x}) + \epsilon(\mathbf{x}) \end{equation} \]

then any observed value of \(z\) is some function of the process (\(\mu(\mathbf{x})\)) and some error (\(\epsilon(\mathbf{x})\))

Kriging exploits autocorrelation in \(\epsilon(\mathbf{x})\) to identify the trend and interpolate accordingly

Autocorrelation

Correlation the tendency for two variables to be related
Autocorrelation the tendency for observations that are closer (in space or time) to be correlated
Positive autocorrelation neighboring observations have \(\epsilon\) with the same sign
Negative autocorrelation neighboring observations have \(\epsilon\) with a different sign (rare in geography)

Ordinary Kriging

Assumes that the deterministic part of the process (\(\mu(\mathbf{x})\)) is an unknown constant (\(\mu\))

\[ \begin{equation} z(\mathbf{x}) = \mu + \epsilon(\mathbf{x}) \end{equation} \]

Steps for Ordinary Kriging

Removing any spatial trend in the data (if present).
Computing the experimental variogram, \(\gamma\), which is a measure of spatial autocorrelation.
Defining an experimental variogram model that best characterizes the spatial autocorrelation in the data.
Interpolating the surface using the experimental variogram.
Adding the kriged interpolated surface to the trend interpolated surface to produce the final output.

Removing Spatial Trend

Mean and variance need to be constant across study area
Trend surfaces indicate that is not the case
Need to remove that trend

f.2 <- as.formula(meanpm25 ~ X + Y + I(X*X)+I(Y*Y) + I(X*Y))
 
# Run the regression model
lm.2 <- lm( f.2, data=aq.sum)

# Copy the residuals to the point object
aq.sum$res <- lm.2$residuals

Removing the trend

Calculate the experimental variogram

nugget - the proportion of semivariance that occurs at small distances
sill - the maximum semivariance between pairs of observations
range - the distance at which the sill occurs
experimental vs. fitted variograms

A Note about Semivariograms

Fitted Semivariograms

Rely on functional forms to model semivariance

Calculate the experimental variogram

var.cld  <- gstat::variogram(res ~ 1, aq.sum, cloud = TRUE)
var.df  <- as.data.frame(var.cld)
index1  <- which(with(var.df, left==21 & right==2))

OP <- par( mar=c(4,6,1,1))
plot(var.cld$dist/1000 , var.cld$gamma, col="grey", 
     xlab = "Distance between point pairs (km)",
     ylab = expression( frac((res[2] - res[1])^2 , 2)) )

par(OP)

Simplifying the cloud plot

# Compute the sample experimental variogram
var.smpl <- gstat::variogram(f.2, aq.sum, cloud = FALSE)

bins.ct <- c(0, var.smpl$dist , max(var.cld$dist) )
bins <- vector()
for (i in 1: (length(bins.ct) - 1) ){
  bins[i] <- mean(bins.ct[ seq(i,i+1, length.out=2)] ) 
}
bins[length(bins)] <- max(var.cld$dist)
var.bins <- findInterval(var.cld$dist, bins)

# Point data cloud with bin boundaries
OP <- par( mar = c(5,6,1,1))
plot(var.cld$gamma ~ eval(var.cld$dist/1000), col=rgb(0,0,0,0.2), pch=16, cex=0.7,
     xlab = "Distance between point pairs (km)",
     ylab = expression( gamma ) )
points( var.smpl$dist/1000, var.smpl$gamma, pch=21, col="black", bg="red", cex=1.3)
abline(v=bins/1000, col="red", lty=2)

par(OP)

Looking at the sample Variogram

Estimating the sample variogram

var.smpl <- gstat::variogram(f.2, aq.sum, cloud = FALSE, cutoff = 1000000)


# Compute the variogram model by passing the nugget, sill and range values
# to fit.variogram() via the vgm() function.
dat.fit  <- gstat::fit.variogram(var.smpl, gstat::vgm(nugget = 12, range= 60000, model="Gau", cutoff=1000000))

Ordinary Kriging

[using ordinary kriging]

Ordinary Kriging

dat.krg <- gstat::krige( res~1, as(aq.sum, "Spatial"), grd, dat.fit)

Combining with the trend data

[using universal kriging]

Combining with the trend data

dat.krg <- gstat::krige( f.2, as(aq.sum, "Spatial"), grd, dat.fit)

r <- rast(dat.krg)$var1.pred
r.m <- mask(r, st_as_sf(id.cty))

# Plot the raster and the sampled points
tm_shape(r.m) + tm_raster(n=10, palette="RdBu", title="Predicted air quality") +tm_shape(aq.sum) + tm_dots(size=0.2) +
  tm_legend(legend.outside=TRUE)

Visualizing Uncertainty

Universal Kriging

Assumes that the deterministic part of the process (\(\mu(\mathbf{x})\)) is now a function of the location \(\mathbf{x}\)
Could be the location or some other attribute
Now y is a function of some aspect of x

vu <- variogram(log(zinc)~elev, ~x+y, data=meuse)
mu <- fit.variogram(vu, vgm(1, "Sph", 300, 1))
gUK <- gstat(NULL, "log.zinc", log(zinc)~elev, meuse, locations=~x+y, model=mu)
names(r) <- "elev"
UK <- interpolate(r, gUK, debug.level=0)

Universal Kriging

vu <- variogram(log(zinc)~x + x^2 + y + y^2, ~x+y, data=meuse)
mu <- fit.variogram(vu, vgm(1, "Sph", 300, 1))
gUK <- gstat(NULL, "log.zinc", log(zinc)~x + x^2 + y + y^2, meuse, locations=~x+y, model=mu)
names(r) <- "elev"
UK <- interpolate(r, gUK, debug.level=0)

Universal Kriging

Co-Kriging

relies on autocorrelation in \(\epsilon_1(\mathbf{x})\) for \(z_1\) AND cross correlation with other variables (\(z_{2...j}\))
Extending the ordinary kriging model gives:

\[ \begin{equation} z_1(\mathbf{x}) = \mu_1 + \epsilon_1(\mathbf{x})\\ z_2(\mathbf{x}) = \mu_2 + \epsilon_2(\mathbf{x}) \end{equation} \] * Note that there is autocorrelation within both \(z_1\) and \(z_2\) (because of the \(\epsilon\)) and cross-correlation (because of the location, \(\mathbf{x}\))

Not required that all variables are measured at exactly the same points

Co-Kriging

Process is just a linked series of gstat calls

gCoK <- gstat(NULL, 'log.zinc', log(zinc)~1, meuse, locations=~x+y)
gCoK <- gstat(gCoK, 'elev', elev~1, meuse, locations=~x+y)
gCoK <- gstat(gCoK, 'cadmium', cadmium~1, meuse, locations=~x+y)
coV <- variogram(gCoK)
coV.fit <- fit.lmc(coV, gCoK, vgm(model='Sph', range=1000))

coK <- interpolate(r, coV.fit, debug.level=0)