HES 505 Fall 2022: Session 17
By the end of today you should be able to:
Distinguish deterministic and stochastic processes
Define autocorrelation and describe its estimation
Articulate the benefits and drawbacks of spatial autocorrelation
Leverage point patterns and autocorrelation to interpolate missing data
Visualization and the detection of patterns
The challenge of geographic data
Implications for analysis
A spatial process is a description of how a spatial pattern might be generated
Generative models
An observed pattern as a possible realization of a hypothesized process
\[ z = 2x + 3y + d \]
Considering each outcome as the realization of a process allows us to generate expected values
The simplest spatial process is a Completely Spatially Random (CSR) process
First Order effects: any event has an equal probability of occurring in a location
Second Order effects: the location of one event is independent of the other events
We can use quadrat counts to estimate the expected number of events in a given area
The probability of each possible count is given by:
\[ P(n,k) = {n \choose k}p^k(1-p)^{n-k} \]
\[ \begin{equation} P(k,n,x) = {n \choose k}\bigg(\frac{1}{x}\bigg)^k\bigg(\frac{x-1}{x}\bigg)^{n-k} \end{equation} \]
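The quadrat-count probability above is just the binomial mass function with success probability \(1/x\). A minimal base-R check, using hypothetical values (n = 100 events, x = 10 quadrats):

```r
# Probability of observing k of n events in one of x equal-sized quadrats under CSR:
# P(k) = choose(n, k) * (1/x)^k * ((x-1)/x)^(n-k)
quadrat_prob <- function(k, n, x) {
  choose(n, k) * (1 / x)^k * ((x - 1) / x)^(n - k)
}

# Hypothetical example: 100 events, 10 quadrats
p10 <- quadrat_prob(10, 100, 10)

# Matches base R's binomial density with p = 1/x
all.equal(p10, dbinom(10, 100, 1 / 10))
```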
‘Everything is related to everything else, but near things are more related than distant things.’
From Manuel Gimond
Use the spdep package
Estimate neighbors
Generate weighted average
library(spdep)

set.seed(2354)
# Load the Florida homicide-rate polygons
s <- readRDS(url("https://github.com/mgimond/Data/raw/gh-pages/Exercises/fl_hr80.rds"))
# Define the neighbors (queen case)
nb <- poly2nb(s, queen=TRUE)
# Build row-standardized spatial weights from the neighbor list
lw <- nb2listw(nb, style="W", zero.policy=TRUE)
# Estimate Moran's I
moran.test(s$HR80, lw, alternative="greater")
Moran I test under randomisation
data: s$HR80
weights: lw
Moran I statistic standard deviate = 1.8891, p-value = 0.02944
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.136277593 -0.015151515 0.006425761
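Moran's I itself is only a weighted cross-product of deviations from the mean. A minimal base-R sketch on a hypothetical four-cell transect (toy values and a hand-built binary rook weight matrix, not the Florida data above) makes the calculation concrete:

```r
# Moran's I: I = (n / S0) * sum_ij w_ij (z_i - zbar)(z_j - zbar) / sum_i (z_i - zbar)^2
morans_i <- function(z, W) {
  n   <- length(z)
  dev <- z - mean(z)
  num <- sum(W * outer(dev, dev))   # weighted cross-products of deviations
  den <- sum(dev^2)
  (n / sum(W)) * num / den
}

# Toy transect: four cells in a row, binary rook neighbors
W <- matrix(0, 4, 4)
W[cbind(1:3, 2:4)] <- 1
W <- W + t(W)                       # symmetric adjacency matrix

morans_i(c(1, 2, 3, 4), W)          # positive: values rise smoothly along the line
```

In practice spdep's moran.test (as above) adds the expectation, variance, and p-value; this sketch shows only the statistic.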
Spatial autocorrelation threatens second order randomness
Areal data means an infinite number of potential distances
Neighbor matrices, \(\boldsymbol W\), allow different characterizations
Goal: estimate the value of \(z\) at new locations \(\mathbf{x}\) given observations \(z_i\) at points \(\mathbf{x}_i\)
Most useful for continuous values
Nearest-neighbor, Inverse Distance Weighting, Kriging
Find \(i\) such that \(| \mathbf{x}_i - \mathbf{x}|\) is minimized
The estimate of \(z\) is \(z_i\)
library(sp)
library(sf)
library(terra)

data(meuse)
r <- rast(system.file("ex/meuse.tif", package="terra"))
sfmeuse <- st_as_sf(meuse, coords = c("x", "y"), crs=crs(r))
# Build a 25 m grid of prediction points
nodes <- st_make_grid(sfmeuse,
cellsize = 25,
what = "centers")
# Distance from every grid node to every observation
dist <- distance(vect(nodes), vect(sfmeuse))
# Index of the nearest observation for each node
nearest <- apply(dist, 1, which.min)
zinc_nn <- sfmeuse$zinc[nearest]
preds <- st_as_sf(nodes)
preds$zn <- zinc_nn
# Convert predictions to a raster and mask to the study area
preds <- as(preds, "Spatial")
gridded(preds) <- TRUE
preds.rast <- rast(preds)
r.resamp <- resample(r, preds.rast)
preds.rast <- mask(preds.rast, r.resamp)
\[ \begin{equation} \hat{z}(\mathbf{x}) = \frac{\sum_{i=1}^{n}w_iz_i}{\sum_{i=1}^{n}w_i} \end{equation} \] where
\[ \begin{equation} w_i = | \mathbf{x} - \mathbf{x}_i |^{-\alpha} \end{equation} \] and \(\alpha > 0\) (\(\alpha = 1\) is inverse; \(\alpha = 2\) is inverse square)
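The two equations above can be sketched directly in base R. A minimal one-dimensional example with hypothetical observations (not the meuse data):

```r
# Inverse distance weighted estimate at location x0:
# w_i = |x0 - x_i|^(-alpha); zhat = sum(w_i * z_i) / sum(w_i)
idw <- function(x0, x, z, alpha = 2) {
  w <- abs(x0 - x)^(-alpha)
  sum(w * z) / sum(w)
}

# Hypothetical observations z = 10 at x = 1 and z = 20 at x = 3
idw(2, c(1, 3), c(10, 20), alpha = 1)   # midway, equal weights: 15
```

Larger \(\alpha\) concentrates weight on the closest observations, so the surface approaches the nearest-neighbor result; smaller \(\alpha\) smooths toward the overall mean.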
terra::interpolate provides flexible interpolation methods
Use the gstat package to develop the formula
library(gstat)

# IDW models with different distance-decay exponents (idp = alpha)
mgsf05 <- gstat(id = "zinc", formula = zinc~1, data=sfmeuse, nmax=7, set=list(idp = 0.5))
mgsf2 <- gstat(id = "zinc", formula = zinc~1, data=sfmeuse, nmax=7, set=list(idp = 2))
interpolate_gstat <- function(model, x, crs, ...) {
v <- st_as_sf(x, coords=c("x", "y"), crs=crs)
p <- predict(model, v, ...)
as.data.frame(p)[,1:2]
}
zsf05 <- interpolate(r, mgsf05, debug.level=0, fun=interpolate_gstat, crs=crs(r), index=1)
zsf2 <- interpolate(r, mgsf2, debug.level=0, fun=interpolate_gstat, crs=crs(r), index=1)
Previous methods predict \(z\) as a (weighted) function of distance
Treat the observations as perfect (no error)
If we imagine that \(z\) is the outcome of some spatial process such that:
\[ \begin{equation} z(\mathbf{x}) = \mu(\mathbf{x}) + \epsilon(\mathbf{x}) \end{equation} \]
then any observed value of \(z\) is some function of the process (\(\mu(\mathbf{x})\)) and some error (\(\epsilon(\mathbf{x})\))
Correlation - the tendency for two variables to be related
Autocorrelation - the tendency for observations that are closer (in space or time) to be more highly correlated
Positive autocorrelation - neighboring observations have \(\epsilon\) with the same sign
Negative autocorrelation - neighboring observations have \(\epsilon\) with different signs (rare in geography)
\[
\begin{equation}
z(\mathbf{x}) = \mu + \epsilon(\mathbf{x})
\end{equation}
\] * Specified in the call to variogram and gstat as y~1 (or some other constant)
Assumes that the deterministic part of the process (\(\mu(\mathbf{x})\)) is now a function of the location \(\mathbf{x}\)
Could be the location or some other attribute
Now y is a function of some aspect of x
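The deterministic part of this model is just a trend fit to location. A minimal base-R sketch with hypothetical noiseless data (a real universal kriging analysis would then model autocorrelation in the residuals, e.g. with gstat):

```r
# Hypothetical transect where the mean trends with location: mu(x) = 2 + 3x
x <- 1:10
z <- 2 + 3 * x

# The trend (deterministic) component mu(x) of the universal kriging model
trend <- lm(z ~ x)

coef(trend)      # recovers intercept 2 and slope 3 exactly (no noise added)
resid(trend)     # kriging models the autocorrelation left in these residuals
```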
Relies on autocorrelation in \(\epsilon_1(\mathbf{x})\) for \(z_1\) AND cross-correlation with other variables (\(z_{2...j}\))
Extending the ordinary kriging model gives:
\[ \begin{equation} z_1(\mathbf{x}) = \mu_1 + \epsilon_1(\mathbf{x})\\ z_2(\mathbf{x}) = \mu_2 + \epsilon_2(\mathbf{x}) \end{equation} \] * Note that there is autocorrelation within both \(z_1\) and \(z_2\) (because of the \(\epsilon\)) and cross-correlation (because of the location, \(\mathbf{x}\))
gstat calls
gCoK <- gstat(NULL, 'log.zinc', log(zinc)~1, meuse, locations=~x+y)
gCoK <- gstat(gCoK, 'elev', elev~1, meuse, locations=~x+y)
gCoK <- gstat(gCoK, 'cadmium', cadmium~1, meuse, locations=~x+y)
coV <- variogram(gCoK)
coV.fit <- fit.lmc(coV, gCoK, vgm(model='Sph', range=1000))
coK <- interpolate(r, coV.fit, debug.level=0)
nugget - the semivariance at near-zero distance (measurement error plus micro-scale variation)
sill - the maximum semivariance between pairs of observations
range - the distance at which the semivariance reaches the sill
experimental vs. fitted variograms
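The experimental variogram behind these terms is just half the mean squared difference between pairs of observations at each lag. A minimal base-R sketch on hypothetical points (a real analysis would use gstat's variogram and vgm/fit.lmc, as above):

```r
# Experimental semivariogram:
# gamma(h) = mean over pairs in lag bin h of (z_i - z_j)^2 / 2
semivariogram <- function(coords, z, breaks) {
  d    <- as.matrix(dist(coords))     # pairwise distances between points
  g    <- outer(z, z, "-")^2 / 2      # half squared differences for all pairs
  keep <- upper.tri(d)                # count each pair once
  tapply(g[keep], cut(d[keep], breaks), mean)
}

# Three hypothetical points on a line with values 0, 1, 4
coords <- cbind(c(0, 1, 2), 0)
z <- c(0, 1, 4)
semivariogram(coords, z, breaks = c(0, 1.5, 2.5))
```

Fitting a model curve (e.g. spherical, as in the vgm(model='Sph', ...) call above) to these binned values is what yields the nugget, sill, and range estimates.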