HES 505 Fall 2023: Session 22
By the end of today you should be able to:
Identify nearest neighbors based on distance
Describe and implement overlay analyses
Extend overlay analysis to statistical modeling
Generate spatial predictions from statistical models
Moran’s I coefficient is the slope of the regression of the lagged asthma percentage vs. the asthma percentage in the tract
More generally, it is the slope of the regression of the spatially lagged average on the measurement itself
cdc$casthma_cr
0.1670774
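The value above (0.1670774, labeled with the predictor name) is that regression slope. A minimal sketch of the calculation, assuming lw.nearest is the spdep neighbor-weights (listw) object used throughout this session:
library(spdep)
# Spatially lagged asthma percentage: the weighted average among each tract's neighbors
asthma.lag <- lag.listw(lw.nearest, cdc$casthma_cr)
# Moran's I is the slope of the regression of the lagged values on the observed values
M <- lm(asthma.lag ~ cdc$casthma_cr)
coef(M)[2]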
We can generate the expected distribution of Moran’s I coefficients under a Null hypothesis of no spatial autocorrelation
Using permutation and a loop to generate simulations of Moran’s I
n <- 400L                 # Define the number of simulations
I.r <- vector(length=n)   # Create an empty vector to hold the simulated slopes
for (i in 1:n){
  # Randomly shuffle the asthma values, breaking any spatial structure
  x <- sample(cdc$casthma_cr, replace=FALSE)
  # Compute a new set of spatially lagged values
  x.lag <- lag.listw(lw.nearest, x)
  # Compute the regression slope and store its value
  M.r <- lm(x.lag ~ x)
  I.r[i] <- coef(M.r)[2]
}
We can assess the significance of Moran's I in several ways:
A pseudo p-value based on the permutations above
Analytically (sensitive to deviations from assumptions)
Using Monte Carlo simulation
[1] 0.002493766
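A minimal sketch of how the pseudo p-value above can be computed from the permuted slopes I.r (the observed slope, 0.1670774, comes from the regression shown earlier):
# Observed Moran's I (the slope estimated from the real data)
I.obs <- 0.1670774
# Pseudo p-value: proportion of permutations at least as extreme as the observed
# statistic, counting the observed value itself as one permutation
p.pseudo <- (sum(I.r >= I.obs) + 1) / (n + 1)
p.pseudo  # 1/401 = 0.002493766 when no permuted slope exceeds the observed one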
Moran I test under randomisation
data: cdc$casthma_cr
weights: lw.nearest
Moran I statistic standard deviate = 61.661, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
1.670774e-01 -4.990020e-04 7.385950e-06
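The analytical test above is standard spdep output; a call along these lines would produce it, again assuming lw.nearest is the neighbor-weights (listw) object:
library(spdep)
# Analytical test of Moran's I under the randomisation assumption
moran.test(cdc$casthma_cr, lw.nearest, alternative = "greater")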
Monte-Carlo simulation of Moran I
data: cdc$casthma_cr
weights: lw.nearest
number of simulations + 1: 401
statistic = 0.16708, observed rank = 401, p-value = 0.002494
alternative hypothesis: greater
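Likewise, the Monte Carlo output above matches what spdep's moran.mc() prints; a hedged sketch of the call, with nsim = 400 inferred from "number of simulations + 1: 401":
# Permutation (Monte Carlo) test of Moran's I
moran.mc(cdc$casthma_cr, lw.nearest, nsim = 400, alternative = "greater")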
Methods for identifying optimal sites or assessing suitability
Apply a common scale to diverse or dissimilar outputs
Define the problem.
Break the problem into submodels.
Determine significant layers.
Reclassify or transform the data within a layer.
Add or combine the layers.
Verify the results.
Successive disqualification of areas
Series of “yes/no” questions
“Sieve” mapping
Reclassifying
Which types of land are appropriate
Assume relationships are really Boolean
No measurement error
Categorical measurements are known exactly
Boundaries are well-represented
\[ \begin{equation} F(\mathbf{s}) = \prod_{i=1}^{m}X_i(\mathbf{s}) \end{equation} \]
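As a rough sketch (not from the slides), Boolean "sieve" overlay can be implemented by multiplying binary rasters with terra; the layer names (slope_ok, cover_ok) and the simulated values are hypothetical:
library(terra)
# Hypothetical binary layers: 1 = passes the criterion, 0 = fails
slope_ok <- rast(matrix(rbinom(100, 1, 0.6), nrow = 10))
cover_ok <- rast(matrix(rbinom(100, 1, 0.5), nrow = 10))
# Sieve mapping: the product is 1 only where every layer is 1
suitable <- slope_ok * cover_ok
plot(suitable)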
\[ \begin{equation} F(\mathbf{s}) = f(w_1X_1(\mathbf{s}), w_2X_2(\mathbf{s}), w_3X_3(\mathbf{s}), ..., w_mX_m(\mathbf{s})) \end{equation} \]
\(F(\mathbf{s})\) does not have to be binary (could be ordinal or continuous)
\(X_m(\mathbf{s})\) could also be extended beyond simply ‘suitable/not suitable’
Adding weights allows incorporation of relative importance
Other functions for combining inputs (\(X_m(\mathbf{s})\))
\[ \begin{equation} F(\mathbf{s}) = \frac{\sum_{i=1}^{m}w_iX_i(\mathbf{s})}{\sum_{i=1}^{m}w_i} \end{equation} \]
\(F(\mathbf{s})\) is now an index based on the values of \(X_i(\mathbf{s})\)
\(w_i\) can incorporate weights of evidence, uncertainty, or different participant preferences
Dividing by \(\sum_{i=1}^{m}w_i\) normalizes by the sum of weights
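As a rough sketch (not from the slides), this weighted linear combination can be computed directly with raster algebra in terra; the score layers (slope_score, road_score, cover_score) and the weights are hypothetical:
library(terra)
# Hypothetical weights for three suitability-score layers
w <- c(slope = 0.5, road = 0.3, cover = 0.2)
# Weighted linear combination, normalized by the sum of the weights
F_s <- (w["slope"] * slope_score + w["road"] * road_score + w["cover"] * cover_score) / sum(w)
# terra::weighted.mean(c(slope_score, road_score, cover_score), w) should give the same result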
\[ \begin{equation} F(\mathbf{s}) = w_0 + \sum_{i=1}^{m}w_iX_i(\mathbf{s}) + \epsilon \end{equation} \]
If we estimate \(w_i\) using data, we specify \(F(\mathbf{s})\) as the outcome of a regression
When \(F(\mathbf{s})\) is binary → logistic regression
When \(F(\mathbf{s})\) is continuous → linear (or Gamma) regression
When \(F(\mathbf{s})\) is discrete (counts) → Poisson regression
Assumptions about \(\epsilon\) matter!!
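A minimal sketch of how those outcome types map onto glm() families in R; the data frame (site_df) and the predictor names are hypothetical:
# Binary suitability (0/1): logistic regression
m.bin  <- glm(suitable ~ slope + dist_road, family = binomial(link = "logit"), data = site_df)
# Continuous, positive index: Gamma regression with a log link
m.cont <- glm(suit_index ~ slope + dist_road, family = Gamma(link = "log"), data = site_df)
# Discrete counts: Poisson regression
m.cnt  <- glm(n_events ~ slope + dist_road, family = poisson(link = "log"), data = site_df)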
To identify important correlations between predictors and the occurrence of an event
Generate maps of the ‘range’ or ‘niche’ of events
Understand spatial patterns of event co-occurrence
Forecast changes in event distributions
From Long
Spatially referenced locations of events \((\mathbf{y})\) sampled from the study extent
A matrix of predictors \((\mathbf{X})\) that can be assigned to each event based on spatial location
Goal: Estimate the probability of occurrence of events across unsampled regions of the study area based on correlations with predictors
Random or systematic sample of the study region
The presence (or absence) of the event is recorded for each point
Hypothesized predictors of occurrence are measured (or extracted) at each point
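A hedged sketch of that sampling workflow with terra; the predictor stack name (climate) and the sample size are assumptions, and the presence/absence response would come from the survey itself:
library(terra)
# Random sample of locations across the study region (non-NA cells only)
pts <- spatSample(climate, size = 100, method = "random", as.points = TRUE, na.rm = TRUE)
# Extract the hypothesized predictors at each sampled location
pts.df <- extract(climate, pts)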
We can model favorability as the probability of occurrence using a logistic regression
A link function connects the probability \(p_i\) to the linear predictor \((\mathbf{x_i}'\beta + \alpha)\), keeping predictions on the (0, 1) support for probabilities
Estimates of \(\beta\) can then be used to generate ‘wall-to-wall’ spatial predictions
\[ \begin{equation} y_{i} \sim \text{Bern}(p_i)\\ \text{link}(p_i) = \mathbf{x_i}'\beta + \alpha \end{equation} \]
Inputs from the dismo package
The sample data
Simple feature collection with 6 features and 1 field
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -106.75 ymin: 31.25 xmax: -98.75 ymax: 37.75
Geodetic CRS: GCS_unknown
y geometry
1 0 POINT (-99.25 35.25)
2 1 POINT (-98.75 36.25)
3 1 POINT (-106.75 35.25)
4 0 POINT (-100.75 31.25)
5 1 POINT (-99.75 37.75)
6 1 POINT (-104.25 36.75)
Building our dataframe
  ID MeanAnnTemp TotalPrecip PrecipWetQuarter PrecipDryQuarter MinTempCold TempRange
1  1         155         667              253               71         350       -45
2  2         147         678              266               66         351       -58
3  3         123         261              117               40         329       -64
4  4         181         533              198               69         348        -5
5  5         127         589              257               48         338       -81
6  6          83         438              213               38         278      -107
Building our dataframe
ID MeanAnnTemp TotalPrecip PrecipWetQuarter
Min. : 1.00 Min. :-3.3729 Min. :-1.3377 Min. :-1.6926
1st Qu.: 25.75 1st Qu.:-0.4594 1st Qu.:-0.7980 1st Qu.:-0.6895
Median : 50.50 Median : 0.2282 Median :-0.2373 Median :-0.2224
Mean : 50.50 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 75.25 3rd Qu.: 0.7118 3rd Qu.: 0.7140 3rd Qu.: 0.6508
Max. :100.00 Max. : 1.4285 Max. : 2.4843 Max. : 2.2713
PrecipDryQuarter MinTempCold TempRange
Min. :-1.0828 Min. :-3.9919 Min. :-2.7924
1st Qu.:-0.7013 1st Qu.:-0.0598 1st Qu.:-0.5216
Median :-0.3770 Median : 0.3582 Median : 0.2075
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.4290 3rd Qu.: 0.5495 3rd Qu.: 0.6450
Max. : 3.1713 Max. : 1.1092 Max. : 2.0407
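Every predictor in the summary above has mean 0, which suggests the columns were centered and scaled; a minimal sketch of that step, assuming pts.df holds the extracted (unscaled) values:
# Center and scale each predictor column (leaving the ID column untouched)
pts.df[, 2:7] <- scale(pts.df[, 2:7])
summary(pts.df)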
Looking at correlations
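A minimal sketch of how those correlations could be examined, assuming the scaled predictors sit in columns 2 through 7 of pts.df:
# Pairwise Pearson correlations among the predictors
round(cor(pts.df[, 2:7]), 2)
# A quick graphical check
pairs(pts.df[, 2:7])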
Fitting some models
# Attach the presence/absence response to the predictor data frame
pts.df <- cbind(pts.df, pres.abs$y)
colnames(pts.df)[8] <- "y"
# Global model: all six predictors
logistic.global <- glm(y ~ ., family = binomial(link = "logit"), data = pts.df[, 2:8])
# Simple model: temperature and total precipitation
logistic.simple <- glm(y ~ MeanAnnTemp + TotalPrecip, family = binomial(link = "logit"), data = pts.df[, 2:8])
# Richer precipitation model
logistic.rich <- glm(y ~ MeanAnnTemp + PrecipWetQuarter + PrecipDryQuarter, family = binomial(link = "logit"), data = pts.df[, 2:8])
Checking out the results
Call:
glm(formula = y ~ ., family = binomial(link = "logit"), data = pts.df[,
2:8])
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4461 0.5096 -2.837 0.00455 **
MeanAnnTemp -6.3578 6.1645 -1.031 0.30237
TotalPrecip 7.1453 4.5577 1.568 0.11694
PrecipWetQuarter -5.4207 3.0432 -1.781 0.07487 .
PrecipDryQuarter -1.3110 2.2482 -0.583 0.55981
MinTempCold 3.0890 2.6334 1.173 0.24080
TempRange -0.6213 4.5470 -0.137 0.89131
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 125.374 on 99 degrees of freedom
Residual deviance: 51.764 on 93 degrees of freedom
AIC: 65.764
Number of Fisher Scoring iterations: 7
Checking out the results
Call:
glm(formula = y ~ MeanAnnTemp + TotalPrecip, family = binomial(link = "logit"),
data = pts.df[, 2:8])
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.9880 0.3145 -3.141 0.00168 **
MeanAnnTemp -2.9990 0.6647 -4.512 6.42e-06 ***
TotalPrecip 0.3924 0.3827 1.025 0.30517
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 125.374 on 99 degrees of freedom
Residual deviance: 68.108 on 97 degrees of freedom
AIC: 74.108
Number of Fisher Scoring iterations: 6
Checking out the results
Call:
glm(formula = y ~ MeanAnnTemp + PrecipWetQuarter + PrecipDryQuarter,
family = binomial(link = "logit"), data = pts.df[, 2:8])
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.96504 0.35650 -2.707 0.00679 **
MeanAnnTemp -2.85446 0.66142 -4.316 1.59e-05 ***
PrecipWetQuarter 0.03212 0.43102 0.075 0.94060
PrecipDryQuarter 0.16759 0.64935 0.258 0.79634
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 125.374 on 99 degrees of freedom
Residual deviance: 69.006 on 96 degrees of freedom
AIC: 77.006
Number of Fisher Scoring iterations: 6
Comparing models
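One common way to make this comparison is by AIC, using the values reported in the summaries above (a hedged sketch; other criteria such as cross-validation could be used instead):
# Compare the three candidate models by AIC (lower is better)
AIC(logistic.global, logistic.simple, logistic.rich)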
Generating predictions
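A hedged sketch of generating 'wall-to-wall' predictions with terra::predict(); the predictor stack name (climate) is an assumption, and its layer names (and scaling) must match the variables used to fit the model:
# Predicted probability of occurrence for every cell in the study area
p.map <- terra::predict(climate, logistic.simple, type = "response")
plot(p.map)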
Dependent variable must be binary
Observations must be independent (important for spatial analyses)
Predictors should not be collinear
Predictors should be linearly related to the log-odds
Sample size must be adequate (roughly 10 events per predictor is a common rule of thumb)
Opportunistic collection of presences only
Hypothesized predictors of occurrence are measured (or extracted) at each presence
Background points (or pseudoabsences) generated for comparison
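A hedged sketch of building a presence/background data set with terra; pres (a SpatVector of presence records), the climate stack, and the number of background points are all assumptions:
library(terra)
# Random background (pseudoabsence) locations drawn from non-NA cells
bg <- spatSample(climate, size = 1000, method = "random", as.points = TRUE, na.rm = TRUE)
# Combine presences (y = 1) and background (y = 0) into one modeling data frame
pb.df <- rbind(
  data.frame(y = 1, extract(climate, pres, ID = FALSE)),
  data.frame(y = 0, extract(climate, bg, ID = FALSE))
)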
What constitutes background?
Not measuring probability, but relative likelihood of occurrence
Sampling bias affects estimation
The intercept \((\alpha)\) reflects sampling effort rather than true prevalence
\[ \begin{equation} y_{i} \sim \text{Bern}(p_i)\\ \text{link}(p_i) = \mathbf{x_i}'\beta + \alpha \end{equation} \]