Statistical Modelling II

HES 505 Fall 2023: Session 23

Matt Williamson

Objectives

By the end of today you should be able to:

  • Articulate the differences between statistical learning classifiers and logistic regression

  • Describe classification trees and their relationship to Random Forests

  • Describe MaxEnt models for presence-only data

Revisiting Classification

Favorability in General

\[ \begin{equation} F(\mathbf{s}) = f(w_1X_1(\mathbf{s}), w_2X_2(\mathbf{s}), w_3X_3(\mathbf{s}), ..., w_mX_m(\mathbf{s})) \end{equation} \]

  • Logistic regression treats \(f(x)\) as a (generalized) linear function

  • Allows for multiple qualitative classes

  • Ensures that estimates of \(F(\mathbf{s})\) lie in [0, 1]
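
A minimal sketch of this in R, using simulated data (the predictors x1 and x2 and their coefficients are invented for illustration): the inverse-logit, plogis(), keeps fitted values in [0, 1].

# Simulated presence/absence data with two hypothetical predictors
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
p  <- plogis(-0.5 + 1.2 * x1 - 0.8 * x2)  # true probabilities via the inverse-logit
y  <- rbinom(n, size = 1, prob = p)
# Logistic regression: a GLM with a binomial family and logit link
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
# Fitted values are on the probability scale, so they lie in [0, 1]
range(fitted(fit))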

Key assumptions of logistic regression

  • Dependent variable must be binary

  • Observations must be independent (important for spatial analyses)

  • Predictors should not be collinear (a quick check is sketched after this list)

  • Predictors should be linearly related to the log-odds

  • Sample size should be large enough to support the number of predictors (estimates become unstable with few events per predictor)
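
One quick screen for collinearity is to inspect pairwise correlations among the predictors; a minimal sketch, assuming preds is a data frame holding only the candidate predictors (the name and the 0.7 cutoff are illustrative):

# Pairwise correlations among candidate predictors
cor_mat <- cor(preds, use = "pairwise.complete.obs")
# Flag strongly correlated pairs (|r| > 0.7) as candidates to drop or combine
which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)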

Beyond Linearity

  • Logistic (and other generalized linear models) are relatively interpretable

  • Probability theory allows robust inference of effects

  • Predictive power can be low

  • Relaxing the linearity assumption can help
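
For example, a spline basis relaxes the linearity-on-the-log-odds assumption while staying inside the GLM framework; a minimal sketch with simulated data (the curved relationship is invented for illustration):

library(splines)
# Simulate a response whose log-odds are a curved function of x
set.seed(1)
x <- runif(300, -2, 2)
y <- rbinom(300, size = 1, prob = plogis(1 - 2 * x^2))
fit_lin  <- glm(y ~ x, family = binomial)             # assumes linear log-odds
fit_spln <- glm(y ~ ns(x, df = 3), family = binomial) # natural spline relaxes that
# Compare fits; the spline model should be favored when the relationship is curved
AIC(fit_lin, fit_spln)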

Classification Trees

  • Use decision rules to segment the predictor space

  • Series of consecutive decision rules form a ‘tree’

  • Terminal nodes (leaves) are the outcome; internal nodes (branches) the splits

Classification Trees

  • Divide the predictor space (\(R\)) into \(J\) non-overlapping regions

  • Every observation in \(R_j\) gets the same prediction

  • Recursive binary splitting

  • Pruning and over-fitting

An Example

Inputs from the dismo package

An Example

The sample data

head(pres.abs)
Simple feature collection with 6 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -106.75 ymin: 31.25 xmax: -98.75 ymax: 37.75
Geodetic CRS:  GCS_unknown
  y              geometry
1 0  POINT (-99.25 35.25)
2 1  POINT (-98.75 36.25)
3 1 POINT (-106.75 35.25)
4 0 POINT (-100.75 31.25)
5 1  POINT (-99.75 37.75)
6 1 POINT (-104.25 36.75)

An Example

Building our dataframe

# Extract the covariate values at each presence/absence point
pts.df <- terra::extract(pred.stack, vect(pres.abs), df=TRUE)
head(pts.df)
  ID MeanAnnTemp TotalPrecip PrecipWetQuarter PrecipDryQuarter MinTempCold
1  1         155         667              253               71         350
2  2         147         678              266               66         351
3  3         123         261              117               40         329
4  4         181         533              198               69         348
5  5         127         589              257               48         338
6  6          83         438              213               38         278
  TempRange
1       -45
2       -58
3       -64
4        -5
5       -81
6      -107

An Example

Building our dataframe

# Center and scale the covariates (mean 0, unit variance)
pts.df[,2:7] <- scale(pts.df[,2:7])
summary(pts.df)
       ID          MeanAnnTemp       TotalPrecip      PrecipWetQuarter 
 Min.   :  1.00   Min.   :-3.3729   Min.   :-1.3377   Min.   :-1.6926  
 1st Qu.: 25.75   1st Qu.:-0.4594   1st Qu.:-0.7980   1st Qu.:-0.6895  
 Median : 50.50   Median : 0.2282   Median :-0.2373   Median :-0.2224  
 Mean   : 50.50   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 75.25   3rd Qu.: 0.7118   3rd Qu.: 0.7140   3rd Qu.: 0.6508  
 Max.   :100.00   Max.   : 1.4285   Max.   : 2.4843   Max.   : 2.2713  
 PrecipDryQuarter   MinTempCold        TempRange      
 Min.   :-1.0828   Min.   :-3.9919   Min.   :-2.7924  
 1st Qu.:-0.7013   1st Qu.:-0.0598   1st Qu.:-0.5216  
 Median :-0.3770   Median : 0.3582   Median : 0.2075  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.4290   3rd Qu.: 0.5495   3rd Qu.: 0.6450  
 Max.   : 3.1713   Max.   : 1.1092   Max.   : 2.0407  

An example

  • Fitting the classification tree
library(tree)
# Attach the response to the covariate data frame and recode it as a factor
pts.df <- cbind(pts.df, pres.abs$y)
colnames(pts.df)[8] <- "y"
pts.df$y <- as.factor(ifelse(pts.df$y == 1, "Yes", "No"))
# Fit the classification tree using all remaining columns (including ID) as predictors
tree.model <- tree(y ~ . , pts.df)
# Plot the tree and label the splits
plot(tree.model)
text(tree.model, pretty=0)

An example

  • Fitting the classification tree
summary(tree.model)

Classification tree:
tree(formula = y ~ ., data = pts.df)
Variables actually used in tree construction:
[1] "TempRange"        "PrecipWetQuarter" "ID"               "MeanAnnTemp"     
Number of terminal nodes:  8 
Residual mean deviance:  0.3164 = 29.11 / 92 
Misclassification error rate: 0.07 = 7 / 100 
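
The pruning step mentioned earlier can be carried out with cross-validation; a minimal sketch using the tree package and the tree.model object fit above (scoring by misclassification error is one common choice):

# Cross-validate over tree sizes, scoring candidates by misclassification error
set.seed(123)
cv.model <- cv.tree(tree.model, FUN = prune.misclass)
# Select the tree size with the lowest cross-validated error
best.size <- cv.model$size[which.min(cv.model$dev)]
pruned.model <- prune.misclass(tree.model, best = best.size)
plot(pruned.model)
text(pruned.model, pretty = 0)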

Benefits and drawbacks

Benefits

  • Easy to explain

  • Links to human decision-making

  • Graphical displays

  • Easy handling of qualitative predictors

Costs

  • Lower predictive accuracy than other methods

  • Not necessarily robust

Random Forests

  • Grow hundreds (or thousands) of trees on bootstrapped samples of the data

  • Random sample of predictors considered at each split

  • Reduces correlation among the individual trees’ predictions

  • Average of trees improves overall outcome (usually)

  • Lots of extensions

An example

  • Fitting the Random Forest
library(randomForest)
# Same formula as the tree: all remaining columns (including ID) as predictors
class.model <- y ~ .
# Fit the forest with the default 500 trees
rf2 <- randomForest(class.model, data=pts.df)
# Variable importance (mean decrease in Gini impurity across trees)
varImpPlot(rf2)
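
A hedged sketch of mapping the fitted forest back across the study area with terra::predict, assuming pred.stack is the covariate SpatRaster used earlier. The ID column is dropped so the predictors match the raster layers, and terra::scale (layer-wise means and sds) only approximates the sample-based scaling applied to pts.df:

# Refit without the ID column so predictor names match the raster layers
rf.pred <- randomForest(y ~ . - ID, data = pts.df)
# Scale the raster layers to roughly match the scaled training data
pred.scaled <- terra::scale(pred.stack)
# Predicted class probabilities for every cell
rf.map <- terra::predict(pred.scaled, rf.pred, type = "prob", na.rm = TRUE)
plot(rf.map)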

Modelling Presence-Background Data

The sampling situation

  • Opportunistic collection of presences only

  • Hypothesized predictors of occurrence are measured (or extracted) at each presence

  • Background points (or pseudoabsences) generated for comparison
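
One common way to generate background points is to draw random cells from the covariate raster; a minimal sketch with terra::spatSample, assuming the pred.stack raster from the earlier example (the sample size of 500 is arbitrary):

# Draw 500 random background locations from non-NA cells of the covariate stack
set.seed(42)
bg.pts <- terra::spatSample(pred.stack, size = 500, method = "random",
                            na.rm = TRUE, as.points = TRUE)
# The sampled points carry the covariate values, ready to combine with the presences
bg.df <- as.data.frame(bg.pts)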

The Challenge with Background Points

  • What constitutes background?

  • Not measuring probability, but relative likelihood of occurrence

  • Sampling bias affects estimation

  • The intercept

\[ \begin{equation} \begin{aligned} y_{i} &\sim \text{Bern}(p_i)\\ \text{link}(p_i) &= \mathbf{x_i}'\beta + \alpha \end{aligned} \end{equation} \]

Maximum Entropy models

  • MaxEnt (after the original software)

  • Need plausible background points across the remainder of the study area

  • Iterative fitting finds the distribution of maximum entropy (the one closest to uniform) that is still consistent with the covariate values at the presence locations

  • Tuning parameters to account for differences in sampling effort, placement of background points, etc.

  • Development of the model beyond the scope of this course, but see Elith et al. 2010
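
One way to fit such a model in R is the maxnet package (a pure-R implementation of MaxEnt); a hedged sketch, where pres.covs and bg.covs are hypothetical data frames of covariates at the presence and background points with identical columns:

library(maxnet)
# Stack presence and background covariates; p marks presences (1) vs background (0)
env <- rbind(pres.covs, bg.covs)
p   <- c(rep(1, nrow(pres.covs)), rep(0, nrow(bg.covs)))
# Fit the MaxEnt model with the default feature classes
mx <- maxnet(p, env, f = maxnet.formula(p, env))
# cloglog output is commonly interpreted as a relative likelihood of occurrence
mx.pred <- predict(mx, newdata = env, type = "cloglog")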

Challenges with MaxEnt

  • Not measuring probability, but relative likelihood of occurrence

  • Sampling bias affects estimation (but can be mitigated using tuning parameters)

  • Theoretical issues with background points and the intercept

  • Recent developments relate MaxEnt (with cloglog links) to Inhomogeneous Point Process models

Extensions

  • Polynomials, splines, piecewise regression

  • Neural nets, Support Vector Machines, many many more