Quarto and literate programming

Quarto

This is a Quarto document (in fact, this whole webpage and all of the slides were built with Quarto). Quarto uses the knitr package to render files containing R, python, and julia to Markdown as a means of rendering code, text, math, figures, and tables to a variety of formats.

Workflow diagram starting with a qmd file, then knitr, then md, then pandoc, then PDF, MS Word, or HTML.

Markdown is a simple formatting syntax for authoring HTML documents (it’s the basis for the Readme docs that GitHub creates for you). From there, RStudio calls pandoc to render the markdown file into your chosen output format. I’m telling you this because there will be times when some part of this pipeline may break and you’ll need to know where the errors might be coming from.

You can create new Quarto documents by going to File >> New File >> New Quarto Document (or Presentation). There are lots of new documents devoted to Quarto, but some of them may assume you have some familiarity with Markdown or Rmarkdown. As such, I’m keeping this links to helpful Rmarkdown resources like this cheatsheet and a much longer user’s guide in case you need more in-depth discussion of some of the ideas behind authoring in Quarto. I don’t expect you to become an expert in Quarto, but it is a helpful way to keep all of your thoughts and code together in a single, coherent document. Getting proficient in Quarto and git allows you to work with collaborators on an analysis, graphics, and manuscript all within a single platform. This fully-integrated workflow takes practice and patience (especially when you have collaborators that are new to this approach), this course is just an initial step down that path. I’ll do my best to keep it simple - please let me know if you have questions!

The Example

Setup

The University of Exeter has been conducting an ongoing survey to understand the age at which the belief in Santa Claus begins to drop off. A sample of the data is located in your assignment01 folder. Our task is to bring the data into R, conduct some preliminary exploration of the data, and then fit a model to the data to see if age predicts belief in Santa. We’ll start by branching off of the master Quarto doc in our GitHub repo and then work through the steps together.

Pseudocode

Before we get started, let’s sketch out the steps in our analysis using pseudocode. If you take a look at the tasks I’ve outlined above, you might construct your pseudocode like this:

Code
LOAD: all packages that we need for the analysis
READ: Data located in isthereasanta.txt
CHECK: Data structure and values
CLEAN: Are there odd values?
PLOT: Age vs Belief
MODEL: GLM of Age vs. belief

Programming

Now that we have the basic steps in place, let’s transform the pseudocode into a repeatable Quarto document that explains what we’re doing, why, and what we found.

Load the packages

Part of what makes R so powerful for data analysis is the number of ready-made functions and packages that are designed for all the things. That said, you can’t take advantage of that power if you don’t load them into your session so that their functions become available. In general, it’s best to do that first thing your document so that other folks can see what packages are necessary before you start running analyses. If you pay attention when these packages load, you may see warnings that a function is masked. This happens because two (or more) packages have functions with the same name. We can be explicit about which version we want by using packagename::functionname(). You’ll see that more later this semester.

Code
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.3     ✔ purrr   1.0.2
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Read the Data

Based on our pseudocode our first step is the read the data. We can create headings in Quarto using different numbers of # symbols to keep things organized. The code below uses ``` to create the code chunk and then {r} to tell Quarto which environment to use when running it. I’m specifying a filepath because I’m not working within our git repo, this isn’t great practice, but it’s necessary for the webpage to render correctly. We use paste0 to combine the filepath with the file name (isthereasanta.txt) then read in the data using read_table.

Code
filepath <- "/Users/mattwilliamson/Google Drive/My Drive/TEACHING/Intro_Spatial_Data_R/Data/2022/assignment01/"
#READ
santa <- read_table(paste0(filepath, "isthereasanta.txt"))

Check out the Data

Now that we’ve got the data loaded and assigned it to the santa object. It’s always a good idea to take a look and make sure things look the way you expect, check for NAs, and get a basic understanding of the way your data is being represented by R. This process will get more involved once we start working with spatial data, but it’s good to get in the habit now. We’ll start by looking at the first few rows (using head()), then get a sense for the classes of data using str(), and check for any NAs.

Code
head(santa)
str(santa)
any(is.na(santa))

You’ll notice a few things. First, because we read this in using the read_table function, the result is a tibble. As such, head() returns both the data and the classes. This makes the result of str() largely redundant (note that if santa were a data.frame this would not be true). The combination of any() with is.na() asks whether any of the cells in santa have an NA value. You can see that there are NAs. Most statistical modeling functions in R don’t like NAs so we’ll try to clean those up here. Before we clean them, let’s try to learn what they are. We can use which() to identify the locations of the NAs.

Code
which(is.na(santa), arr.ind = TRUE)

We see that all of them are in the age column (our key predictor variable!). We could also have discovered this using summary().

Code
summary(santa)

Clean the data

Deciding how to clean NAs is an important decision. Many people choose to drop any incomplete records. We can do that with complete.cases() and see that the resulting object now has only 47 rows.

Code
santa_complete_cases <- santa[complete.cases(santa),]

Dropping the incomplete cases may seem like a “safe” approach, but what if there is some systematic reason for the data to be incomplete. Maybe older people are less likely to provide their age? If that’s the case, then dropping these cases may bias our dataset and the models that result. In that case, we may decide to “impute” values for the NAs based on some principled approach. We’ll talk more about what it means to take a principled approach to imputation later in this class. For now, let’s just try to strategies: 1 where we assign the mean() value of age and one where we assign the max() value (to reflect our hypothesis that older people may not provide their age). We’ll do this by using the ifelse() function. Note that we can only do this because all of the NAs are in a single column.

Code
santa_mean <- santa
santa_mean$Age <- ifelse(is.na(santa_mean$Age), round(mean(santa_mean$Age, na.rm=TRUE),digits=0), santa_mean$Age)

santa_max <- santa
santa_max$Age <- ifelse(is.na(santa_max$Age), max(santa_max$Age, na.rm=TRUE), santa_max$Age)

Plot the Data

Now that we have a few clean datasets, let’s just take a quick look to see if our intuition is correct about the relationship between age and belief in santa. The idea isn’t so much to “prove” your hypothesis, but rather to get to know your data better as a means of identifying potential outliers and thinking about the distribution of your data.

Code
plot(Believe ~ Age, data=santa_complete_cases, main="Age vs. Belief in Santa (complete cases)")

plot(Believe ~ Age, data=santa_mean, main="Age vs. Belief in Santa (Age at mean)")

plot(Believe ~ Age, data=santa_max, main="Age vs. Belief in Santa (Age at max)")

These plots highlight two things. First, because Believe is a logical variable, the only possible outcomes are 0 and 1. This means we can’t fit a typical linear regression (we’ll use a logistic regression instead). Also, we notice that our choice of imputation strategy makes a difference! Let’s fit some models and see what kind of difference it makes.

Fit Some Models

We’ll be using a generalized linear model for this analysis. The details will come up later, but for now, let’s keep it simple. The syntax for the glm() function is relatively straightforward. First we specify the model Believe ~ Age, then we tell it what family binomial(link="logit"), then we remind R of the data. We use the binomial family because there are only 2 possible outcomes (TRUE and FALSE).

Code
fit_complete_cases <- glm(Believe ~ Age, family=binomial(link="logit"), data=santa_complete_cases)
fit_mean <- glm(Believe ~ Age, family=binomial(link="logit"), data=santa_mean)
fit_max <- glm(Believe ~ Age, family=binomial(link="logit"), data=santa_max)

summary(fit_complete_cases)$coef
summary(fit_mean)$coef
summary(fit_max)$coef

We see the older a person is, the less likely they are to believe in Santa! We also see that the choice of how we handle NAs affects the size of the effect, but not the direction. In class, we’ll write a function to simulate some new data based on this model and see if our results are robust to different assumptions.

Rendering the document

When you click the Render button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.