```{r setup, include=FALSE}
  knitr::opts_chunk$set(echo = TRUE)
  knitr::opts_chunk$set(dev = 'pdf')
  def.chunk.hook  <- knitr::knit_hooks$get("chunk")
  knitr::knit_hooks$set(chunk = function(x, options) {
    x <- def.chunk.hook(x, options)
    ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
  })
```

## Conditions for Validity in Linear Regression
Recall: these are the conditions required for statistical inference in the context of a linear model to be valid:

 - The relationship between $X$ and $Y$, if there is one, is actually \underline{\textbf{L}inear}
   - e.g. not quadratic, exponential, etc.

\vspace{2mm}

 - \textbf{I}ndependence of observations

\vspace{2mm}

 - \textbf{N}ormality of $\epsilon_i$
   - Note that this can also be achieved, due to the central limit theorem, with a large sample size even if $\epsilon_i$ does not follow a normal distribution

\vspace{2mm}

 - \textbf{E}qual variance across all values of $X$
   - Also known as homoskedasticity
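
\vspace{2mm}

For reference, these are statements about the simple linear regression model
$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i,
$$
under which the LINE conditions say that the errors $\epsilon_i$ are independent and normally distributed with mean 0 and the same variance $\sigma^2$ at every value of $x$.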


## Example: March Madness knowledge vs. bracket scores
\vspace{1mm}
```{r, echo=FALSE, warning=FALSE, message=FALSE}
library(readr)
library(ggplot2)
march_madness <- read_csv("march_madness.csv")

knowledge <- march_madness$knowledge
bracket_score <- march_madness$bracket_score

theme_update(text = element_text(size = 20))
ggplot(data = march_madness, aes(y = bracket_score, x=knowledge)) +
  geom_point() + geom_smooth(method='lm', se=FALSE) +
  xlab("knowledge score") + ylab("bracket score") + 
  ggtitle("Students from Villanova University in Spring 2019")
```

## Example: March Madness knowledge vs. bracket scores {.t}
\framesubtitle{The data}

1. Students from a class I was teaching in Spring 2019 at Villanova University were asked to fill out March Madness brackets for the NCAA Men's College Basketball Championship (n=50). 

\vspace{1mm}

2. They also filled out a questionnaire to assess their "basketball knowledge." The questions were things like: 
   - How many men's college basketball games did you watch this season?
   - On how many teams in the tournament can you name at least one player? Two players? The head coach?
   - When watching a basketball game, how often are you able to identify any particular strategies that are being used?

\vspace{1mm}
   
3. Their responses to the questions were summarized in a "knowledge score" from 0 to 10. 

\vspace{1mm}

4. Brackets were scored according to default ESPN settings (10 pts for each correct Round 1 pick, 20 pts for each correct Round 2 pick, with the per-pick value doubling each round).


## Example: March Madness knowledge vs. bracket scores {.t}
In Lab 4, you investigated the consequences of violations of the required conditions for validity on Type I Error rates in the context of simple linear regression, and you also did a small bit of model diagnostics (specifically, the Q-Q plot).

\vspace{3mm}

::: {.t}
### So what are we doing today? {.t}
\vspace{-2mm}
Today we will run through a complete set of model diagnostics for each of the LINE conditions.

:::

Note: we can never actually know for sure whether the conditions are met (just as we can never know whether we have made a Type I Error). But we CAN (and should) look at what the data suggest.

## Residual Analyses
::: {.block}
### Quote of the Day
\vspace{-2mm}
Machine learning is statistics minus any checking of models and \cancel{assumptions} conditions.
:::

\begin{minipage}{0.8\textwidth}
\begin{flushright}
Dr. Brian Ripley \\
Member of R Core Development Team \\
Professor of Applied Statistics (retired) \\
University of Oxford \\
\end{flushright}
\end{minipage}
\begin{minipage}{0.19\textwidth}
\includegraphics[width=\textwidth]{brian-d-ripley.jpg}
\end{minipage}



## Residual Analyses

::: {.t}
### Why do we do model diagnostics when doing statistical inference?
\vspace{-2mm}
Statistical tests have conditions that need to be satisfied in order for the test to be valid.
:::


\vspace{8mm}

::: {.t}
### Why don't we need model diagnostics when doing machine learning?
\vspace{-2mm}
The goal in that setting is model building and prediction! These things do not require any conditions to be satisfied.
:::

\vspace{3mm}

On the other hand, note that violations of the conditions can lead to greater prediction error. However, since a machine learning workflow by design seeks to minimize prediction error directly, you usually don't care too much about how you got there.


## Residual Analyses {.t}
Typically, regression model diagnostics proceed by utilizing the \underline{residuals} of the model. What's a residual again??


::: {.t}
### Recall from Lecture \#7:
\vspace{-2mm}
The Ordinary Least Squares criterion is to choose the line minimizing:
$$
RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$

 - Each $e_i = y_i - \hat{y}_i$ is a residual: our estimate of the unobservable error $\epsilon_i$.
 - $\hat{y}_i$ is the fitted value of $y$ for the $i^{th}$ point, according to the ordinary least squares regression line.

:::

What are the residuals for the March Madness data?


## Residual Analysis
\vspace{1mm}
```{r, echo=FALSE, warning=FALSE, message=FALSE}
theme_update(text = element_text(size = 20))
ggplot(data = march_madness, aes(y = bracket_score, x=knowledge)) +
  geom_point() + geom_smooth(method='lm', se=FALSE) +
  xlab("knowledge score") + ylab("bracket score") + 
  ggtitle("Students from Villanova University in Spring 2019")
```


## Residual Analysis

Recall that the equation for the blue line came from:
```{r, size="scriptsize"}
model1 <- lm(bracket_score ~ knowledge, data=march_madness)
summary(model1)
```

## Residual Analysis {.t}
\vspace{-4mm}

$$
\hat{y}_i = `r round(model1$coefficients[1], 2)` + `r round(model1$coefficients[2], 2)` \cdot x_i
$$

where $x_i$ is the knowledge score of the $i^{th}$ participant. Here's a snapshot of the first few rows of the `march_madness` dataframe:

```{r, echo=FALSE}
knitr::kable(march_madness[1:6,9:10])
```

What are the residuals? https://pollev.com/chi

## Residual Analysis {.t}
Thankfully, we don't actually have to do that every time (I just wanted us to do it to make sure we know how to get them in principle). The `model1` object has them for us:

```{r, size="footnotesize"}
names(model1)
```

### Let's grab the first 6 just to compare to the table:
```{r, size="footnotesize"}
model1$residuals[1:6]
```
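
As a quick sanity check, these match what we would get by hand, since `fitted.values` is another component of `model1` (residual = observed $-$ fitted):
```{r, size="footnotesize"}
# residuals by hand: observed minus fitted values
(march_madness$bracket_score - model1$fitted.values)[1:6]
```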

## Residual Analysis {.t}
\framesubtitle{Now, how do we use them?}

 - The relationship between $X$ and $Y$, if there is one, is actually \underline{\textbf{L}inear}
   - e.g. not quadratic, exponential, etc.

\vspace{2mm}

 - \textbf{I}ndependence of observations

\vspace{2mm}

 - \textbf{N}ormality of $\epsilon_i$
   - Note that this can also be achieved, due to the central limit theorem, with a large sample size even if $\epsilon_i$ does not follow a normal distribution

\vspace{2mm}

 - \textbf{E}qual variance across all values of $X$
   - Also known as homoskedasticity


## Residual Analysis {.t}
\framesubtitle{\textbf{E}qual variance across all values of $X$}
```{r, size="footnotesize", out.width="75%"}
march_madness$residuals <- model1$residuals
ggplot(data=march_madness, mapping=aes(x=knowledge, y=residuals)) + 
  geom_point() + labs(title="Residual Plot")
```



## Residual Analysis {.t}

### Some Questions
\vspace{-2mm}

 - What does this tell us that just looking at the original scatterplot doesn't?
   - Answer: Often, not that much! But, it CAN make patterns more apparent.

\pause
\vspace{4mm}

 - But wait, so I'm supposed to just look at it and make a judgment call? What about a definitive answer??
   - Yeah... formal tests for heteroskedasticity do exist (e.g., the Breusch-Pagan test; a sketch follows this list), but in practice a visual check of the residual plot is the standard approach.

\pause   
\vspace{4mm}

 - Ok, so what judgment call would you make in this case?
   - Here, it looks like there is not a whole lot of difference in the vertical spread of the points across all values of $x$, so it looks like the condition of homoskedasticity is probably satisfied. 
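
For the curious, here is a minimal sketch of the Breusch-Pagan test mentioned above, assuming the `lmtest` package is installed (this goes beyond what we need here):

```{r, eval=FALSE}
# Breusch-Pagan test: H0 is homoskedasticity (constant error variance),
# so a small p-value is evidence of heteroskedasticity
library(lmtest)
bptest(model1)
```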
   
## Residual Analysis {.t}
Here is another example:
```{r, echo=FALSE, out.width="90%"}
set.seed(10)
x <- runif(100, 0, 10)
y <- 600 + 25 * x + rnorm(100, mean=0, sd=(20+x*3))
dat <- data.frame(x=x, y=y)

ggplot(data=dat, aes(x=x, y=y)) + 
  geom_point()


```

## Residual Analysis {.t}
And here the residual plot makes the problem more obvious:
```{r, echo=FALSE, out.width="90%"}
model_dat <- lm(y ~ x, data=dat)
dat$residuals <- model_dat$residuals
ggplot(data=dat, aes(x=x, y=residuals)) + 
  geom_point() + labs(title="Residual Plot")
```



## Residual Analysis {.t}
\framesubtitle{\textbf{E}qual variance across all values of $X$}

What do you think are the consequences of this condition being violated?

https://pollev.com/chi


## Residual Analysis {.t}
\framesubtitle{\textbf{E}qual variance across all values of $X$}

::: {.t}
### How do we check if the parameter estimates are biased?
\vspace{-2mm}
And what do we even mean by "biased?"

Bias is the expected difference between an estimator and the true parameter value it is trying to estimate. That is, e.g. for $\beta_1$:
$$
\text{bias} = E[\hat{\beta}_1] - \beta_1
$$
:::

\vspace{5mm}

### Wait but we can't ever know what this is right?
\vspace{-2mm}
True. But, we can get simulated estimates under any given scenario.
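
Concretely, if we simulate $B$ datasets from a model where we know $\beta_1$, the Monte Carlo estimate of the bias is:
$$
\widehat{\text{bias}} = \frac{1}{B} \sum_{b=1}^{B} \left(\hat{\beta}_1^{(b)} - \beta_1\right)
$$
where $\hat{\beta}_1^{(b)}$ is the slope estimate from the $b^{th}$ simulated dataset.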

## Residual Analysis {.t}
\framesubtitle{\textbf{E}qual variance across all values of $X$}

::: {.t}
### Your Turn \#1
\vspace{-2mm}
Suppose that we have a true relationship of: 
$$
y_i = 600 + 25x_i + \epsilon_i
$$ 

where $\epsilon_i \sim N(\mu=0, \sigma=5x_i)$. Then,

 - We can simulate data according to this model
 - Get $\hat{\beta}_1$ from the linear regression output
 - Calculate $\hat{\beta}_1 - \beta_1$
 - Do this lots of times to get an estimate of the bias
:::


## Residual Analysis {.t}
\framesubtitle{\textbf{E}qual variance across all values of $X$}

### Your Turn \#1
\vspace{-2mm}

```{r, eval=FALSE}
reps <- 10000
beta_1 <- numeric(reps)       # storage for the slope estimates

for(i in 1:reps){
  y <- ...                    # simulate y from the true model
  dat <- data.frame(x=x, y=y)
  beta_1[i] <- ...            # extract the slope estimate
}

mean(...)                     # estimate the bias

```
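
For reference, here is one way the blanks might be filled in (a sketch, not the only valid approach):

```{r, eval=FALSE}
set.seed(1)                   # any seed works; included for reproducibility
x <- runif(100, 0, 10)
reps <- 10000
beta_1 <- numeric(reps)

for(i in 1:reps){
  # simulate y from the true model, with error sd growing with x
  y <- 600 + 25 * x + rnorm(100, mean=0, sd=5*x)
  dat <- data.frame(x=x, y=y)
  # slope estimate from the fitted regression
  beta_1[i] <- coef(lm(y ~ x, data=dat))[2]
}

# average estimation error approximates the bias
mean(beta_1 - 25)
```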



## Residual Analysis {.t}
\framesubtitle{Now, how do we use them?}

 - The relationship between $X$ and $Y$, if there is one, is actually \underline{\textbf{L}inear}
   - e.g. not quadratic, exponential, etc.

\vspace{2mm}

 - \textbf{I}ndependence of observations

\vspace{2mm}

 - \textbf{N}ormality of $\epsilon_i$
   - Note that this can also be achieved, due to the central limit theorem, with a large sample size even if $\epsilon_i$ does not follow a normal distribution

\vspace{2mm}

 - \textbf{E}qual variance across all values of $X$
   - Also known as homoskedasticity


## Residual Analysis {.t}
\framesubtitle{\textbf{N}ormality of $\epsilon_i$}

```{r, message=FALSE, out.width="70%"}
ggplot(data=march_madness, aes(x=residuals)) + 
  geom_histogram()
```

## Residual Analysis {.t}
\framesubtitle{\textbf{N}ormality of $\epsilon_i$}

### Some Questions
\vspace{-2mm}
 - Again, I'm supposed to just look at it and make a judgment call? What about a definitive answer??
   - For assessing normality, there actually is a statistical test that is often used!
   - Also, there's the Q-Q plot

## Residual Analysis {.t}
\framesubtitle{\textbf{N}ormality of $\epsilon_i$}

```{r, out.width="80%"}
ggplot(march_madness, aes(sample=residuals)) + 
  stat_qq() + stat_qq_line()
```
   
## Residual Analysis {.t}
\framesubtitle{\textbf{N}ormality of $\epsilon_i$}
```{r}
shapiro.test(march_madness$residuals)
```

$H_0$ for this test is that the residuals follow a normal distribution, so a small p-value is evidence of non-normality.


## Residual Analysis {.t}
\framesubtitle{Now, how do we use them?}

 - The relationship between $X$ and $Y$, if there is one, is actually \underline{\textbf{L}inear}
   - e.g. not quadratic, exponential, etc.

\vspace{2mm}

 - \textbf{I}ndependence of observations

\vspace{2mm}

 - \textbf{N}ormality of $\epsilon_i$
   - Note that this can also be achieved, due to the central limit theorem, with a large sample size even if $\epsilon_i$ does not follow a normal distribution

\vspace{2mm}

 - \textbf{E}qual variance across all values of $X$
   - Also known as homoskedasticity


## Residual Analysis {.t}
\framesubtitle{\textbf{I}ndependence of observations}

 - One way to check this is to see if there is any temporal trend in the data. That is, is there any relationship between datapoints over time?

\vspace{4mm}

 - Note that this is not the ONLY way in which the independence condition can be violated. But it is often the only one that is even possible to check. 

\vspace{4mm}

 - The March Madness dataset is in the order in which students signed up to be in the pool...
 
## Residual Analysis {.t}
\framesubtitle{\textbf{I}ndependence of observations}

```{r, out.width="75%"}
march_madness$order <- 1:nrow(march_madness)
ggplot(march_madness, aes(x=order, y=residuals)) + 
  geom_point() + labs(x="order of entry")
```
 
## Residual Analysis {.t}
\framesubtitle{\textbf{I}ndependence of observations}

 - There does not appear to be any discernible pattern in the residuals over time.
 
\vspace{5mm}

 - What might it look like if there \underline{were} a temporal trend?
   - That is, for example, what if those who signed up later tended to have higher scores?
   
   
## Residual Analysis {.t}
\framesubtitle{\textbf{I}ndependence of observations}

```{r, echo=FALSE, out.width="90%"}
set.seed(2)
z <- 1:100
y <- 600 + 25 * x + rnorm(100, 0, 30) + rnorm(100, z*(2/3))
dat <- data.frame(x=x, y=y)

model_nonind <- lm(y ~ x, data=dat)
dat$residuals <- model_nonind$residuals
dat$z <- z

ggplot(data=dat, aes(x=x, y=y)) + 
  geom_point() + labs(x="knowledge", y="score", title="From this plot, you can't tell that anything is wrong...")


```


## Residual Analysis {.t}
\framesubtitle{\textbf{I}ndependence of observations}

```{r, echo=FALSE, out.width="90%"}

ggplot(data=dat, aes(x=z, y=y)) + 
  geom_point() + labs(x="order of entry", y="score", title="Nor can you with this one")

```


## Residual Analysis {.t}
\framesubtitle{\textbf{I}ndependence of observations}

```{r, echo=FALSE, out.width="90%"}

ggplot(data=dat, aes(x=z, y=residuals)) + 
  geom_point() + labs(x="order of entry", title="But with the residual plot you can see the temporal trend")

```


## Residual Analysis {.t}
\framesubtitle{Your Turn \#2}

In Lab 4, you investigated the Type I Error rate in the presence of non-independence. Let's now investigate what happens to statistical power.

::: {.t}
### Simulation study:
\vspace{-2mm}
 - Simulate 100 values of $x \sim Unif(0, 10)$
 - Let $t$ represent the order of data collection, and just go from 1 to 100
 - Generate $y_i = x_i + 0.5t_i + \epsilon_i$ where $\epsilon_i \sim N(\mu=0, \sigma=10)$
 - Run the following linear models:
   - $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
   - $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 t$

 - Check whether $H_0\colon \beta_1=0$ is rejected in each case
 - Do this repeatedly (1,000 times) to get an estimate of power in each scenario
 
:::

Comment briefly on what you observe.
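
A minimal sketch of one possible implementation (assumptions: $\alpha = 0.05$, and $x$ is re-drawn each replication; other choices are equally defensible):

```{r, eval=FALSE}
set.seed(42)                 # any seed works; included for reproducibility
reps <- 1000
n <- 100
t <- 1:n                     # order of data collection
reject_x  <- logical(reps)   # rejections under y ~ x
reject_xt <- logical(reps)   # rejections under y ~ x + t

for(i in 1:reps){
  x <- runif(n, 0, 10)
  y <- x + 0.5 * t + rnorm(n, mean=0, sd=10)
  dat <- data.frame(x=x, y=y, t=t)
  # p-values for H0: beta_1 = 0 in each model
  p_x  <- summary(lm(y ~ x,     data=dat))$coefficients["x", "Pr(>|t|)"]
  p_xt <- summary(lm(y ~ x + t, data=dat))$coefficients["x", "Pr(>|t|)"]
  reject_x[i]  <- p_x < 0.05
  reject_xt[i] <- p_xt < 0.05
}

# estimated power of the test of H0: beta_1 = 0 under each model
mean(reject_x)
mean(reject_xt)
```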


## Residual Analysis {.t}
\framesubtitle{Now, how do we use them?}

 - The relationship between $X$ and $Y$, if there is one, is actually \underline{\textbf{L}inear}
   - e.g. not quadratic, exponential, etc.

\vspace{2mm}

 - \textbf{I}ndependence of observations

\vspace{2mm}

 - \textbf{N}ormality of $\epsilon_i$
   - Note that this can also be achieved, due to the central limit theorem, with a large sample size even if $\epsilon_i$ does not follow a normal distribution

\vspace{2mm}

 - \textbf{E}qual variance across all values of $X$
   - Also known as homoskedasticity


## Residual Analysis {.t}
\framesubtitle{The relationship between $X$ and $Y$, if there is one, is actually \underline{\textbf{L}inear}}

```{r, echo=FALSE, warning=FALSE, message=FALSE}
theme_update(text = element_text(size = 20))
ggplot(data = march_madness, aes(y = bracket_score, x=knowledge)) +
  geom_point() + geom_smooth(method='lm', se=FALSE) +
  xlab("knowledge score") + ylab("bracket score") + 
  ggtitle("It looks fairly reasonable...")
```


## Residual Analysis {.t}
\framesubtitle{The relationship between $X$ and $Y$, if there is one, is actually \underline{\textbf{L}inear}}

```{r, out.width="75%"}
march_madness$fitted <- model1$fitted.values
ggplot(data=march_madness, aes(x=fitted, y=residuals)) + 
  geom_point() + labs(x="fitted values")
```

## Residual Analysis {.t}
\framesubtitle{The relationship between $X$ and $Y$, if there is one, is actually \underline{\textbf{L}inear}}

 - There does not appear to be much of a pattern here

\vspace{4mm}

 - What might it look like if the relationship was NOT linear?
   - For example, a quadratic relationship
   
## Residual Analysis {.t}
\framesubtitle{The relationship between $X$ and $Y$, if there is one, is actually \underline{\textbf{L}inear}}

```{r, echo=FALSE, out.width="90%"}
set.seed(1)
y <- 600 + 4 * x^2 + rnorm(100, 0, 45)
dat <- data.frame(x=x, y=y)

model_nonind <- lm(y ~ x, data=dat)
dat$residuals <- model_nonind$residuals
dat$fitted <- model_nonind$fitted.values

ggplot(data=dat, aes(x=x, y=y)) + 
  geom_point() + labs(x="knowledge", y="score", title="From this plot, it is not that obvious that anything is wrong...")


```
   
   
## Residual Analysis {.t}
\framesubtitle{The relationship between $X$ and $Y$, if there is one, is actually \underline{\textbf{L}inear}}

```{r, echo=FALSE, out.width="90%"}

ggplot(data=dat, aes(x=fitted, y=residuals)) + 
  geom_point() + labs(x="fitted value", title="The residual plot makes it a bit more obvious")

```


## Recap {.t}
\framesubtitle{A complete residual analysis would consist of all of the following:}

 - The relationship between $X$ and $Y$, if there is one, is actually \underline{\textbf{L}inear}
   - **Diagnostic: residuals vs. fitted values**

\vspace{1mm}

 - \textbf{I}ndependence of observations
   - **Diagnostic: residuals vs. order of entry into study**
   - Reminder: this is only ONE possible dependence structure!!

\vspace{1mm}

 - \textbf{N}ormality of $\epsilon_i$
   - **Diagnostics:**
     - **Histogram of residuals**
     - **Q-Q plot**
     - **Shapiro-Wilk test**
   - (do all three of these!)

\vspace{1mm}

 - \textbf{E}qual variance across all values of $X$
   - **Diagnostic: residuals vs. x values**
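
As an aside, base R can produce several of these diagnostics at once via the `plot()` method for fitted `lm` objects (a convenient shortcut, not a substitute for understanding each plot):

```{r, eval=FALSE}
# residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
par(mfrow=c(2, 2))
plot(model1)
```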


## Recap {.t}

### Today's Daily Check
\vspace{-2mm}
Both of the Your Turns.