```{r setup, include=FALSE}
  knitr::opts_chunk$set(echo = TRUE)
  knitr::opts_chunk$set(dev = 'pdf')
  def.chunk.hook  <- knitr::knit_hooks$get("chunk")
  knitr::knit_hooks$set(chunk = function(x, options) {
    x <- def.chunk.hook(x, options)
    ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
  })
```

## Last time {.t}

::: {.t}
### Example 1: Smoking and FEV
\vspace{-2mm}
 - The primary covariate of interest was binary (smoking status)
 - The other covariate (the potential confounder) was quantitative (age)
:::

\vspace{3mm}

::: {.t}
### Example 2: Penguin Beak Dimensions
\vspace{-2mm}
 - The primary covariate of interest was quantitative (beak length)
 - The other covariate (the potential confounder) was categorical (species)
:::

   
In neither case did we have to worry about how to do inference (that is, get a p-value) for a categorical variable with more than 2 categories.


## Example: Mental Health Dogs {.t} 
\vspace{-1mm}
![](caa.png)

## Example: Mental Health Dogs {.t} 

::: {.t}
### Study setup
\vspace{-2mm}
There are three treatment groups:

 - Control group
   - Go into treatment room, but there is no dog (only the handler)

\vspace{4mm}

 - Indirect contact with therapy dog
   - Go into the treatment room, and there is a dog and its handler (but cannot pet the dog)

\vspace{4mm}

 - Direct contact with therapy dog
   - Go into the treatment room, and there is a dog and its handler, can pet the dog
:::

The outcome of interest is the post-pre difference in a loneliness score (negative values indicate that the person's
loneliness went DOWN after the treatment, which is what we want!).

## Example: Mental Health Dogs {.t} 
```{r, warning=FALSE, echo=FALSE, message=FALSE, cache=TRUE}
library(ggplot2)
library(dplyr)
library(readr)   # provides read_csv()

# import the cleaned data
lonely <- read_csv("dog_data_lonely.csv")

ggplot(data=lonely, mapping=aes(x=Group, y=diff)) + 
  geom_boxplot()
```


## Example: Mental Health Dogs {.t} 
\label{hyp}
\framesubtitle{Now, how do we analyze these data statistically?}

::: {.t}
### We have:
\vspace{-2mm}

 - A quantitative outcome variable (loneliness score)
 - A categorical primary covariate (treatment group)
:::

What's our null hypothesis?? And our alternative hypothesis??

## Example: Mental Health Dogs {.t} 
\label{dummy}
So, we can still run a linear model:

```{r, size="scriptsize"}
groups_model <- lm(diff ~ Group, data=lonely)
summary(groups_model)
```

## Example: Mental Health Dogs {.t} 
But wait, what model did this just fit, and what p-value(s) do we care about?

$$
\widehat{\texttt{diff}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \texttt{Direct} + \hat{\beta}_2 \cdot \texttt{Indirect}
$$

where

 - `Direct` is equal to:
   - 1 if the person was in the Direct treatment group
   - 0 if they were not
   
 - `Indirect` is equal to:
   - 1 if the person was in the Indirect treatment group
   - 0 if they were not
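
Under the hood, `lm` builds these 0/1 columns for us: when a factor goes into the formula, R expands it into an intercept plus one dummy column per non-reference level. A minimal sketch on a made-up factor (not the dog data):

```{r, size="scriptsize"}
# R drops the first (reference) level and makes a dummy for each other level
grp <- factor(c("Control", "Direct", "Indirect", "Control"))
model.matrix(~ grp)
```

The columns here are named `grpDirect` and `grpIndirect`; in `groups_model` they would appear as `GroupDirect` and `GroupIndirect`, with `Control` as the reference level.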
   

   
## Example: Mental Health Dogs {.t} 
\label{dc1}

### Daily Check Question \#1 (in an R Markdown document)
\vspace{-2mm}
According to the model output on Slide \ref{dummy}, what are the estimated means of each treatment group?

\vspace{3mm}

First answer it at pollev.com, and then put it in your Rmd file once you know that you are right. 


## Example: Mental Health Dogs {.t} 
Now, as tempting as it might be to think otherwise, we do not ONLY care about the impact of being in the `Direct` group.

Recall what our null and alternative hypotheses were on Slide \ref{hyp}. What do those translate to in terms of the $\beta$s?



## Example: Mental Health Dogs {.t} 
```{r, size="footnotesize"}
null_model <- lm(diff ~ 1, data=lonely)
anova(null_model, groups_model)
```

How does this work?

## Example: Mental Health Dogs {.t} 

::: {.t}
### $RSS$ has the same definition as it did in Lecture 7:

$$
RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$
:::

(which, again, may have been called $R_{sq}(w_0, w_1)$ in DSC 40A)

The `null_model` from the previous slide, in this case, is the model with only the intercept: $\hat{y} = \hat{\beta}_0$.

\vspace{5mm}

...what exactly does this model say $\hat{y}$ is?

## Example: Mental Health Dogs {.t} 
\label{yt1}
\framesubtitle{Your Turn \#1 (for Daily Check)}

In an R Markdown file,

 - Load in the data, `dog_data_lonely.csv`.
 
\vspace{2mm}

 - Manually code a calculation of RSS for the null model, based on what we said $\hat{y}$ is from the previous slide

\vspace{2mm}

 - Then, run the `groups_model` from Slide \ref{dummy} and store the regression output
   - Use the coefficient estimates from the model to manually calculate RSS for this model
   
\vspace{2mm}

 - Comment briefly on what you observe, and how it compares to the values in the `anova` function output
 
## Example: Mental Health Dogs {.t} 
\label{dc2}

### Now, where exactly does the p-value come from?
\vspace{-2mm}
If $H_0$ is actually true, what do we expect to be the case regarding $RSS_{full}$ vs. $RSS_{null}$?

\vspace{2mm}
pollev.com

### Daily Check Question
\vspace{-2mm}
Explain each of these answers. 

## Example: Mental Health Dogs {.t} 
More space for notes if needed


## Example: Mental Health Dogs {.t} 
```{r, size="footnotesize", echo=FALSE}
anova(null_model, groups_model)
```

\pause

::: {.t}
### The test statistic for this output is:
\vspace{-4mm}
$$
\mathcal{F} = \frac{(RSS_{null} - RSS_{full})/p}{RSS_{full} / (n-k)}
$$
:::

where
\vspace{-3mm}

 - $n$ is the total sample size
 - $k$ is the total number of coefficients in the full model
 - $p$ is the difference in the number of coefficients between the two models
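
As a sketch, with made-up RSS values but the degrees of freedom from this example ($k = 3$ coefficients in the full model, $p = 2$, and $n - k = 281$), the statistic and its p-value can be computed directly:

```{r, size="scriptsize"}
# RSS values here are made up purely for illustration
RSS_null <- 120
RSS_full <- 100
n <- 284; k <- 3; p <- 2   # three coefficients in full, one in null

F_stat <- ((RSS_null - RSS_full) / p) / (RSS_full / (n - k))
F_stat                                                  # 28.1
pf(F_stat, df1 = p, df2 = n - k, lower.tail = FALSE)    # upper-tail area = p-value
```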
 
## Example: Mental Health Dogs {.t} 
```{r, size="footnotesize", echo=FALSE}
anova(null_model, groups_model)

RSS_full <- sum((lonely$diff - (groups_model$coefficients[1] + 
  groups_model$coefficients[2] * (lonely$Group == "Direct") + 
  groups_model$coefficients[3] * (lonely$Group == "Indirect")))^2)

RSS_null <- sum((lonely$diff - mean(lonely$diff))^2)

```

::: {.t}
### Now take the RSS values that we calculated in Your Turn \#1:
\vspace{-2mm}
Do the calculation of the F statistic with R code, and add it to the end of Your Turn \#1.

:::

\vspace{4mm}


This test is called a partial-$\mathcal{F}$ test, and its null distribution is the $\mathcal{F}$ distribution...

## $\mathcal{F}$ distribution

```{r, echo=FALSE}
f <- seq(0.01, 20, by=0.01)
dens <- df(f, df1=2, df2=281)

plot(dens ~ f, type='l', ylab="density", main="Null distribution for Partial F test")
```

\vspace{-4mm}
Notice that the observed statistic, `r round(anova(null_model, groups_model)[2,5], 3)`, is quite extreme...

## Example: Mental Health Dogs {.t} 
So, we get a tiny p-value and we reject $H_0$.

::: {.t}
### Next Question: adding covariates
\vspace{-2mm}

What if we want to investigate the impact of the treatment group, while adjusting for age?

:::

That is, we just fit this model:

$$
\widehat{\texttt{diff}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \texttt{Direct} + \hat{\beta}_2 \cdot \texttt{Indirect}
$$

but now we want to fit this model:

$$
\widehat{\texttt{diff}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \texttt{Direct} + \hat{\beta}_2 \cdot \texttt{Indirect} + \hat{\beta}_3 \cdot \texttt{age}
$$

Fitting the model is easy...

## Example: Mental Health Dogs {.t} 

```{r, size="scriptsize"}
groups_age_model <- lm(diff ~ Group + Age_Yrs, data=lonely)
summary(groups_age_model)
```

## Example: Mental Health Dogs {.t} 
But now, again, how do we get the p-value that we actually care about?

\vspace{3mm}

::: {.t}
### Partial-$\mathcal{F}$ test again
\vspace{-2mm}
We want to test the entire categorical variable, meaning that like before, 

$$
H_0 \colon \beta_1 = \beta_2 = 0
$$

:::

If this is our null hypothesis, then what is our null model?

## Example: Mental Health Dogs {.t} 
\label{yt2}

\framesubtitle{Your Turn \#2}

 - Run the null model as described on the previous slide (and store it as something)
 
\vspace{3mm}

 - Run the partial-$\mathcal{F}$ test using the `anova` function as we saw previously
 
\vspace{3mm}

 - Report your p-value and conclusions, at $\alpha = 0.05$.
 
## Recap and Looking Ahead 

::: {.t}
### Recap
\vspace{-2mm}

 - If our primary covariate of interest is categorical, then the p-value of interest must be obtained via a partial-$\mathcal{F}$ test
   - This is because the null hypothesis is that all group means are equal to each other
   
\vspace{5mm}

 - The null model for the partial-$\mathcal{F}$ test is the full model with that categorical variable removed (everything else from the full model stays in)
:::
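
The whole recap can be sketched end-to-end on simulated data (all names and numbers here are invented for illustration):

```{r, size="scriptsize"}
set.seed(1)
n <- 90
grp <- factor(rep(c("Control", "Direct", "Indirect"), each = n / 3))
age <- runif(n, 18, 25)
y   <- 0.1 * age + ifelse(grp == "Direct", -1, 0) + rnorm(n)

full    <- lm(y ~ grp + age)   # full model
reduced <- lm(y ~ age)         # null model: drop ONLY the categorical variable
anova(reduced, full)           # partial-F test for the group effect
```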


 
## Recap and Looking Ahead {.t} 

::: {.t}
### Summary of today's Daily Check
\vspace{-2mm}

 - Answer to the question on Slide \ref{dc1}
 
 - Your Turn \#1 on Slide \ref{yt1}
 
 - Answer to the question on Slide \ref{dc2}
 
 - Your Turn \#2 on Slide \ref{yt2}

:::

\vspace{6mm}

### Looking Ahead
\vspace{-2mm}
Model Diagnostics!