```{r setup, include=FALSE}
  knitr::opts_chunk$set(echo = TRUE)
  knitr::opts_chunk$set(dev = 'pdf')
  def.chunk.hook  <- knitr::knit_hooks$get("chunk")
  knitr::knit_hooks$set(chunk = function(x, options) {
    x <- def.chunk.hook(x, options)
    ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
  })
```

## What is Simple Linear Regression?

::: {.block}
### In DSC 40A...
\vspace{-2mm}
You likely saw a simple linear regression model written as 

\vspace{-4mm}
$$
H^*(x) = w_0^* + w_1^*x
$$


Statisticians tend to write it as: 

\vspace{-4mm}
$$
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x
$$

:::

\vspace{3mm}

::: {.block}
### Typically, when we think of simple linear regression, we think of:

\vspace{-2mm}
 - A quantitative outcome variable
 - A quantitative predictor variable (or primary covariate)

:::

The "simple" refers to the fact that there is only ONE predictor variable (if there are more, then we are doing multiple linear regression, which we will get to later). 

Then, with two quantitative variables we can plot a scatterplot...

## Example: March Madness knowledge vs. bracket scores
```{r, echo=FALSE, warning=FALSE, message=FALSE}
library(readr)
library(ggplot2)
march_madness <- read_csv("march_madness.csv")

knowledge <- march_madness$knowledge
bracket_score <- march_madness$bracket_score

theme_update(text = element_text(size = 20))
ggplot(data = march_madness, aes(y = bracket_score, x=knowledge)) +
  geom_point() +
  xlab("knowledge score") + ylab("bracket score") + 
  ggtitle("Students from Villanova University in Spring 2019")
```

## Example: March Madness knowledge vs. bracket scores {.t}
\framesubtitle{The data}

1. Students from a class I was teaching in Spring 2019 at Villanova University were asked to fill out March Madness brackets for the NCAA Men's College Basketball Championship (n=50). 

\pause
\vspace{1mm}

2. They also filled out a questionnaire to assess their "basketball knowledge." The questions were things like: 
   - How many men's college basketball games did you watch this season?
   - On how many teams in the tournament can you name at least one player? Two players? The head coach?
   - When watching a basketball game, how often are you able to identify any particular strategies that are being used?

\pause
\vspace{1mm}
   
3. Their responses to the questions were summarized in a "knowledge score" from 0 to 10. 

\pause
\vspace{1mm}

4. Brackets were scored according to default ESPN settings (10 pts for Round 1 picks, 20 pts for Round 2 picks, etc).

## Example: March Madness knowledge vs. bracket scores {.t}
So,

 - $x_i$ is the knowledge score of the $i^{th}$ participant
 - $y_i$ is the bracket score of the $i^{th}$ participant

\vspace{2mm}
```{r, echo=FALSE, warning=FALSE, message=FALSE, out.width="75%", fig.align='center'}
theme_update(text = element_text(size = 23))
ggplot(data = march_madness, aes(y = bracket_score, x=knowledge)) +
  geom_point() +
  xlab("knowledge score") + ylab("bracket score") + 
  ggtitle("Students from Villanova University in Spring 2019")
```


## Least Squares Solutions {.t}
Then, if we want to model their relationship...

::: {.block}
### In DSC 40A you derived the least squares solutions:

\vspace{-3mm}
$$
\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^n (x_i - \overline{x})^2}, \qquad \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}
$$

(how the heck did we get these again?)
:::

\pause 

\center
Quote of the Day:

![](willard.png){width=30%}


## Least Squares Solutions {.t}
Then, if we want to model their relationship...

::: {.block}
### In DSC 40A you derived the least squares solutions:

\vspace{-3mm}
$$
\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^n (x_i - \overline{x})^2}, \qquad \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}
$$

(how the heck did we get these again?)
:::


In this particular case these would give:
```{r, size="scriptsize"}
beta1 <- sum((knowledge-mean(knowledge))*(bracket_score-mean(bracket_score))) / 
  sum((knowledge-mean(knowledge))^2)
beta1
```

\vspace{-3mm}

and

```{r, size="scriptsize"}
beta0 <- mean(bracket_score) - beta1*mean(knowledge)
beta0
```


## Least Squares Regression Line {.t}

We can of course then draw the line $\hat{y} = `r round(beta0, 2)` + `r round(beta1, 2)`x$ on the scatterplot: 

```{r, echo=FALSE, out.width="80%", warning=FALSE, message=FALSE}
theme_update(text = element_text(size = 23))
ggplot(data = march_madness, aes(y = bracket_score, x=knowledge)) +
  geom_point() + geom_smooth(method='lm', se=FALSE) +
  xlab("knowledge score") + ylab("bracket score") + 
  ggtitle("Students from Villanova University in Spring 2019")
```



## Least Squares Regression Line
The blue line drawn on the scatterplot is the best-fitting line to the data, in the sense that the quantity
$$
RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$
is the smallest it could possibly be among all lines we could have chosen as the model.

### Some notes:

 - RSS = residual sum of squares
 - In DSC 40A, you may have seen this as:
 
 $$
 R_{sq}(w_0, w_1) = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_i))^2
 $$
 
But it's ultimately the same thing: the $\frac{1}{n}$ factor doesn't change where the minimum occurs, so minimizing either quantity gives the same line. Either way, the point is that the line we get by minimizing this quantity is the "best" line that could be fit to these data (according to squared error loss). 
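
To see this concretely, here is a small sketch on simulated toy data (not the class dataset): the least squares line attains a strictly smaller RSS than any other line, such as one with a slightly perturbed slope.

```{r, size="scriptsize"}
# Toy illustration with simulated data:
set.seed(1)
x <- runif(30, 0, 10)
y <- 5 + 2 * x + rnorm(30, 0, 3)
fit <- lm(y ~ x)
rss_best <- sum(resid(fit)^2)   # RSS of the least squares line
# any other line (here, the slope nudged by 0.5) has a larger RSS:
rss_other <- sum((y - (coef(fit)[1] + (coef(fit)[2] + 0.5) * x))^2)
rss_best < rss_other            # TRUE
```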



## Sidenote: notation {.t}

 - $y$ vs. $\hat{y}$ 
   - and vs. $H(x)$ from 40A
   
\vspace{8mm}
 - $\beta_i$ vs. $\hat{\beta}_i$ 
   - and vs. $w_i$ and $w_i^*$ from 40A
   
\vspace{8mm}

 - What is a "parameter"?
   - ML definition vs. statistical definition
     - In ML, the word "parameter" is often used to refer to any input value for a model. 
     - This is NOT how we use the word parameter in statistics!

## Sidenote: notation {.t}
\framesubtitle{From DSC 40A, Fall 2024 (and probably most other quarters)}

![](40A_parameters.png)

This usage of the word "parameter" is ok for ML, but is *incorrect* in a statistical context! Why?


## Sidenote: notation {.t}

::: {.block}
### Statistical definition of parameter: 
\vspace{-2mm}
A population quantity that is typically unknown, but is what we are trying to estimate from any given model. 

\vspace{2mm}

This is in contrast to a \underline{statistic}, which is an \underline{estimate} of a parameter that we can calculate from any given dataset.
:::

\vspace{5mm}
::: {.block}
### Examples: 
\vspace{-2mm}

 - $\overline{x}$ is a statistic that estimates the parameter $\mu$
 - $s^2$ is a statistic that estimates the parameter $\sigma^2$
 - $\hat{\beta}_i$ is a statistic that estimates the parameter $\beta_i$
:::


Incidentally, this is how the word "parameter" is also used in DSC 10, and in Justin's notes for DSC 140A. 
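
As a tiny sketch of the first example above, with simulated data where (unlike in practice) we know the parameter $\mu$:

```{r, size="scriptsize"}
set.seed(1)
mu <- 7; sigma <- 2   # population parameters -- unknown in practice
# each sample of n = 100 gives a statistic xbar that estimates mu:
xbars <- replicate(1000, mean(rnorm(100, mean = mu, sd = sigma)))
mean(xbars)           # the estimates center on the parameter
```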

## Sidenote: notation {.t}
Why does this distinction matter??

::: {.block}
### One reason: Hypothesis testing
\vspace{-2mm}
In DSC 40A and probably several other classes, you learned a lot about how to do modeling with regression, but likely saw little to no \underline{statistical inference} for regression. That is, something like:

$$
\begin{aligned}
H_0\colon &\beta_1 = 0 \\
H_A\colon &\beta_1 \neq 0
\end{aligned}
$$

:::

Note carefully that $\beta_1$ does not have a hat on it. Again, why does this matter?

\vspace{3mm}

And what are we even trying to do here and how is it different from modeling?


## Statistical Inference for Regression {.t}

::: {.block}
### Model Building
\vspace{-2mm}
When doing modeling (such as in a machine learning context), the goal is to find the "best" model according to some criteria ($R^2$, RMSE, etc). Then we might use that model to do prediction.

\vspace{3mm}
Example: if I know a March Madness participant's knowledge score, what is their predicted bracket score?

:::

\vspace{5mm}

::: {.block}
### Statistical Inference
\vspace{-2mm}
In contrast, when we are doing statistical inference, we have a specific question about the population, and we want to answer it with our sample. 

\vspace{3mm}
Example: is there an association between a March Madness participant's basketball knowledge and their bracket score?
:::

These are two different questions!

## Statistical Inference for Regression {.t}
If there is an association between a participant's basketball knowledge and their bracket score, then from our simple linear model, this would mean that the slope of the regression line is not equal to 0. 

\vspace{3mm}
(Note that this specifically investigates a \underline{linear} association).

\vspace{3mm}
The slope of the regression line is $\beta_1$. Since the question is about the population, there is no hat:

$$
\begin{aligned}
H_0\colon &\beta_1 = 0 \\
H_A\colon &\beta_1 \neq 0
\end{aligned}
$$

The one with the hat, $\hat{\beta}_1$, refers to the actual value that is calculated from the data. We then use that to get our p-value! But how?

## Statistical Inference for Regression {.t}
```{r, size="footnotesize"}
model1 <- lm(bracket_score ~ knowledge, data=march_madness)
summary(model1)
```


## Statistical Inference for Regression {.t}
Sidenote: notice that we get the same values from this output as we did from the previous manual calculation:

\vspace{3mm}


### $\hat{\beta}_1$:
```{r, size="scriptsize"}
beta1 <- sum((knowledge-mean(knowledge))*(bracket_score-mean(bracket_score))) / 
  sum((knowledge-mean(knowledge))^2)
beta1
summary(model1)$coefficients[2,1]
```



## Statistical Inference for Regression {.t}
Sidenote: notice that we get the same values from this output as we did from the previous manual calculation:

\vspace{3mm}

### $\hat{\beta}_0$:
```{r, size="scriptsize"}
beta0 <- mean(bracket_score) - beta1*mean(knowledge)
beta0
summary(model1)$coefficients[1,1]
```


## Statistical Inference for Regression {.t}
From the R output, the p-value for $H_0\colon \beta_1 = 0$ is:

```{r}
summary(model1)$coefficients[2,4]
```

and the heading of `Pr(>|t|)` in the output table suggests that this p-value comes from a t-Test. But why/how?
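
One way to check the connection numerically (a sketch on simulated data, since the calculation is the same for any dataset): reproduce the t statistic and p-value by hand from the estimate and its standard error.

```{r, size="scriptsize"}
set.seed(1)
x <- runif(50, 0, 10)
y <- 623 + 20 * x + rnorm(50, 0, 150)
m <- lm(y ~ x)
est <- summary(m)$coefficients[2, 1]   # beta1 hat
se  <- summary(m)$coefficients[2, 2]   # SE(beta1 hat)
t_s <- est / se
p   <- 2 * pt(-abs(t_s), df = m$df.residual)
c(t_s, p)   # match the "t value" and "Pr(>|t|)" columns of summary(m)
```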

## Statistical Inference for Regression {.t}

::: {.block}
### Recall: the one-sample t-Test statistic
\vspace{-3mm}
$$
t_s = \frac{\overline{x} - \mu_0}{s / \sqrt{n}}
$$
:::

\vspace{4mm}

::: {.block}
### Now, here is the test statistic for $H_0\colon \beta_1 = 0$:
\vspace{-3mm}
$$
\frac{\hat{\beta}_1}{SE(\hat{\beta}_1)}
$$
:::

 - And recall that for $t_s$ to have a t-Distribution, we relied on $\overline{x}$ having a normal distribution. 
 - So, it follows that here we need $\hat{\beta}_1$ to have a normal distribution! 


Does it??

## Statistical Inference for Regression {.t}
\framesubtitle{Normality of $\hat{\beta}_1$:}
Yes it does, if we assume normally distributed residuals in our model:

$$
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
$$
where $\epsilon_i \sim N(0, \sigma^2)$. What exactly does this mean?

\vspace{3mm}

::: {.block}
### Normally distributed residuals
\vspace{-2mm}
The assumption of normally distributed residuals means that each $y_i$ value deviates from the model by an amount that follows a normal distribution, centered at 0. 
:::

It follows somewhat intuitively that if $\epsilon_i$ follows a normal distribution then $\hat{\beta}_1$ also will, but a rigorous proof of that is beyond the scope of this course. 
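
Though the proof is out of scope, a quick simulation sketch makes the claim plausible: repeatedly generate data from a model like the one above (here, true $\beta_1 = 20$ and $\sigma = 150$) and refit; the resulting pile of $\hat{\beta}_1$ values is roughly normal and centered at the true $\beta_1$.

```{r, size="scriptsize"}
set.seed(1)
beta1_hats <- replicate(2000, {
  x <- runif(50, 0, 10)
  y <- 623 + 20 * x + rnorm(50, 0, 150)   # true beta1 = 20
  coef(lm(y ~ x))[2]
})
c(mean(beta1_hats), sd(beta1_hats))       # centered near 20
```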

## Conditions for Validity {.t}
These are the required conditions for statistical inference in the context of a linear model to be valid:


 - The relationship between $X$ and $Y$, if there is one, is actually \underline{linear}
   - e.g. not quadratic, exponential, etc.

\vspace{2mm}

 - Independence of observations

\vspace{2mm}

 - Normality of $\hat{\beta}_1$
   - Note that this can also be achieved, due to the central limit theorem, with a large sample size even if $\epsilon_i$ does not follow a normal distribution

\vspace{2mm}

 - Equal variance across all values of $X$
   - Also known as homoskedasticity


## Conditions for Validity {.t}
These are the required conditions for statistical inference in the context of a linear model to be valid:


 - The relationship between $X$ and $Y$, if there is one, is actually \underline{\textbf{L}inear}
   - e.g. not quadratic, exponential, etc.

\vspace{2mm}

 - \textbf{I}ndependence of observations

\vspace{2mm}

 - \textbf{N}ormality of $\hat{\beta}_1$
   - Note that this can also be achieved, due to the central limit theorem, with a large sample size even if $\epsilon_i$ does not follow a normal distribution

\vspace{2mm}

 - \textbf{E}qual variance across all values of $X$
   - Also known as homoskedasticity


## Conditions for Validity {.t}
As we have said, "validity" of a statistical test means that its Type I Error rate will be equal to its nominal significance level of $\alpha$ (typically 0.05).

You will investigate Type I Error rates in the presence of violations of these conditions in Lab 4.

\vspace{4mm}

::: {.t}
### In class today, we will instead investigate statistical power
\vspace{-2mm}
Recall: 

 - Effect Size
 - Variance
 - Sample Size
 - Significance Level

:::

First: what is an effect size in the context of a linear model?

## Statistical Power {.t}
\framesubtitle{The Effect Size}

```{r, echo=FALSE, warning=FALSE, message=FALSE}
theme_update(text = element_text(size = 23))
ggplot(data = march_madness, aes(y = bracket_score, x=knowledge)) +
  geom_point() + geom_smooth(method='lm', se=FALSE) +
  xlab("knowledge score") + ylab("bracket score") + 
  ggtitle("Students from Villanova University in Spring 2019")
```

## Statistical Power {.t}
\framesubtitle{The Effect Size}

 - $\hat{\beta}_0 \approx `r round(model1$coefficients[1], 4)`$; what is its interpretation?

\vspace{15mm}
 - $\hat{\beta}_1 \approx `r round(model1$coefficients[2], 4)`$; what is its interpretation?


## Statistical Power {.t}
We will take a simulation approach to power estimation in the linear model setting, as we are getting into a realm where ready-made routines are either:

 - difficult to find
 - difficult to use or to understand what they do
 
So, what do we simulate?


### Effect size and variance
\vspace{-2mm}
As a starting point, consider:

 - a linear increase of 20 bracket points
 - per 1-point increase in knowledge score
 
What about the variance?


## Statistical Power {.t}
The $X$ values (knowledge score) can be randomly or deterministically generated; it doesn't make much of a difference as long as there is decent coverage along the range of interest:
```{r}
march_madness$knowledge <- runif(50, min=0, max=10)
```

Then, the $Y$ values are generated as a function of the $X$ values:

```{r}
march_madness$bracket_score <- 623 + march_madness$knowledge * 20
```

 - The intercept of 623 does not matter at all; we could completely omit it if we wanted to
 - We're not done yet; we need to add some noise. But first, let's look at the simulated data we just generated:
 
## Statistical Power {.t}
```{r, echo=FALSE, warning=FALSE, message=FALSE}
theme_update(text = element_text(size = 23))
ggplot(data = march_madness, aes(y = bracket_score, x=knowledge)) +
  geom_point() + geom_smooth(method='lm', se=FALSE) +
  xlab("knowledge score") + ylab("bracket score") + 
  ggtitle("Students from Villanova University in Spring 2019")
```

## Statistical Power {.t}
Now, how much noise should we add?

We can get an idea of how much variability is present in the system by looking at the graph of our actual data (note that this is getting dangerously close to the territory of "post-hoc" power calculations, but alas...)

\pause

From inspection of the graph, it looks like $\sigma=150$ might be reasonable (based on 68\% of observations falling within 1 standard deviation)
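
An alternative to eyeballing the graph is the residual standard error of a fitted model, which estimates $\sigma$ directly. A sketch on simulated data (with the real data, `sigma(model1)` would give the same quantity, reported as "Residual standard error" in `summary(model1)`):

```{r, size="scriptsize"}
set.seed(1)
x <- runif(50, 0, 10)
y <- 623 + 20 * x + rnorm(50, 0, 150)   # true sigma = 150
sigma(lm(y ~ x))   # residual standard error: an estimate of sigma
```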

## Statistical Power {.t}
```{r, size="scriptsize"}
march_madness$bracket_score <- 623 + march_madness$knowledge * 20 + 
  rnorm(50, 0, 150)
```

```{r, echo=FALSE, warning=FALSE, message=FALSE, out.width="85%"}
theme_update(text = element_text(size = 23))
ggplot(data = march_madness, aes(y = bracket_score, x=knowledge)) +
  geom_point() + geom_smooth(method='lm', se=FALSE) +
  xlab("knowledge score") + ylab("bracket score") + 
  ggtitle("Simulated data")
```

## Statistical Power {.t}
Now, how do we estimate power?

::: {.t}
### Reminder: what is statistical power?
\vspace{-2mm}
Statistical power is the probability of correctly rejecting $H_0$.

:::

So, here if $\beta_1=20$, then we \underline{should} reject $H_0$.

## Statistical Power {.t}
\framesubtitle{Your Turn}
Now, how do we estimate power?

 - Simulate samples of size $n=50$ under $\beta_1=20, \epsilon \sim N(0, \sigma=150)$.
 
\vspace{4mm}

 - Run the statistical test for $H_0\colon \beta_1=0$ on these data
 
\vspace{4mm}

 - Determine whether $p<0.05$
 
\vspace{4mm}

 - Repeat many times, count the proportion of the time that $H_0$ is rejected
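
For reference, the steps above can be sketched as follows (one possible implementation, not necessarily the intended solution):

```{r, size="scriptsize"}
set.seed(1)
rejections <- replicate(1000, {
  x <- runif(50, 0, 10)
  y <- 623 + 20 * x + rnorm(50, 0, 150)        # beta1 = 20, sigma = 150
  summary(lm(y ~ x))$coefficients[2, 4] < 0.05 # p-value for H0: beta1 = 0
})
mean(rejections)   # estimated power
```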
 
## Recap and Looking Ahead

::: {.t}
### Today's Daily Check
\vspace{-2mm}
Just the Your Turn on the previous slide
:::

\vspace{3mm}

:::{.t}
### Recap
\vspace{-2mm}
 - Simple Linear Regression refers to the scenario in which we have one quantitative outcome variable and one quantitative predictor variable
 - Statistical inference in this setting asks questions about $\beta_1$, the slope
 - Conditions for validity are summarized by LINE
:::

\vspace{3mm}

### Looking Ahead
\vspace{-2mm}
Multiple Linear Regression!

## Quiz Review