```{r setup, include=FALSE}
  knitr::opts_chunk$set(echo = TRUE)
  knitr::opts_chunk$set(dev = 'pdf')
  def.chunk.hook  <- knitr::knit_hooks$get("chunk")
  knitr::knit_hooks$set(chunk = function(x, options) {
    x <- def.chunk.hook(x, options)
    ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
  })
```

## Who am I?


::: {.block}
### To summarize:
\vspace{-2mm}

 - My training and experience are heavily on the \underline{statistical} side of data science

\vspace{4mm}

 - I have many years of college-level teaching experience, specifically in statistics courses. But,
   - Prior to UCSB (in 2024), my class sizes were always much smaller (30ish max, and often single digits)
   - This specific class, DSC 152, is brand new, both to UCSD and to me!
:::

\vspace{3mm}

P.S. Call me Peter!


## Who are the rest of the course staff?


## Who are you? {.t}
With any internet-capable device (including laptop, phone, tablet), please go to:

::: {.block} 
### Poll Everywhere
\vspace{-2mm}
https://pollev.com/chi
:::
		
Hit "Skip for now" if it asks you to register for credit (I will not be tracking responses for credit).

\vspace{3mm}

However, please do enter your name as your screen name if you are comfortable doing so (I will be the only one who will see it).


## What is this course about? {.t}

At this point of your academic careers, you have taken:

::: {.block} 
### DSC 10: Principles of Data Science
\vspace{-2mm}
 - introductory hypothesis testing
 - permutation testing
 - bootstrapping
:::

::: {.block} 
### DSC 20: Programming and Data Structures
\vspace{-2mm}
 - lots of coding, but no stats here
:::

::: {.block} 
### DSC 30: Data Structures
\vspace{-2mm}
 - lots of coding, but no stats here
:::


::: {.block} 
### DSC 40A: Theoretical Foundations of Data Science
\vspace{-2mm}
 - regression!
:::

## What is this course about? {.t}

At this point of your academic careers, you have taken:   
   

::: {.block} 
### DSC 80: Practice and Application of Data Science
\vspace{-2mm}
 - messy/missing data
 - more hypothesis testing
   - again with permutation testing and bootstrap (like in DSC 10)
:::

::: {.block} 
### SE 125 or ECE 109 or ECON 120A or MAE 108 or MATH 180A or MATH 183 or MATH 186
\vspace{-2mm}
 - statistical inference
   - probably with closed-form null distributions (e.g., z-test, t-test, etc.)
:::

\vspace{5mm}

So... how does DSC 152 (this class) fit in with all this?


## What is this course about? {.t}
The focus will be on statistical data analysis and inference, leveraging what you already know from past courses. Specifically:

::: {.block} 
### What do we even mean by statistical inference?
\vspace{-2mm}
 - Hypothesis testing 
   - p-values
 - Confidence intervals
:::

\vspace{4mm}

::: {.block} 
### And we'll use the stuff you already know and build on that
\vspace{-2mm}
 - You've learned a lot about regression in previous classes (e.g., DSC 40A, but also any AI or ML course you've taken)
 - But, likely without much emphasis on doing statistical inference in the regression setting
   - In a nutshell, this will be our focus: how do we properly do statistical inference in a variety of regression settings?
:::


## Let's go back to basics and build up from there {.t}
Example: flipping a coin

![](coinflip.jpg){height="75%"}

## Let's go back to basics and build up from there {.t}
To do a hypothesis test, we need to lay out our null and alternative hypotheses:
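For our coin example, writing $p$ for the probability of heads, one natural pair (matching the two-sided p-value we compute later) is:

$$
H_0: p = 0.5 \qquad \text{vs.} \qquad H_a: p \neq 0.5
$$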


## Null distribution {.t}
In DSC 10 and DSC 80, we learned about *simulated* null distributions:

```{r, echo=FALSE, warning=FALSE}
library(reticulate)
```


```{python, echo=FALSE}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams.update({
    "font.size": 25,
    "axes.titlesize": 22,
    "axes.labelsize": 20,
    "xtick.labelsize": 20,
    "ytick.labelsize": 20,
    "legend.fontsize": 12,
})
```

```{python}
heads_array = np.array([])

for i in np.arange(10000):
    
    # Flip fair coin 6 times and count the number of Heads
    num_heads = np.random.multinomial(6, [0.5, 0.5])[0]
    
    # Add the number of heads seen to heads_array.
    heads_array = np.append(heads_array, num_heads)
  
heads_array
```


## Null distribution {.t}
In DSC 10 and DSC 80, we learned about *simulated* null distributions:

```{python, size="scriptsize", out.width="75%"}
(pd.DataFrame().assign(num_heads=heads_array).plot(kind='hist', 
  density=True, bins=np.arange(-0.5, 6.6, 1), ec='w', legend=False, 
  title = 'Distribution of the number of heads in 6 coin flips')
);
```

## Null distribution {.t}
First new thing: in this class, we are going to be using R exclusively. 

::: {.block} 
### Wait what? Why?
\vspace{-2mm}
 - As data scientists, it is useful to know both Python and R

\pause

 - Specifically, for more statistical tasks within data science, R is often a more convenient choice
 
\pause 

 - Even if, in your eventual career as a data scientist, you don't need to code in R yourself, you will very possibly be working on a team where someone else does. You at least need to be able to read their code (or know when AI translates it incorrectly)
:::

\pause
I will continue to show you the Python equivalent here and there to help you understand what we're trying to do in R, but all of your assignments must be done in R. 

\pause

Lab 1 and the first discussion section will get you up to speed on using R and R Markdown. But let's start looking at some R code now.

## Null distribution {.t}
\framesubtitle{The R equivalent of the Python code from a few slides ago:}
```{r, size="footnotesize", cache=TRUE}
heads_array <- NULL

for(i in 1:10000){
    # Flip fair coin 6 times and count the number of Heads
    num_heads <- rbinom(n=1, size=6, prob=0.5)
    
    # Add the number of heads seen to heads_array.
    heads_array <- c(heads_array, num_heads)
}
```

```{r, size="footnotesize"}
length(heads_array) # make sure the result has 10,000 elements
heads_array[1:10] # show the first 10 iterations
```


## Null distribution {.t}
Actually, while that works, we don't even need to write a loop to do this:
```{r, size="small"}
heads_array <- rbinom(n=10000, size=6, prob=0.5)
```

```{r, size="small"}
length(heads_array) # make sure the result has 10,000 elements
heads_array[1:10] # show the first 10 values
```
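For comparison, a vectorized Python sketch of the same simulation (`np.random.binomial` plays the role of `rbinom` here):

```{python}
import numpy as np

# Simulate 10,000 replications of 6 fair coin flips, all at once
heads_array = np.random.binomial(n=6, p=0.5, size=10000)

len(heads_array)  # make sure the result has 10,000 elements
heads_array[:10]  # show the first 10 values
```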

## Null distribution {.t}
```{r, echo=FALSE, warning=FALSE}
library(ggplot2)
```

```{r, out.width="75%"}
heads_df <- data.frame(heads = heads_array)
ggplot(data = heads_df, aes(x=heads)) + 
  geom_bar() + theme(text = element_text(size = 20))
```


## Now what? {.t}

::: {.block}
### What is a p-value?
\vspace{-2mm}
https://pollev.com/chi
:::

\pause
\vspace{3mm}

So then, what is the p-value here?

::: {.block}
### What is the p-value here?
\vspace{-2mm}
https://pollev.com/chi again
:::


## Now what? {.t}

::: {.block}
### What is a p-value?
\vspace{-2mm}
https://pollev.com/chi
:::


So then, what is the p-value here?

We can also do it this way:

```{r}
p_val <- sum(heads_array == 6 | heads_array == 0) / 10000
p_val
```
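The Python equivalent, as a self-contained sketch (it re-simulates `heads_array` so the snippet stands alone; `np.mean` of the logical condition gives the same proportion as summing and dividing by 10,000):

```{python}
import numpy as np

# Re-run the simulation, then take the proportion of replications
# at least as extreme as what we observed (all heads or all tails)
heads_array = np.random.binomial(n=6, p=0.5, size=10000)
p_val = np.mean((heads_array == 6) | (heads_array == 0))
p_val
```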

Note that this is a *simulated* p-value, which is only an approximation of the theoretical p-value. The answer to the previous Poll Everywhere question was the theoretical p-value!

## Binomial distribution {.t}
Flipping a coin and observing the number of heads is an example of the **Binomial Distribution** (which you sort of saw in DSC 40A, but definitely in MATH 180A, MATH 183, etc.):

::: {.block}
### Probability Mass Function of the Binomial Distribution
\vspace{-2mm}
$$
P(X=x) = {n \choose x} p^x (1-p)^{n-x}
$$
:::

\pause
So,
\vspace{-2mm}
$$
\begin{aligned}
P(X=0) &= {6 \choose 0} 0.5^0 (1-0.5)^{6-0} = 0.5^6 = 0.015625 \\
P(X=6) &= {6 \choose 6} 0.5^6 (1-0.5)^{6-6} = 0.5^6 = 0.015625
\end{aligned}
$$

## Theoretical p-value
$$
\begin{aligned}
P(X=0) &= {6 \choose 0} 0.5^0 (1-0.5)^{6-0} = 0.5^6 = 0.015625 \\
P(X=6) &= {6 \choose 6} 0.5^6 (1-0.5)^{6-6} = 0.5^6 = 0.015625
\end{aligned}
$$

Or in R:
```{r}
dbinom(x=0, size=6, prob=0.5)
dbinom(x=6, size=6, prob=0.5)
```
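To connect the formula to code, here is a plain-Python version of the same PMF (the helper name `binom_pmf` is mine, not from a library):

```{python}
from math import comb

def binom_pmf(x, n, p):
    # P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

binom_pmf(0, 6, 0.5)  # 0.015625, matching dbinom(x=0, size=6, prob=0.5)
binom_pmf(6, 6, 0.5)  # 0.015625
```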


## Theoretical p-value

::: {.block}
### and so the theoretical p-value is:

```{r}
exact_p_val <- dbinom(x=0, size=6, prob=0.5) + 
               dbinom(x=6, size=6, prob=0.5)
exact_p_val
```
:::


::: {.block}
### compared to the simulated p-value:
```{r}
p_val
```
:::

So either way, we find that there is a fairly low probability of observing something at least as extreme as we did, if $H_0$ is true. 

## What's the point? {.t}

::: {.block}
### Simulated p-values as you learned in DSC 10 and 80 are good, but:
\vspace{-2mm}
 - They are more computationally expensive (not a big burden in this example, but in more complex cases it can matter)
 - They (may) have lower *statistical power* than theoretical p-values (though this is often fairly minor in practice)
   - (What do we mean by statistical power? We'll cover that in much more depth later.)
:::

::: {.block}
### Advantages of simulated p-values
\vspace{-2mm}
 - They require fewer distributional conditions on your data (and often none whatsoever)
 - They may have better Type I Error rates, if the distributional conditions of a theoretical test are not met
 - They do not require you to know much math (only coding)
:::

Both have their uses in practice, and in this class, we will use both!


## Type I Errors {.t}
Reminder: what's a Type I Error? https://pollev.com/chi

\pause

\vspace{3mm}
### Rejection Region
\vspace{-2mm}
To determine the Type I Error of a test, we first must define our \underline{rejection region}. 

\vspace{4mm}

That is, the values that are *extreme enough* to lead us to reject $H_0$.

\vspace{2mm}

 - Suppose in our present example, we use a rejection region of $X \in \{0, 1, 5, 6\}$.

\vspace{2mm}
 - This means that if we observe 0, 1, 5 or 6 heads out of 6 flips, we will reject $H_0$ and conclude that there is significant evidence against the coin being fair. *Does this seem reasonable??*
   - Specifically, what is the probability that we incorrectly reject $H_0$ with this rejection region?
   
   
## Type I Errors {.t}
Then, the \underline{probability of a Type I Error} here would be the probability that we flip 0, 1, 5 or 6 heads *if the coin is actually fair*.

\pause
What is this equal to?

\pause
$P(X=0) + P(X=1) + P(X=5) + P(X=6)$ with a fair coin...
```{r, size="scriptsize"}
TypeI <- dbinom(x=0, size=6, prob=0.5) + dbinom(x=1, size=6, prob=0.5) + 
  dbinom(x=5, size=6, prob=0.5) + dbinom(x=6, size=6, prob=0.5)
TypeI
```

\pause
So if we had decided to use $\{0, 1, 5, 6\}$ as our rejection region, then there would be a `r TypeI*100`% chance of making a Type I Error. 

... that's pretty high!
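As a quick Python check of that number (reusing the PMF formula from a few slides ago, with `math.comb` instead of a stats library):

```{python}
from math import comb

def binom_pmf(x, n, p):
    # Binomial PMF, straight from the formula on the earlier slide
    return comb(n, x) * p**x * (1 - p)**(n - x)

# P(reject H0 | fair coin) = P(X in {0, 1, 5, 6})
type_I = sum(binom_pmf(x, 6, 0.5) for x in [0, 1, 5, 6])
type_I  # 0.21875
```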

## Type I Errors {.t}

::: {.block}
### $\alpha$-level test
\vspace{-2mm}
A statistical test with an $\alpha$ probability of making a Type I Error is referred to as an $\alpha$-level test.
\vspace{3mm}

So the example on the previous slide would be a `r TypeI`-level test. 

\vspace{3mm}
The most commonly used value for $\alpha$ is 0.05. So in that case, we would be doing a 0.05-level test. 
:::

But often, $\alpha$ is only an approximation, and/or relies on distributional conditions to be correct!


\vspace{4mm}
Next time: what exactly do we mean by that?

## To-dos

 - If you do not already have R and RStudio installed on your machine, get them installed!
   - https://posit.co/download/rstudio-desktop/

\vspace{3mm}

 - Tomorrow's discussion section will be an introduction to R and R Markdown. If you are new to either of these, please plan to attend.
   - You may attend either the 3pm or the 4pm section (both are in Center Hall 216).

\vspace{3mm}
   
 - Read the course syllabus and complete the Syllabus Check by tomorrow 4/1 at midnight
 
\vspace{3mm}

 - Complete the Welcome Survey by tomorrow 4/1 at midnight
 
\vspace{3mm}

 - Complete today's Daily Check by **tonight at midnight**