```{r setup, include=FALSE}
  knitr::opts_chunk$set(echo = TRUE)
  knitr::opts_chunk$set(dev = 'pdf')
  def.chunk.hook  <- knitr::knit_hooks$get("chunk")
  knitr::knit_hooks$set(chunk = function(x, options) {
    x <- def.chunk.hook(x, options)
    ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
  })
```




## What is an A/B test? {.t}
A/B testing was briefly mentioned in DSC 10, though you have likely heard this term elsewhere too. 

\vspace{2mm}

::: {.t}
### Designed experiments
\vspace{-2mm}

 - "A/B testing" is basically just the data science / business analytics communities' term for a designed experiment.
 
\vspace{2mm}

 - The "A/B" in the name of it refers to having an "A group" and a "B group" that you want to make comparisons between.
 
\vspace{2mm}

 - That said, a design could also have more groups, and more factors...
:::

We will only cover the basics here and then focus on the relevant statistical inference; most of the nuts and bolts of more complex A/B testing are beyond the scope of this course.


## Example: Coding vs. Handwritten Exercises {.t}

You all were invited to participate in my research study on teaching probability theory.

\vspace{2mm}

::: {.t}
### Overall question
\vspace{-2mm}
Does doing coding exercises have a different impact on students' resulting understanding of probability theory than doing handwritten exercises does?

:::

\vspace{-10mm}

```{r, echo=FALSE}
plot.new()
plot.window(xlim = c(0, 10), ylim = c(0, 10)) 

# Write pre-test at the top and put a box around it
pretest <- "Pre-test"
xt <- 4.5
yt <- 9.5
text(x=xt, y=yt, labels=pretest, cex=2)

sw   <- strwidth(pretest)*2
sh   <- strheight(pretest)*2
frsz <- 0.2

rect(
  xt - sw/2 - frsz,
  yt - sh/2 - frsz,
  xt + sw/2 + frsz,
  yt + sh/2 + frsz
)

# draw arrows to each group
arrows(x0=4, y0=8.75, x1=2, y1=6)
arrows(x0=5, y0=8.75, x1=7, y1=6)

# initiate text for groups
handwrit <- "1) Handwritten Exercises"
coding <- "1) Coding Exercises"
posttest <- "2) Post-test"


# Write text and box for handwritten
xt <- 2.2
yt <- 5.3
text(x=xt, y=yt, labels=handwrit, cex=2)
text(x=xt, y=(yt-0.75), labels=posttest, cex=2)

sw   <- strwidth(handwrit)*2
sh   <- strheight(handwrit)*4
frsz <- 0.3

rect(
  xt - sw/2 - frsz,
  yt - (0.75/2) - sh/2 - frsz,
  xt + sw/2 + frsz,
  yt - (0.75/2) + sh/2 + frsz
)

# Write text and box for coding
xt <- 7.5
yt <- 5.3
text(x=xt, y=yt, labels=coding, cex=2)
text(x=xt, y=(yt-0.75), labels=posttest, cex=2)

sw   <- strwidth(coding)*2
sh   <- strheight(coding)*4
frsz <- 0.3

rect(
  xt - sw/2 - frsz,
  yt - (0.75/2) - sh/2 - frsz,
  xt + sw/2 + frsz,
  yt - (0.75/2) + sh/2 + frsz
)


```



## Example: Coding vs. Handwritten Exercises {.t}

::: {.t}
### Study design
\vspace{-2mm}
 - All participants take the pre-test.

\vspace{0.5mm}

 - Participants then get randomly assigned to either:
   - Do the handwritten exercises
   - Do the coding exercises
   
\vspace{0.5mm}

 - Participants then take the post-test.
 
\vspace{0.5mm}

 - We record (post-test score - pre-test score) for each participant.
 
:::

\pause 

Statistical hypotheses:
$$
\begin{aligned}
H_0&\colon \mu_1 = \mu_2 \\
H_A&\colon \mu_1 \neq \mu_2
\end{aligned}
$$

 - $\mu_1$ is the true mean (post-test $-$ pre-test) score difference among those who did the handwritten exercises
 - $\mu_2$ is the true mean (post-test $-$ pre-test) score difference among those who did the coding exercises
 
 

## Example: Coding vs. Handwritten Exercises {.t}

::: {.t}
### Study design
\vspace{-2mm}
 - All participants take the pre-test.

\vspace{2mm}

 - **Participants then get randomly assigned to either:**
   - **Do the handwritten exercises**
   - **Do the coding exercises**
   
\vspace{2mm}

 - Participants then take the post-test.
 
\vspace{2mm}

 - We record (post-test score - pre-test score) for each participant.
 
:::

This \underline{random assignment} is the key to being able to infer causation. Why?


## Example: Coding vs. Handwritten Exercises {.t}
More space for notes if needed



## Example: Coding vs. Handwritten Exercises {.t}
We are collecting the real data now, so we don't have any to show yet. In the meantime, let us consider the following small set of preliminary, fictional data:

```{r, echo=FALSE}
handwritten <- c(2, 5, 3, -3, 8)
coding <- c(9, 10, -1, 14, 6)

prob_df <- data.frame(score_diff = c(handwritten, coding),
                      group = c(rep("handwritten", 5), rep("coding", 5)))

knitr::kable(prob_df)
```

## Example: Coding vs. Handwritten Exercises {.t}
```{r, warning=FALSE, message=FALSE}
library(dplyr)
prob_summary <- prob_df %>% 
  group_by(group) %>% 
  summarize(mean=mean(score_diff), sd=sd(score_diff), n=n())

knitr::kable(prob_summary)
```


## Example: Coding vs. Handwritten Exercises {.t}
\framesubtitle{Let's visualize it}
::: {.block}
### Quote of the Day \#1 
\vspace{-2mm}
Don't be a moron; *look* at your data
:::

\begin{minipage}{0.8\textwidth}
\begin{flushright}
Dr. Jon Wakefield \\
Professor of Statistics \\
and Biostatistics \\
University of Washington
\end{flushright}
\end{minipage}
\begin{minipage}{0.19\textwidth}
\includegraphics[width=\textwidth]{wakefield.png}
\end{minipage}

\vspace{2mm}

::: {.block}
### Quote of the Day \#2
\includegraphics[width=\textwidth]{wei_inspect.png}
:::


## Example: Coding vs. Handwritten Exercises {.t}
\label{yt1}
\framesubtitle{Your Turn \#1}
Let's make an appropriate data viz for these data. What might that be?

\vspace{5mm}

Here is code to generate the dataframe that you may copy-paste or type into your Rmd file:
```{r, size="small", eval=FALSE}
hand_code_df <- data.frame(score_diff = c(2, 5, 3, -3, 8, 
                                          9, 10, -1, 14, 6),
                           group = c(rep("handwritten", 5), 
                                     rep("coding", 5)))
```

Make the graph in your Rmd file along with brief comments on what you observe. 
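
\vspace{2mm}

For reference, one reasonable choice (a sketch, not the only good answer) is side-by-side boxplots:

```{r, size="small", eval=FALSE}
# One possible visualization: side-by-side boxplots of the score differences
boxplot(score_diff ~ group, data = hand_code_df,
        ylab = "post-test - pre-test score difference")
```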


## Example: Coding vs. Handwritten Exercises {.t}
The standard statistical test for a two-group A/B test is just the two-sample t-Test:

```{r, size="small"}
t.test(score_diff ~ group, data=prob_df)
```
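
If you need just the p-value programmatically, it can be extracted from the returned object:

```{r, size="small", eval=FALSE}
# The result of t.test() is a list; p.value is one of its components
t.test(score_diff ~ group, data = prob_df)$p.value
```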

## Example: Coding vs. Handwritten Exercises {.t}
\label{validity}
What are the conditions required for validity of a two-sample t-Test?

https://pollev.com/chi

\pause


\vspace{10mm}
::: {.t}
### And as a reminder, what do we even mean by "validity"?
\vspace{20mm}


:::


\vspace{5mm}

(the answer to this will go in your R Markdown file for today's Daily Check)


## The Permutation Test {.t}
You have seen the permutation test in DSC 10 and DSC 80, so this is just a quick reminder.

https://www.rossmanchance.com/applets/

\pause

\vspace{2mm}

::: {.t}
### two-sample t-Test vs. permutation test
\vspace{-2mm}
The permutation test is a non-parametric version of the two-sample t-Test.

 - It does not require any distributional conditions whatsoever
 - It actually CAN test the same thing as a two-sample t-Test
 - And it generally performs quite well!

:::


Contrast this with the non-parametric alternatives we saw in the one-sample case:

 - The sign test is actually a test of the median instead of the mean
 - The bootstrap hypothesis test didn't work very well
 
 
## The Permutation Test {.t}
Now, how do we code a permutation test in R?

 - First, write a function that calculates the statistic of interest from the dataframe (here, the absolute value of the difference in group means works well)

\vspace{2mm}

 - Then, in a loop, we shuffle either the group variable or the outcome variable (either works) and recompute the statistic on the shuffled data
 
\vspace{2mm}

 - Store these statistics in a vector using the R equivalent of the accumulator pattern -- this vector is the null distribution

\vspace{2mm}

 - Count the proportion of the null distribution that is at least as extreme as the observed statistic from the data -- this proportion is the p-value (see the sketch below)
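
\vspace{2mm}

Here is one minimal sketch of those steps in R, applied to the fictional `prob_df` from earlier (one of many reasonable implementations; the seed is arbitrary and the chunk is not evaluated here):

```{r, size="scriptsize", eval=FALSE}
# Statistic of interest: absolute difference in group means
abs_mean_diff <- function(df) {
  means <- tapply(df$score_diff, df$group, mean)
  abs(means["handwritten"] - means["coding"])
}

observed <- abs_mean_diff(prob_df)

n_repetitions <- 1000
null_stats <- rep(NA, n_repetitions)  # accumulator for the null distribution

set.seed(42)  # arbitrary, just for reproducibility
for (i in 1:n_repetitions) {
  shuffled <- prob_df
  shuffled$group <- sample(shuffled$group)  # shuffle the group labels
  null_stats[i] <- abs_mean_diff(shuffled)
}

# p-value: proportion of null statistics at least as extreme as observed
mean(null_stats >= observed)
```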
 
 
## The Permutation Test {.t}
\framesubtitle{A Python example, copied from DSC 10 (birthweight vs. maternal smoking)}

```{r, echo=FALSE, warning=FALSE}
library(reticulate)
```

```{python, eval=FALSE, size="scriptsize"}
import numpy as np
# NOTE: assumes the `babies` DataFrame, its copy `babies_with_shuffled`, and the
# observed statistic `diff_in_means` were defined earlier in the DSC 10 notebook.

# Function to calculate the test statistic
def difference_in_group_means(weights_df):
    group_means = weights_df.groupby('Shuffled_Labels').mean().get('Birth Weight')
    return group_means.loc[False] - group_means.loc[True]

# Initialization for loop
n_repetitions = 1000 
differences = np.array([])

for i in np.arange(n_repetitions):
    # Step 1: Shuffle the labels to create two new samples.
    shuffled_labels = np.random.permutation(babies.get('Maternal Smoker'))
    
    # Step 2: Add them as a column to the DataFrame.
    shuffled = babies_with_shuffled.assign(Shuffled_Labels=shuffled_labels)
    
    # Step 3: Compute the difference in group means in the two new samples.
    difference = difference_in_group_means(shuffled)
    
    differences = np.append(differences, difference)

# Calculate p-value
np.count_nonzero(differences >= diff_in_means) / n_repetitions
```

 
## The Permutation Test {.t}
\label{yt2}
\framesubtitle{Your Turn \#2}
In an R Markdown document, write code to run a permutation test on a dataframe and calculate the p-value.

\vspace{4mm}
You may approach this by translating the Python code on the previous slide into R (changing whatever else needs to change along the way).

\vspace{4mm}
Note: for this task, you may assume that the group labels will be "handwritten" and "coding"; that is, your function here does not need to be flexible with regard to that. 

\vspace{4mm}

Some things will work very similarly in R, but other things may need slightly different approaches.

## t-Test vs. Permutation Test {.t}
\label{compare}
So, which one should we do?

https://pollev.com/chi


## t-Test vs. Permutation Test {.t}
```{r, echo=FALSE, out.width="85%", fig.align='center'}
load("Lec06b.RData")

plot(all_power ~ n, pch=19, ylim=c(0,1), ylab="power",
     main=expression(paste("Power Curves with ", delta,
                           "=4.5, each sample is N(", mu[i], ",", sigma, "=4)")),
     cex.main=2, cex.lab=2, cex.axis=2, cex=2)
lines(all_power[1:6] ~ n[1:6])

# t-Test power at the same sample sizes, for comparison
all_powert <- numeric(6)
for(i in 1:6){
  all_powert[i] <- power.t.test(n[i], delta=4.5, sd=4)$power
}

points(all_powert ~ n, pch=15, cex=2)
lines(all_powert ~ n)
legend("right", pch=c(15, 19), legend=c("t-Test",
                                        "permutation test"
                                        ), cex=2)
```

\vspace{-5mm}

\scriptsize Power estimates for the permutation test were obtained by simulation, with 1000 permutations per test and 1000 simulated replicates.
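
A permutation-test power estimate like those plotted above could be computed along these lines (a sketch with a hypothetical helper `perm_test_power`; the exact code that produced `Lec06b.RData` may differ):

```{r, size="scriptsize", eval=FALSE}
# Estimate the power of the permutation test by simulation: simulate many
# datasets under the alternative, run a permutation test on each, and
# record how often H_0 is rejected at level alpha.
perm_test_power <- function(n, delta = 4.5, sigma = 4,
                            n_reps = 1000, n_perms = 1000, alpha = 0.05) {
  rejections <- 0
  for (r in 1:n_reps) {
    x <- rnorm(n, mean = 0,     sd = sigma)  # group A
    y <- rnorm(n, mean = delta, sd = sigma)  # group B
    observed <- abs(mean(x) - mean(y))
    combined <- c(x, y)
    null_stats <- replicate(n_perms, {
      shuffled <- sample(combined)           # permute group membership
      abs(mean(shuffled[1:n]) - mean(shuffled[-(1:n)]))
    })
    if (mean(null_stats >= observed) <= alpha) rejections <- rejections + 1
  }
  rejections / n_reps
}

perm_test_power(10)  # estimated power at n = 10 per group
```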


## Recap and Looking Ahead {.t}

### Recap
\vspace{-2mm}

 - A/B testing simply refers to experimental studies
 
\vspace{3mm}

 - Experimental studies allow us to infer causation
   - It is much more difficult (though not strictly impossible) to infer causation from observational studies

\vspace{3mm}

 - If it's just two conditions, then the proper statistical analysis is either the two-sample t-Test or a permutation test
   - The two-sample t-Test is a \underline{parametric} test (requires distributional conditions)
   - The permutation test is a \underline{non-parametric} test (does not require distributional conditions)
   - Both are valid approaches!



## Recap and Looking Ahead {.t}

::: {.t}
### Summary of today's Daily Check
\vspace{-2mm}
1. The graph from Your Turn \#1 on Slide \ref{yt1}
2. The answer to the "conditions" and "validity" questions on Slide \ref{validity}
3. Permutation test from Your Turn \#2 on Slide \ref{yt2}
4. A description of the similarities and differences between a two-sample t-Test and a permutation test as discussed with the Poll Everywhere on Slide \ref{compare}

Put all of this into an R Markdown document, and submit your PDF output to Gradescope.
:::

\vspace{5mm}

::: {.t}
### Next time
\vspace{-2mm}
Regression!

:::
