```{r setup, include=FALSE}
  knitr::opts_chunk$set(echo = TRUE)
  knitr::opts_chunk$set(dev = 'pdf')
  def.chunk.hook  <- knitr::knit_hooks$get("chunk")
  knitr::knit_hooks$set(chunk = function(x, options) {
    x <- def.chunk.hook(x, options)
    ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
  })
```


## Remember this?

![](bacon.jpeg)

## Remember this?

![](guardianbacon.png)

## Remember this?

![](carcinogenicity.png)

The "18\% increase" represents a "relative risk" of 1.18. What is a relative risk?

## Binary Outcome Variables {.t}

::: {.t}
### The outcome variable in this example is binary:

 - Get colon cancer
 - Don't get colon cancer

:::


\vspace{5mm}

::: {.t}
### And in fact, the predictor variable is also binary:

 - Eat $\geq$ 50g/day of red/processed meats
 - Do not eat $\geq$ 50g/day of red/processed meats

:::

## The data {.t}

::: {.t}
### The raw data look (something)$^*$ like this:

\begin{center}
  \begin{tabular}{ c c|c|c }
	   & \multicolumn{2}{c}{colon cancer} \\
	    \multicolumn{1}{c|}{eat meats} & yes & no & \\ 
    \hline
      \multicolumn{1}{c|}{yes} & 13,513 & 215,725 \\ 
    \hline
  \multicolumn{1}{c|}{no} & 12,018 & 228,355 \\
    \hline
     \multicolumn{1}{c|}{} &\multicolumn{1}{|c|}{} &
  \end{tabular}
\end{center}	
:::

\vspace{5mm}

$^*$These data were generated by Claude, under directions to make them as similar as possible to the real data based on what information is publicly available.

## The data formation (in case you are curious) {.t}

![](Claude1.png){width=75%}

## The data formation (in case you are curious) {.t}

![](Claude2.png)

## Now what is a relative risk? {.t}
\label{rr}

::: {.t}
### A relative risk (RR) is simply:
\vspace{-2mm}
$$
RR = \frac{p_1}{p_2}
$$
:::

where:

 - $p_1$ is the probability of observing the outcome of interest among the exposed group
 - $p_2$ is the probability of observing the outcome of interest among the unexposed group
 
 
Here, they said it's 1.18...

## Relative Risk of Colon Cancer from Bacon {.t}

::: {.t}
### The data again
\begin{center}
  \begin{tabular}{ c c|c|c }
	   & \multicolumn{2}{c}{colon cancer} \\
	    \multicolumn{1}{c|}{eat meats} & yes & no & totals \\ 
    \hline
      \multicolumn{1}{c|}{yes} & 13,513 & 215,725 & 229,238 \\ 
    \hline
  \multicolumn{1}{c|}{no} & 12,018 & 228,355 & 240,373 \\
    \hline
     \multicolumn{1}{c|}{} &\multicolumn{1}{|c|}{} &
  \end{tabular}
\end{center}	

:::

$$
\widehat{RR} = \frac{\frac{13,513}{229,238}}{\frac{12,018}{240,373}} \approx 1.18
$$

Now, what about a statistical test?
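
## Relative Risk of Colon Cancer from Bacon {.t}
\framesubtitle{Checking the arithmetic in R}

As a quick check, the estimate above can be reproduced from the four cell counts in the table. This is only a sketch; the object names are made up for illustration.

```{r, size="footnotesize"}
# Cell counts from the table (rows: eat meats yes / no)
cancer_exposed   <- 13513; total_exposed   <- 229238
cancer_unexposed <- 12018; total_unexposed <- 240373

p1_hat <- cancer_exposed / total_exposed      # estimated risk among the exposed
p2_hat <- cancer_unexposed / total_unexposed  # estimated risk among the unexposed

rr_hat <- p1_hat / p2_hat
round(rr_hat, 2)
```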

## Relative Risk of Colon Cancer from Bacon {.t}

If there is no impact of bacon on colon cancer, then we would expect that $RR=1$. So,
$$
\begin{aligned}
H_0\colon & RR=1 \\
H_A\colon & RR \neq 1
\end{aligned}
$$

which is also equivalent to 

$$
\begin{aligned}
H_0\colon & p_1 = p_2 \\
H_A\colon & p_1 \neq p_2
\end{aligned}
$$

There are lots of different ways to test this...
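
## Relative Risk of Colon Cancer from Bacon {.t}
\framesubtitle{One of the many ways: a test of two proportions}

As one illustration of these many ways (not the approach taken on the next slides), base R's `prop.test()` tests $H_0\colon p_1 = p_2$ directly from the counts. The object name `test` is arbitrary.

```{r, size="footnotesize"}
# Two-sample test of equal proportions (H0: p1 = p2)
test <- prop.test(x = c(13513, 12018),    # colon cancer counts: exposed, unexposed
                  n = c(229238, 240373))  # group sizes: exposed, unexposed

test$estimate  # the two sample proportions
test$p.value   # two-sided p-value
```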

## Let's do a permutation test {.t}
Here is a sample of the dataframe:

```{r, echo=FALSE}
bacon <- data.frame(colon_cancer=c(rep(1, 13513+12018), rep(0, 215725+228355)),
                    eat_meats=c(rep(1, 13513), rep(0, 12018), rep(1, 215725), rep(0, 228355)))

knitr::kable(bacon[sample(dim(bacon)[1], 10), ])
```

Now we can just shuffle either column...
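
## Let's do a permutation test {.t}
\framesubtitle{What shuffling a column does}

To make "shuffling" concrete, here is a minimal sketch on a made-up six-row dataframe (not the bacon data). The names `toy` and `shuffled` and the seed are assumptions for illustration only.

```{r, size="footnotesize"}
set.seed(1)  # arbitrary seed, only so the shuffle is reproducible

toy <- data.frame(outcome  = c(1, 1, 0, 0, 0, 0),
                  exposure = c(1, 1, 1, 0, 0, 0))

shuffled <- toy
shuffled$exposure <- sample(toy$exposure)  # permute one column only

# The column totals are unchanged; only the pairing of rows differs
sum(shuffled$exposure) == sum(toy$exposure)
shuffled
```

Shuffling keeps the number of exposed individuals and the number of cases fixed, but breaks any link between the two columns, which is exactly what $H_0$ says.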


## Let's do a permutation test {.t}

::: {.t}
### Your Turn \#1
\vspace{-2mm}
Write R code for a permutation test here:

 - First, you should write a function to calculate the observed RR for any given dataframe. 
   - Your function may assume that the input will be a dataframe in which the first column is the outcome variable and the second column is the exposure variable.
 - Then, do the shuffling on the dataframe
 - Store the RR from each permuted dataframe into a vector
 - Find the p-value
:::

## Let's do a permutation test {.t}
\framesubtitle{Your Turn \#1}


For this illustration, here is code to create a smaller dataframe that you may simply copy/retype:
```{r, eval=FALSE, size="footnotesize"}
bacon <- data.frame(colon_cancer=c(rep(1, 135+120), 
                                   rep(0, 2157+2283)),
                    eat_meats=c(rep(1, 135), rep(0, 120), 
                                rep(1, 2157), rep(0, 2283)))
```

A couple of points:

 - If we used the full dataframe, it would take much longer to run, and also would give a p-value of basically 0 since the sample size is so large.
 - The key is figuring out how to calculate estimates of $p_1$ and $p_2$ as defined on Slide \ref{rr}.

## Limitations of the relative risk {.t}
Note that we were able to calculate the RR here because the data were cross-sectional. What do we mean by that?

\vspace{5mm}

This is in contrast to a "cohort study" and a "case-control study." 


## Limitations of the relative risk {.t}
\framesubtitle{Summary}

::: {.t}
### Cross-sectional study
\vspace{-2mm}
Data are collected irrespective of exposure or outcome status (i.e. just a simple random sample of the population). RR can be calculated.

:::

\vspace{5mm}

::: {.t}
### Cohort study
\vspace{-2mm}
Study participants are either split into cohorts (those who are exposed and those who are unexposed) and observed over time, or they are sampled based on exposure status. RR can be calculated.

:::

\vspace{5mm}

::: {.t}
### Case-control study
\vspace{-2mm}
Study participants are sampled based on outcome status. RR can NOT be calculated.

:::


## Limitations of the relative risk {.t}
But, the case-control study design is sometimes a very appealing one, specifically when the outcome is rare! Why?

::: {.t}
### Benefits of the case-control study design
\vspace{-2mm}

 - Suppose we are studying a very rare disease. If we do a cross-sectional or cohort study and do not have a gigantic sample size, we might not get any individuals with the disease into our study!
 
\vspace{3mm}

 - If we instead sample participants into our study based on their disease status, we can ensure that we have individuals with the disease in our study.


::: 

But if we do that, then we have a denominator issue if we wanted to calculate the RR! Why?


## Limitations of the relative risk {.t}
Suppose we are studying bladder cancer (which has a population prevalence of approximately 0.023\%) and want to investigate whether consumption of artificial sweeteners causes this cancer. 

\vspace{2mm}

Side question: what would be the ideal study design and why can't we do that?

\vspace{6mm}

### So instead we perform a case-control study in which:
\vspace{-2mm}
 - We sample 100 cases and 100 controls
 - We determine how many of each regularly consume artificial sweeteners


## Limitations of the relative risk {.t}

::: {.t}
### The data look like this:
\vspace{-1mm}

\begin{center}
  \begin{tabular}{ c c|c|c }
	   & \multicolumn{2}{c}{bladder cancer} \\
	    \multicolumn{1}{c|}{sweetener use} & yes & no & totals \\ 
    \hline
      \multicolumn{1}{c|}{yes} & 33 & 25 & 58 \\ 
    \hline
  \multicolumn{1}{c|}{no} & 67 & 75 & 142 \\
    \hline
     \multicolumn{1}{c|}{totals} & 100 & 100 & 200
  \end{tabular}
\end{center}	

:::

If we wanted to calculate an estimate of the RR, it would be:
$$
\widehat{RR} = \frac{\frac{33}{58}}{\frac{67}{142}}
$$

But based on our study design, 58 and 142 are not valid denominators! 

 - For example, this would suggest that 33 out of 58 people who use sweeteners get bladder cancer (approximately 57\%). 
 - Recall that the population prevalence of bladder cancer is approximately 0.023\%!


## Limitations of the relative risk {.t}

::: {.t}
### The data look like this:
\vspace{-1mm}

\begin{center}
  \begin{tabular}{ c c|c|c }
	   & \multicolumn{2}{c}{bladder cancer} \\
	    \multicolumn{1}{c|}{sweetener use} & yes & no & totals \\ 
    \hline
      \multicolumn{1}{c|}{yes} & 33 & 25 & 58 \\ 
    \hline
  \multicolumn{1}{c|}{no} & 67 & 75 & 142 \\
    \hline
     \multicolumn{1}{c|}{totals} & 100 & 100 & 200
  \end{tabular}
\end{center}	

:::

Conversely, it is valid to calculate:
$$
\frac{\frac{33}{100}}{\frac{25}{100}}
$$
since both 100s are valid denominators. But this doesn't exactly answer the question we wanted...

 - Out of those who have bladder cancer, 33\% of them used sweeteners
 - Out of those who do not have bladder cancer, 25\% of them used sweeteners

## Limitations of the relative risk {.t}

::: {.t}

### The quantity shown on the previous slide is pretty much never used. 
\vspace{-2mm}
 - Statistically, it is a fine measure and could be validly used for hypothesis testing.
 - But it does not have an interpretation that we want.

:::

Specifically, we frequently want to know answers to questions like, "What is the increase in your risk of bladder cancer associated with using artificial sweeteners?"

This question is not possible to answer with the quantity on the previous slide. 

## Odds Ratios {.t}
So, if we can't calculate the RR in a case-control study, and we don't want to use that unnamed quantity from the last couple of slides, what can we do instead?

\vspace{4mm}

### Case-control studies use Odds Ratios
\vspace{-2mm}
What are Odds Ratios?

\vspace{3mm}

First, let's define an "odds."

## What is an "odds"? {.t}
In everyday language, the words "probability" and "odds" are frequently used interchangeably. 

::: {.t}
### Examples:
\vspace{-2mm}
 - "What are the odds that it will rain tomorrow?"
 - "What are the odds that this will show up on the exam?"
:::

When we hear questions like these, we usually answer them as if they were asking for a probability. This is generally accepted in daily conversation, but is technically incorrect!

## Mathematical definition of an "odds" {.t}

If $p$ is a probability of some event, then the odds of that event is defined as:
$$
\frac{p}{1-p}
$$

It is also often expressed as a ratio in this manner: $X$:$Y$, where $X$ and $Y$ are integer values.

::: {.t}
### Examples:
\vspace{-2mm}
 - If the probability that some sports team will win tomorrow's game is 60\%, then their odds of winning the game is 3:2
 - If the probability of getting a card that will give you the winning poker hand is 10\%, then your odds of winning are 1:9

:::

Sidenote: the gambling community is one of the few that actually uses the term "odds" correctly!
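
## Mathematical definition of an "odds" {.t}
\framesubtitle{In R}

The two examples can be checked with a one-line helper. The function name `odds()` is made up here; it is not a base R function.

```{r, size="footnotesize"}
# Convert a probability into odds, p / (1 - p)
odds <- function(p) p / (1 - p)

odds(0.60)  # 1.5, i.e. 3:2
odds(0.10)  # about 0.111, i.e. 1:9
```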


## The Odds Ratio {.t}
\label{ordef}
Now, the odds ratio is defined as:
$$
OR = \frac{\frac{p_1}{1-p_1}}{\frac{p_2}{1-p_2}}
$$

where:

 - $p_1$ is the probability of observing \underline{the exposure} among \underline{the cases}
 - $p_2$ is the probability of observing \underline{the exposure} among \underline{the controls}
 
Note carefully the consistency of what the denominator is, and the study design. Also note the difference between the $p_1$ and $p_2$ here vs. in the definition of a relative risk!
 
## The Odds Ratio {.t}

::: {.t}
### So what's the odds ratio for these data?
\vspace{-1mm}

\begin{center}
  \begin{tabular}{ c c|c|c }
	   & \multicolumn{2}{c}{bladder cancer} \\
	    \multicolumn{1}{c|}{sweetener use} & yes & no & totals \\ 
    \hline
      \multicolumn{1}{c|}{yes} & 33 & 25 & 58 \\ 
    \hline
  \multicolumn{1}{c|}{no} & 67 & 75 & 142 \\
    \hline
     \multicolumn{1}{c|}{totals} & 100 & 100 & 200
  \end{tabular}
\end{center}	

:::
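
Working directly from the definition (a sketch of the calculation, taking $\hat{p}_1 = 33/100$ as the proportion of sweetener users among cases and $\hat{p}_2 = 25/100$ as the proportion among controls):

$$
\widehat{OR} = \frac{\frac{33/100}{67/100}}{\frac{25/100}{75/100}} = \frac{33/67}{25/75} \approx 1.48
$$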


## The Odds Ratio {.t}
A couple of notes:

::: {.t}
### A shortcut for calculating the odds ratio
\vspace{-2mm}
The odds ratio on the previous slide is also equal to just:
$$
\widehat{OR} = \frac{33 \times 75}{25 \times 67}
$$
:::

\vspace{5mm}

::: {.t}
### The odds ratio is symmetric!
\vspace{-2mm}
In the definition of the odds ratio on Slide \ref{ordef}, we defined $p_1$ and $p_2$ as:

 - $p_1$ is the probability of observing the exposure among the cases
 - $p_2$ is the probability of observing the exposure among the controls
 
because these are the values that are possible to calculate in a case-control design...

:::


## The Odds Ratio {.t}

::: {.t}
### The odds ratio is symmetric!
\vspace{-2mm}
Thus, the odds ratio can be interpreted as: "the increase in the odds of having been exposed, among cases
as compared to controls."

\vspace{3mm}

However, consider again the bacon data:

\begin{center}
  \begin{tabular}{ c c|c|c }
	   & \multicolumn{2}{c}{colon cancer} \\
	    \multicolumn{1}{c|}{eat meats} & yes & no & totals \\ 
    \hline
      \multicolumn{1}{c|}{yes} & 13,513 & 215,725 & 229,238 \\ 
    \hline
  \multicolumn{1}{c|}{no} & 12,018 & 228,355 & 240,373 \\
    \hline
     \multicolumn{1}{c|}{} & 25,531 & 444,080 & 469,611
  \end{tabular}
\end{center}	


:::

## The Odds Ratio {.t}
\framesubtitle{Your Turn \#2:}
\label{yt2}

 - First, calculate the odds ratio according to the definition on Slide \ref{ordef} (for illustration, don't go straight to the shortcut).
   - This will represent the increase in the odds of being a processed meat eater associated with having colon cancer.
   
\vspace{3mm}

 - Then, calculate the odds ratio in the OTHER direction, to find the increase in the odds of having colon cancer, associated with being a processed meat eater.
 
\vspace{3mm}

 - Comment briefly on what you observe.
 

## The Odds Ratio {.t}

::: {.t}
### Again, the odds ratio is symmetric!
\vspace{-2mm}
What this means is that:

 - The odds ratio of being \underline{exposed} comparing \underline{cases to controls} IS EQUAL to the odds ratio of being \underline{a case} comparing \underline{exposed to unexposed} individuals!

\vspace{3mm}

 - In a case-control study where it is impossible to calculate a relative risk of being a case comparing exposed to unexposed individuals:
   - We can still calculate the odds ratio of being exposed comparing cases to controls!
   - And this is actually mathematically equal to the thing we would rather know!
:::
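
## The Odds Ratio {.t}
\framesubtitle{Why the symmetry holds}

A short sketch of why this works for any 2-by-2 table. The cell labels are notation introduced only for this slide: $a$ = exposed cases, $b$ = exposed controls, $c$ = unexposed cases, $d$ = unexposed controls.

$$
\begin{aligned}
\text{odds of exposure, cases vs. controls:} \quad & \frac{a/c}{b/d} = \frac{ad}{bc} \\
\text{odds of being a case, exposed vs. unexposed:} \quad & \frac{a/b}{c/d} = \frac{ad}{bc}
\end{aligned}
$$

Both directions reduce to the same cross-product, which is also where the shortcut formula comes from.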


## The Odds Ratio {.t}
\framesubtitle{It's still not quite the relative risk though...}

Note that $OR \neq RR$. Does it matter?

::: {.t}
### Interpretable Effect Size
\vspace{-2mm}
In the bladder cancer example, we found $\widehat{OR} \approx `r round((33*75)/(25*67), 2)`$. What does this mean?

\vspace{3mm}

If it had been $RR=`r round((33*75)/(25*67), 2)`$, it would be correct to interpret this to mean that using artificial sweeteners increases your risk of bladder cancer by 48\%. What can we say about the odds ratio?
:::

An OR of `r round((33*75)/(25*67), 2)` means a 48\% increase in the \underline{odds}. How does this differ from the relative risk?


## The Odds Ratio {.t}
\framesubtitle{It's still not quite the relative risk though...}

::: {.t}
### A little example:
\vspace{-2mm}
Suppose these are cross-sectional data (so the relative risk is valid to calculate):
\begin{center}
  \begin{tabular}{ c c|c|c }
	   & \multicolumn{2}{c}{disease} \\
	    \multicolumn{1}{c|}{exposure} & yes & no & totals \\ 
    \hline
      \multicolumn{1}{c|}{yes} & 7 & 4 & 11 \\ 
    \hline
  \multicolumn{1}{c|}{no} & 3 & 6 & 9 \\
    \hline
     \multicolumn{1}{c|}{} & 10 & 10 & 20
  \end{tabular}
\end{center}	

:::

$$
\begin{aligned}
\widehat{OR} &= \frac{7 \times 6}{4 \times 3} = 3.5 \\
\widehat{RR} &= \frac{7/11}{3/9} \approx 1.91
\end{aligned}
$$
Not close at all!


## The Odds Ratio {.t}
\framesubtitle{But, if the disease is rare...}

::: {.t}
### Another little example:
\vspace{-2mm}
Again suppose these are cross-sectional data (so the relative risk is valid to calculate):
\begin{center}
  \begin{tabular}{ c c|c|c }
	   & \multicolumn{2}{c}{disease} \\
	    \multicolumn{1}{c|}{exposure} & yes & no & totals \\ 
    \hline
      \multicolumn{1}{c|}{yes} & 5 & 995 & 1000 \\ 
    \hline
  \multicolumn{1}{c|}{no} & 3 & 997 & 1000 \\
    \hline
     \multicolumn{1}{c|}{} & 8 & 1992 & 2000
  \end{tabular}
\end{center}	
:::


$$
\begin{aligned}
\widehat{OR} &= \frac{5 \times 997}{995 \times 3} \approx 1.670 \\
\widehat{RR} &= \frac{5/1000}{3/1000} \approx 1.667
\end{aligned}
$$

Very close!
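
## The Odds Ratio {.t}
\framesubtitle{Both little examples in R}

The two comparisons can be reproduced with a small helper. This is a sketch that assumes cross-sectional counts laid out as on the slides; `or_rr()`, `common`, and `rare` are made-up names.

```{r, size="footnotesize"}
# OR and RR from a 2x2 matrix of counts:
# rows = exposure (yes, no), columns = disease (yes, no)
or_rr <- function(tab) {
  risk <- tab[, 1] / rowSums(tab)  # P(disease) within each exposure row
  c(OR = (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1]),
    RR = risk[1] / risk[2])
}

common <- matrix(c(7, 4, 3, 6),     nrow = 2, byrow = TRUE)
rare   <- matrix(c(5, 995, 3, 997), nrow = 2, byrow = TRUE)

round(or_rr(common), 3)
round(or_rr(rare), 3)
```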

## The Odds Ratio {.t}
\framesubtitle{For rare outcomes, $OR \approx RR$}

Note that:

$$
OR = \frac{\frac{p_1}{1-p_1}}{\frac{p_2}{1-p_2}} \qquad \text{vs.} \qquad  RR = \frac{p_1}{p_2} 
$$

What happens when $p_1$ and $p_2$ are very small?
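
One way to see the answer is to rearrange the OR so that the RR appears as a factor (taking $p_1$ and $p_2$ as in the definition of the relative risk):

$$
OR = \frac{p_1}{p_2} \times \frac{1-p_2}{1-p_1} = RR \times \frac{1-p_2}{1-p_1}
$$

When $p_1$ and $p_2$ are both close to 0, the second factor is close to 1, so $OR \approx RR$. In the rare-disease example, that factor is $0.997/0.995 \approx 1.002$.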

## Recap and Looking Ahead

::: {.t}
### Recap
\vspace{-2mm}

 - The relative risk (RR) and odds ratio (OR) are both valid measures of association for binary outcome variables
 - The RR can only be calculated for cross-sectional and cohort study data
 - The OR can be calculated for case-control study data
 - The OR is symmetric, and is also a good approximation of the RR when the outcome is rare
 
:::

\vspace{3mm}

::: {.t}
### Looking Ahead
\vspace{-2mm}
Logistic Regression
:::

\vspace{3mm}

::: {.t}
### Today's Daily Check
\vspace{-2mm}
The two Your Turns (one coding, one math)