Chapter 4 Simulation-Based Hypothesis Tests

4.1 Introduction to Hypothesis Testing via Simulation

4.1.1 Difference between North and South?

 ggplot(data=FloridaLakes, aes(x=Location, y=AvgMercury, fill=Location)) +
   geom_boxplot() + geom_jitter() + ggtitle("Mercury Levels in Florida Lakes") +
   xlab("Location") + ylab("Mercury Level") +
   theme(axis.text.x = element_text(angle = 90)) + coord_flip()

 FloridaLakes %>% group_by(Location) %>%
   summarize(MeanHg=mean(AvgMercury), StDevHg=sd(AvgMercury), N=n())
## # A tibble: 2 x 4
##   Location MeanHg StDevHg     N
## 1 N         0.425   0.270    33
## 2 S         0.696   0.384    20

4.1.2 Model for Lakes Example

4.1.3 Model for Lakes R Output

 Lakes_M <- lm(data=FloridaLakes, AvgMercury ~ Location)
 summary(Lakes_M)
##
## Call:
## lm(formula = AvgMercury ~ Location, data = FloridaLakes)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.65650 -0.23455 -0.08455  0.24350  0.67545
##
## Coefficients:
##             Estimate Std. Error t value       Pr(>|t|)
## (Intercept)  0.42455    0.05519   7.692 0.000000000441 ***
## LocationS    0.27195    0.08985   3.027        0.00387 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3171 on 51 degrees of freedom
## Multiple R-squared:  0.1523, Adjusted R-squared:  0.1357
## F-statistic: 9.162 on 1 and 51 DF,  p-value: 0.003868
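With a two-level categorical predictor, the intercept is the mean of the baseline group and the slope is the difference between the two group means. In the lakes output, 0.27195 agrees with 0.696 - 0.425 from the summary table, up to rounding. The sketch below checks this identity on a small made-up data frame; `toy` and its values are hypothetical and are not the FloridaLakes data.

```r
# Hypothetical toy data, for illustration only (not the FloridaLakes data)
toy <- data.frame(
  Location   = factor(c("N", "N", "N", "S", "S", "S")),
  AvgMercury = c(0.2, 0.4, 0.6, 0.5, 0.9, 1.0)
)

toy_M <- lm(AvgMercury ~ Location, data = toy)

# Group means computed directly from the data
mean_N <- mean(toy$AvgMercury[toy$Location == "N"])
mean_S <- mean(toy$AvgMercury[toy$Location == "S"])

b0 <- unname(coef(toy_M)["(Intercept)"])  # mean of the baseline group (N)
b1 <- unname(coef(toy_M)["LocationS"])    # mean of S minus mean of N

all.equal(b0, mean_N)           # TRUE
all.equal(b1, mean_S - mean_N)  # TRUE
```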

4.1.4 Interpreting Lakes Regression Output

4.1.5 Differences between Northern and Southern Lakes?

Key Question:

4.1.6 Investigation by Simulation

We can answer the key question using simulation.

We’ll simulate situations where there is no relationship between location and mercury level, and see how often we observe a value of \(b_1\) as extreme as 0.27195.

Procedure:

  1. Randomly shuffle the locations of the lakes, so that any relationship between location and mercury level is due only to chance.
  2. Calculate the difference in mean mercury levels (i.e. value of \(b_1\) ) in “Northern” and “Southern” lakes, using the shuffled data.
  3. Repeat steps 1 and 2 many (say 10,000) times, recording the difference in means (i.e. value of \(b_1\) ) each time.
  4. Analyze the distribution of mean differences, simulated under the assumption that there is no relationship between location and mercury level. Check whether the difference we actually observed is consistent with the simulation results.
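The four steps above can be sketched compactly as follows. This is a minimal illustration under stated assumptions, not the code used for the lakes analysis: `toy` is a small made-up data frame, `diff_means()` is a hypothetical helper, and the difference in group means is computed directly rather than by refitting `lm()` (the two agree for a two-level predictor). The number of shuffles is kept small so the sketch runs quickly.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical toy data, for illustration only (not the FloridaLakes data)
toy <- data.frame(
  Location   = c(rep("N", 6), rep("S", 4)),
  AvgMercury = c(0.2, 0.4, 0.3, 0.7, 0.1, 0.6, 0.9, 0.5, 1.1, 0.8)
)

# Difference in group means (S minus N); equals b1 from lm(AvgMercury ~ Location)
diff_means <- function(y, group) {
  mean(y[group == "S"]) - mean(y[group == "N"])
}

observed <- diff_means(toy$AvgMercury, toy$Location)

# Steps 1-3: shuffle the labels, recompute the difference, repeat
n_sims <- 2000
sim_diffs <- replicate(n_sims, {
  shuffled <- sample(toy$Location)        # step 1: shuffle the locations
  diff_means(toy$AvgMercury, shuffled)    # step 2: difference on shuffled data
})

# Step 4: proportion of shuffles more extreme than the observed difference
p_value <- mean(abs(sim_diffs) > abs(observed))
```

Shuffling only the labels keeps the group sizes and the set of mercury values fixed, so every simulated difference is one that could have arisen if location were unrelated to mercury level.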

4.1.7 First Lakes Shuffle Simulation

 ShuffledLakes <- FloridaLakes ## create copy of dataset
 ShuffledLakes$Location <- ShuffledLakes$Location[sample(1:nrow(ShuffledLakes))]

Lake          Location  AvgMercury  Shuffled Location
Alligator     S         1.23        N
Annie         S         1.33        N
Apopka        N         0.04        N
Blue Cypress  S         0.44        N
Brick         S         1.20        N
Bryant        N         0.27        S

4.1.8 First Shuffle Model Results

Recall this model was fit under an assumption of no relationship between location and average mercury.

 M_Lakes_Shuffle <- lm(data=ShuffledLakes, AvgMercury~Location)
 summary(M_Lakes_Shuffle)
##
## Call:
## lm(formula = AvgMercury ~ Location, data = ShuffledLakes)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.4891 -0.2591 -0.0440  0.2460  0.8009
##
## Coefficients:
##              Estimate Std. Error t value         Pr(>|t|)
## (Intercept)  0.529091   0.059944   8.826 0.00000000000761 ***
## LocationS   -0.005091   0.097582  -0.052            0.959
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3444 on 51 degrees of freedom
## Multiple R-squared:  5.336e-05, Adjusted R-squared:  -0.01955
## F-statistic: 0.002722 on 1 and 51 DF,  p-value: 0.9586

4.1.9 Second Lakes Shuffle Simulation

 ShuffledLakes <- FloridaLakes ## create copy of dataset
 ShuffledLakes$Location <- ShuffledLakes$Location[sample(1:nrow(ShuffledLakes))]
 kable(head(Shuffle1df))

Lake          Location  AvgMercury  Shuffled Location
Alligator     S         1.23        N
Annie         S         1.33        N
Apopka        N         0.04        N
Blue Cypress  S         0.44        N
Brick         S         1.20        N
Bryant        N         0.27        S

4.1.10 Second Shuffle Model Results

Recall this model was fit under an assumption of no relationship between location and average mercury.

 M_Lakes_Shuffle <- lm(data=ShuffledLakes, AvgMercury~Location)
 summary(M_Lakes_Shuffle)
##
## Call:
## lm(formula = AvgMercury ~ Location, data = ShuffledLakes)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.51485 -0.27485 -0.05485  0.27515  0.77515
##
## Coefficients:
##             Estimate Std. Error t value         Pr(>|t|)
## (Intercept)  0.55485    0.05961   9.308 0.00000000000141 ***
## LocationS   -0.07335    0.09704  -0.756            0.453
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3425 on 51 degrees of freedom
## Multiple R-squared:  0.01108, Adjusted R-squared:  -0.008313
## F-statistic: 0.5713 on 1 and 51 DF,  p-value: 0.4532

4.1.11 Third Lakes Shuffle Simulation

 ShuffledLakes <- FloridaLakes ## create copy of dataset
 ShuffledLakes$Location <- ShuffledLakes$Location[sample(1:nrow(ShuffledLakes))]
 kable(head(Shuffle1df))

Lake          Location  AvgMercury  Shuffled Location
Alligator     S         1.23        N
Annie         S         1.33        N
Apopka        N         0.04        N
Blue Cypress  S         0.44        N
Brick         S         1.20        N
Bryant        N         0.27        N

4.1.12 Third Shuffle Model Results

Recall this model was fit under an assumption of no relationship between location and average mercury.

 M_Lakes_Shuffle <- lm(data=ShuffledLakes, AvgMercury~Location)
 summary(M_Lakes_Shuffle)
##
## Call:
## lm(formula = AvgMercury ~ Location, data = ShuffledLakes)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.56200 -0.26200 -0.08182  0.22818  0.74818
##
## Coefficients:
##             Estimate Std. Error t value        Pr(>|t|)
## (Intercept)  0.48182    0.05905    8.16 0.0000000000818 ***
## LocationS    0.12018    0.09612    1.25           0.217
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3392 on 51 degrees of freedom
## Multiple R-squared:  0.02974, Adjusted R-squared:  0.01072
## F-statistic: 1.563 on 1 and 51 DF,  p-value: 0.2169

4.1.13 Fourth Lakes Shuffle Simulation

 ShuffledLakes <- FloridaLakes ## create copy of dataset
 ShuffledLakes$Location <- ShuffledLakes$Location[sample(1:nrow(ShuffledLakes))]
 kable(head(Shuffle1df))

Lake          Location  AvgMercury  Shuffled Location
Alligator     S         1.23        N
Annie         S         1.33        S
Apopka        N         0.04        S
Blue Cypress  S         0.44        N
Brick         S         1.20        N
Bryant        N         0.27        N

4.1.14 Fourth Shuffle Model Results

Recall this model was fit under an assumption of no relationship between location and average mercury.

 M_Lakes_Shuffle <- lm(data=ShuffledLakes, AvgMercury~Location)
 summary(M_Lakes_Shuffle)
##
## Call:
## lm(formula = AvgMercury ~ Location, data = ShuffledLakes)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.5261 -0.2730 -0.0630  0.2470  0.7639
##
## Coefficients:
##             Estimate Std. Error t value         Pr(>|t|)
## (Intercept)  0.56606    0.05929   9.548 0.00000000000061 ***
## LocationS   -0.10306    0.09651  -1.068            0.291
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3406 on 51 degrees of freedom
## Multiple R-squared:  0.02187, Adjusted R-squared:  0.002691
## F-statistic: 1.14 on 1 and 51 DF,  p-value: 0.2906

4.1.15 Fifth Lakes Shuffle Simulation

 ShuffledLakes <- FloridaLakes ## create copy of dataset
 ShuffledLakes$Location <- ShuffledLakes$Location[sample(1:nrow(ShuffledLakes))]
 kable(head(Shuffle1df))

Lake          Location  AvgMercury  Shuffled Location
Alligator     S         1.23        N
Annie         S         1.33        N
Apopka        N         0.04        N
Blue Cypress  S         0.44        S
Brick         S         1.20        S
Bryant        N         0.27        N

4.1.16 Fifth Shuffle Model Results

Recall this model was fit under an assumption of no relationship between location and average mercury.

 M_Lakes_Shuffle <- lm(data=ShuffledLakes, AvgMercury~Location)
 summary(M_Lakes_Shuffle)
##
## Call:
## lm(formula = AvgMercury ~ Location, data = ShuffledLakes)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.50455 -0.24850 -0.05455  0.22545  0.78545
##
## Coefficients:
##             Estimate Std. Error t value         Pr(>|t|)
## (Intercept)  0.54455    0.05981   9.104 0.00000000000286 ***
## LocationS   -0.04605    0.09737  -0.473            0.638
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3436 on 51 degrees of freedom
## Multiple R-squared:  0.004366, Adjusted R-squared:  -0.01516
## F-statistic: 0.2236 on 1 and 51 DF,  p-value: 0.6383

4.1.17 Code for Simulation Investigation

 b1 <- Lakes_M$coef[2] ## record value of b1 from actual data

 ## perform simulation
 b1Sim <- rep(NA, 10000) ## vector to hold results
 ShuffledLakes <- FloridaLakes ## create copy of dataset
 for (i in 1:10000){
   #randomly shuffle locations
   ShuffledLakes$Location <- ShuffledLakes$Location[sample(1:nrow(ShuffledLakes))]
   ShuffledLakes_M <- lm(data=ShuffledLakes, AvgMercury ~ Location) #fit model to shuffled data
   b1Sim[i] <- ShuffledLakes_M$coef[2] ## record b1 from shuffled model
 }
 NSLakes_SimulationResults <- data.frame(b1Sim) #save results in dataframe

4.1.18 Simulation Results

 NSLakes_SimulationResultsPlot <- ggplot(data=NSLakes_SimulationResults, aes(x=b1Sim)) +
   geom_histogram(fill="lightblue", color="white") +
   geom_vline(xintercept=c(b1, -1*b1), color="red") +
   xlab("Lakes: Simulated Value of b1") + ylab("Frequency") +
   ggtitle("Distribution of b1 under assumption of no relationship")
 NSLakes_SimulationResultsPlot

It appears unlikely that we would observe a value of \(b_1\) as extreme as 0.27195 ppm by chance, if there is really no relationship between location and mercury level.

4.1.19 Conclusions

Number of simulations resulting in a simulated value of \(b_1\) more extreme (in absolute value) than 0.27195:

 sum(abs(b1Sim) > abs(b1))
## [1] 31

Proportion of simulations resulting in a simulated value of \(b_1\) more extreme (in absolute value) than 0.27195:

 mean(abs(b1Sim) > abs(b1))
## [1] 0.0031

The probability of observing a value of \(b_1\) as extreme as 0.27195 by chance, when there is no relationship between location and mercury level, is very low (about 0.0031 in this simulation).

There is strong evidence of a relationship between location and mercury level. In this case, there is strong evidence that mercury level is higher in Southern lakes than in Northern lakes.
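Because the proportion 0.0031 comes from a finite number of random shuffles, it is itself an estimate and would change slightly if the simulation were rerun. A rough sketch of how much, treating each shuffle as an independent yes/no trial (a standard binomial approximation, not something computed in the analysis above; `n_sims`, `n_extreme`, `p_hat`, and `mc_se` are names introduced here):

```r
n_sims    <- 10000   # number of shuffles in the simulation
n_extreme <- 31      # shuffles with |b1| more extreme than observed

p_hat <- n_extreme / n_sims                    # simulated p-value
mc_se <- sqrt(p_hat * (1 - p_hat) / n_sims)    # approximate Monte Carlo standard error

p_hat   # 0.0031
mc_se   # roughly 0.00056
```

Under this approximation, reruns would typically land within about 0.001 of 0.0031, which does not change the conclusion.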

4.1.20 Recall Lakes Bootstrap for Difference

##       2.5%      97.5%
## 0.08095682 0.46122992
 NS_Lakes_Bootstrap_Plot_b1 + geom_vline(xintercept = c(q.025, q.975), color="red")

We are 95% confident that the average mercury level in Southern lakes is between 0.08 and 0.46 ppm higher than in Northern lakes.

The fact that the interval does not contain 0 is consistent with the hypothesis test.
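As a reminder of where such an interval comes from, here is a minimal sketch of a percentile bootstrap for a difference in means. It is not the earlier lakes code: `toy` is simulated stand-in data, `b1Boot` and `n_boot` are names chosen here, and only `q.025` and `q.975` echo the names used in the plot call above.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

# Simulated stand-in data, for illustration only (not the FloridaLakes data)
toy <- data.frame(
  Location   = c(rep("N", 30), rep("S", 20)),
  AvgMercury = c(rnorm(30, mean = 0.4, sd = 0.3), rnorm(20, mean = 0.7, sd = 0.4))
)

n_boot <- 2000
b1Boot <- replicate(n_boot, {
  boot <- toy[sample(nrow(toy), replace = TRUE), ]   # resample rows with replacement
  mean(boot$AvgMercury[boot$Location == "S"]) -
    mean(boot$AvgMercury[boot$Location == "N"])      # difference in means (b1)
})

# Middle 95% of the bootstrap distribution
q.025 <- quantile(b1Boot, 0.025)
q.975 <- quantile(b1Boot, 0.975)
c(q.025, q.975)
```

An interval that excludes 0 usually points the same way as a two-sided test at the 5% level, although a percentile interval and a shuffle-based test are not guaranteed to agree exactly.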

4.1.21 Hypothesis Testing Terminology

We can think of the simulation as a test of the following hypotheses:

Hypothesis 1: There is no relationship between location and mercury level. (Thus, the difference of 0.27 we observed in our data occurred just by chance.)

Hypothesis 2: The difference we observed did not occur by chance (suggesting a relationship between location and mercury level).

The “no relationship,” or “chance alone” hypothesis is called the null hypothesis. The other hypothesis is called the alternative hypothesis.
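In symbols, writing \(\beta_1\) for the true difference in mean mercury level between Southern and Northern lakes (a symbol introduced here as an assumption; the text above uses only the sample estimate \(b_1\)), the two hypotheses can be stated as:

```latex
\[
H_0: \beta_1 = 0 \qquad \text{versus} \qquad H_a: \beta_1 \neq 0
\]
```

The alternative is two-sided, which matches the use of abs() when counting extreme simulated values.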

4.1.22 Hypothesis Testing Terminology (Continued)

We used \(b_1\) to measure the difference in mean mercury levels between the locations in our observed data.

We found that the probability of observing a difference as extreme as 0.27 when Hypothesis 1 is true is very low (approximately 0.0031).