In this notebook, we will show how to solve a variety of statistical analysis problems using R.

The selection of problems is not intended to be complete, but it does provide a glimpse of the myriad ways in which data analysis can be approached with R.

Do not hesitate to consult online resources for more examples (StackOverflow, R-bloggers, etc.).

TUTORIAL OUTLINE

  1. Sampling (data scientists, cities)
  2. Random Variables and Distributions (uniform, normal, Poisson, binomial, log-normal, exponential, Gamma, joint normal, hospital night shifts)
  3. Central Limit Theorem (freight elevator)
  4. Regression (auto parts, mileage, Boston, ANOVA, hydrocarbons)
  5. Descriptive Statistics (midterm results)
  6. Confidence Intervals (interpretation, bootstrapping)
  7. Hypothesis Testing (artificial examples, teaching, hydrocarbons)
  8. Misc. (Monty Hall Problem)

1. SAMPLING

In this section, we illustrate how to compute simple random sample (SRS) and stratified sample (StS) estimates from scratch.

1.1 Conceptual Questions

You are charged with estimating the yearly salary of data scientists in Canada. Identify potential:

  • populations (target, study, respondent, sampling frames)

  • samples (intended, achieved)

  • unit information (unit, response variate, population attribute)

  • sources of bias (coverage, nonresponse, sampling, measurement) and variability (sampling, measurement).

Solution: here are some possible answers.

  • Target Population: all data scientists in Canada (is that a well-defined population?)

  • Study Population: this one poses a big challenge as there is no professional association of data scientists in Canada which is likely to contain even a fair number of data scientists (at least, none that we are aware/members of). Perhaps we could use membership lists of the Statistical Society of Canada (SSC)? Or scrape social platforms like LinkedIn for Canadian data scientists?

  • Sampling Frame: SSC membership directory, or the scraped LinkedIn data.

  • Intended Sample: a sample of SSC members or LinkedIn account holders who have been identified as data scientists.

  • Respondent Population: those SSC members or LinkedIn-identified data scientists who would respond if selected to participate.

  • Unit: a data scientist.

  • Response Variate: the salary of a data scientist.

  • Population Attribute: the average salary of all data scientists in the population.

  • Coverage Bias: in terms of salary, SSC members or LinkedIn-identified data scientists might not be representative of all data scientists in Canada.

  • Nonresponse Bias: in terms of salary, SSC members or LinkedIn-identified data scientists who would respond if selected might not be representative of all data scientists in Canada.

  • Sampling Bias: for those SSC members or LinkedIn-identified data scientists who would respond if selected, it could be that those that are sampled are not representative (in terms of salary) of all who would have responded if selected.

  • Measurement Bias: for some or all of the SSC members or LinkedIn-identified data scientists who participated in the study, the actual salaries might not have been assessed correctly.

  • Sampling Variability: different samples of SSC members or LinkedIn-identified data scientists who would respond if selected might produce different results in terms of salary.

  • Measurement Variability: for those SSC members or LinkedIn-identified data scientists who participated in the survey, repeatedly asking an individual about their salary could produce a number of different responses.

1.2 Cities

The file cities.csv contains population information for a country’s cities. A city is classified as “small” if its population is below 75K, as “medium” if it falls between 75K and 1M, and as “large” otherwise.

  1. Locate and load the file into the workspace of your choice. How many cities are there? How many in each group?
  2. Display summary population statistics for the cities, both overall and by group.
  3. Compute a \(95\%\) C.I. for the 1999 population mean using a simple random sample (SRS) of size \(n=10\). Compare the estimates with the true value. Are the results surprising? If not, could they have been?
  4. Compute a \(95\%\) C.I. for the 1999 population mean using a simple stratified sample (StS) of size \((n_s,n_m,n_l)=(5,3,2)\). Compare the estimates with the true value. Are the results surprising? If not, could they have been?

Solutions:

  1. The file is located in the Data folder. It can be loaded using the regular functionality, and investigated at a high-level using the str() and summary() calls. A simple visualization is produced using ggplot2’s qplot().
cities <- read.csv('Data/cities.csv',sep=",") 
str(cities)
## 'data.frame':    37 obs. of  3 variables:
##  $ category       : Factor w/ 3 levels "large","medium",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ population_1999: int  4590 5210 6030 6580 7300 7940 7960 8430 8750 9080 ...
##  $ population_2019: int  5090 7090 7710 8120 7320 6780 10050 11060 12140 12480 ...
summary(cities)
##    category  population_1999   population_2019  
##  large : 2   Min.   :   4590   Min.   :   5090  
##  medium: 6   1st Qu.:   9080   1st Qu.:  12480  
##  small :29   Median :  29500   Median :  31160  
##              Mean   : 201091   Mean   : 249796  
##              3rd Qu.:  43360   3rd Qu.:  52250  
##              Max.   :3009490   Max.   :3742940
library(ggplot2)
qplot(population_1999,population_2019,data=cities)

Evidently, the dataset consists of the populations of 37 cities, in 1999 and in 2019; 2 cities are classified as large, 6 as medium, and 29 as small. Does this make sense to you? Are the classifications constant over the 20-year interval?
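We can also re-derive the classification directly from the rule in the problem statement; here is a quick sketch using cut() (note that the labelling of cities sitting exactly on the 75K or 1M boundaries depends on the interval-closure convention):

# re-create the categories from the 1999 populations
check <- cut(cities$population_1999, breaks=c(0,75000,1000000,Inf), labels=c("small","medium","large"))
table(check, cities$category) # compare with the stored labels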

  2. Summary statistics can be gleaned from the describe() function (loaded with the psych package).
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked _by_ '.GlobalEnv':
## 
##     cities
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(cities[,2:3])
##                 vars  n     mean       sd median  trimmed      mad  min
## population_1999    1 37 201090.8 666239.4  29500 41693.23 30274.69 4590
## population_2019    2 37 249796.5 838340.9  31160 49518.39 28199.05 5090
##                     max   range skew kurtosis       se
## population_1999 3009490 3004900 3.74    12.49 109529.1
## population_2019 3742940 3737850 3.75    12.48 137822.4

The statistics by groups can be obtained by first separating the data into 3 subsets (that’s not the only way to do this, of course; a tidyverse alternative is sketched after the subsets below, and we’ll re-visit this at a later stage in this module):

cities_large <- subset(cities, category=="large") # note the "==", instead of simply "="
summary(cities_large)
##    category population_1999   population_2019  
##  large :2   Min.   :2869350   Min.   :3652020  
##  medium:0   1st Qu.:2904385   1st Qu.:3674750  
##  small :0   Median :2939420   Median :3697480  
##             Mean   :2939420   Mean   :3697480  
##             3rd Qu.:2974455   3rd Qu.:3720210  
##             Max.   :3009490   Max.   :3742940
describe(cities_large[,2:3])
##                 vars n    mean       sd  median trimmed      mad     min
## population_1999    1 2 2939420 99093.94 2939420 2939420 103885.8 2869350
## population_2019    2 2 3697480 64290.15 3697480 3697480  67399.0 3652020
##                     max  range skew kurtosis    se
## population_1999 3009490 140140    0    -2.75 70070
## population_2019 3742940  90920    0    -2.75 45460
cities_medium <- subset(cities, category=="medium")
summary(cities_medium)
##    category population_1999  population_2019 
##  large :0   Min.   : 80720   Min.   : 81750  
##  medium:6   1st Qu.: 93658   1st Qu.:107170  
##  small :0   Median :148265   Median :176750  
##             Mean   :152875   Mean   :179953  
##             3rd Qu.:195575   3rd Qu.:244448  
##             Max.   :253200   Max.   :293480
describe(cities_medium[,2:3])
##                 vars n     mean       sd median  trimmed       mad   min
## population_1999    1 6 152875.0 70933.87 148265 152875.0  78459.19 80720
## population_2019    2 6 179953.3 90642.84 176750 179953.3 102751.59 81750
##                    max  range skew kurtosis       se
## population_1999 253200 172480 0.19    -1.97 28958.63
## population_2019 293480 211730 0.07    -2.14 37004.78
cities_small <- subset(cities, category=="small")
summary(cities_small)
##    category  population_1999 population_2019
##  large : 0   Min.   : 4590   Min.   : 5090  
##  medium: 0   1st Qu.: 8430   1st Qu.:11060  
##  small :29   Median :20520   Median :25000  
##              Mean   :22216   Mean   :26476  
##              3rd Qu.:30900   3rd Qu.:37120  
##              Max.   :58080   Max.   :63730
describe(cities_small[,2:3])
##                 vars  n     mean       sd median trimmed      mad  min
## population_1999    1 29 22216.21 14378.98  20520 21321.2 17924.63 4590
## population_2019    2 29 26475.52 16402.64  25000 25597.6 20667.44 5090
##                   max range skew kurtosis      se
## population_1999 58080 53490 0.56    -0.68 2670.11
## population_2019 63730 58640 0.41    -0.96 3045.89
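As an aside, the by-group summaries can also be obtained without creating explicit subsets; a tidyverse sketch (assuming the dplyr package is available):

library(dplyr)
cities %>% 
  group_by(category) %>% 
  summarise(n=n(), mean_1999=mean(population_1999), sd_1999=sd(population_1999))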
  3. Let’s start by picking a sample of size \(n=10\) from the data.
set.seed(222)# fixing the seed for replicability purposes
N=dim(cities)[1]
n=10
(cities_SRS = cities[sample(1:N,n, replace=FALSE),])
##    category population_1999 population_2019
## 15    small           20520           27710
## 18    small           21640           25000
## 23    small           35520           45800
## 24    small           35520           48280
## 22    small           30900           34100
## 26    small           40940           49040
## 20    small           29830           39120
## 9     small            8750           12140
## 10    small            9080           12480
## 30   medium           80720           81750

For each year, the SRS estimate for \(\mu\) is given simply by computing the mean of the population in the sample:

(mean_SRS=sapply(Filter(is.numeric, cities_SRS), mean))
## population_1999 population_2019 
##           31342           37542

Under SRS, the bound on the error is given by \[B=2\sqrt{\frac{s^2}{n}\left(1-\frac{n}{N}\right)},\] where \(s\) is the standard deviation of the sample, \(n\) the sample size, and \(N\) the population size:

(sd_SRS=sapply(Filter(is.numeric, cities_SRS), sd)) 
## population_1999 population_2019 
##        20507.29        20579.72
(bound_SRS = 2*sqrt(sd_SRS^2/n*(1-n/N)))
## population_1999 population_2019 
##        11079.48        11118.61

Finally, the \(95\%\) SRS C.I. is given by mean_SRS ± bound_SRS, which yields:

(CI_SRS = cbind(mean_SRS-bound_SRS,mean_SRS+bound_SRS))
##                     [,1]     [,2]
## population_1999 20262.52 42421.48
## population_2019 26423.39 48660.61

That is, we would expect the mean population of the cities to fall between the lower bound and the upper bound obtained this way roughly 19 times out of every 20 times we sampled the data using SRS.

There is just one problem: the true means were \(201,091\) in 1999 and \(249,796\) in 2019, so they are not, in fact, in the confidence intervals!

This is due, of course, to the specific sample that was selected, which does not contain a large city. If we try with set.seed(2) instead of set.seed(222) above, we obtain a sample with a large city, and the true population means are indeed in the corresponding \(95\%\) C.I. (there are even some samples that give negative lower bounds).
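We can also put the “19 times out of 20” interpretation to the test; the following simulation sketch (not part of the original solution) re-uses N and n from above to estimate the empirical coverage of the SRS interval:

# draw many SRS samples and record how often the 95% C.I. captures the true mean
true_mean = mean(cities$population_1999)
covered = replicate(10000, {
  s = cities$population_1999[sample(1:N,n, replace=FALSE)]
  B = 2*sqrt(var(s)/n*(1-n/N))
  abs(mean(s)-true_mean) <= B
})
mean(covered) # proportion of intervals containing the true mean

With data this skewed and samples of size \(n=10\), do not expect the empirical coverage to be anywhere near the nominal \(95\%\): the bound relies on the sampling distribution of the mean being roughly normal, which is a stretch here.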

  4. A tighter confidence interval can be built using stratified random sampling (StS); as advised, we shall pick a sample of size \(n=10\), but we will allocate it according to the following rule: \((n_s,n_m,n_l)=(5,3,2)\). This has the effect of guaranteeing that both large cities are selected in the sample. Note that we have \((N_s,N_m,N_l)=(29,6,2)\).
# compute the estimate and bound components from the "large" stratum
n_l=2
N_l=2
cities_large_StS = cities_large[sample(1:N_l,n_l, replace=FALSE),]
mean_large_SRS = sapply(Filter(is.numeric, cities_large_StS), mean)
sd_large_SRS = sapply(Filter(is.numeric, cities_large_StS), sd)
bound_large_SRS = 1/N^2*N_l^2*sd_large_SRS^2/n_l*(1-n_l/N_l)

# compute the estimate and bound components from the "medium" stratum
n_m=3
N_m=6
cities_medium_StS = cities_medium[sample(1:N_m,n_m, replace=FALSE),]
mean_medium_SRS = sapply(Filter(is.numeric, cities_medium_StS), mean)
sd_medium_SRS = sapply(Filter(is.numeric, cities_medium_StS), sd)
bound_medium_SRS = 1/N^2*N_m^2*sd_medium_SRS^2/n_m*(1-n_m/N_m)

# compute the estimate and bound components from the "small" stratum
n_s=5
N_s=29
cities_small_StS = cities_small[sample(1:N_s,n_s, replace=FALSE),]
mean_small_SRS = sapply(Filter(is.numeric, cities_small_StS), mean)
sd_small_SRS = sapply(Filter(is.numeric, cities_small_StS), sd)
bound_small_SRS = 1/N^2*N_s^2*sd_small_SRS^2/n_s*(1-n_s/N_s)

# compute the StS estimate and bound
(mean_StS = N_l/N*mean_large_SRS + N_m/N*mean_medium_SRS + N_s/N*mean_small_SRS)
## population_1999 population_2019 
##        199651.1        246210.5
bound_StS = 2*sqrt(bound_large_SRS + bound_medium_SRS + bound_small_SRS)

# derive the StS 95% CI
(CI_StS = cbind(mean_StS-bound_StS,mean_StS+bound_StS))
##                     [,1]     [,2]
## population_1999 185450.5 213851.7
## population_2019 232668.4 259752.6

The \(95\%\) C.I.s are much tighter (and, incidentally, provide estimates that are closer to the true values, although that is secondary to the tightness of the intervals).


2. RANDOM VARIABLES AND DISTRIBUTIONS

Write R code that lets you draw “random” samples from the various distributions discussed in the slide deck section on distributions.

2.1 Uniform \(U(a,b)\)

Uniformly distributed numbers can be generated by runif() (call ?runif at the prompt to get the function information).

The default call is runif(n), producing \(n\) uniformly distributed numbers on \([0,1]\).

runif(1) # generate a single number from U(0,1)
## [1] 0.5122694
(x<-runif(10))
##  [1] 0.24427790 0.95997506 0.72662884 0.47102506 0.42161913 0.90393694
##  [7] 0.11783592 0.67981691 0.02245412 0.53831510
hist(x)

We can also compare the sample mean and variance with the true (theoretical) mean and variance.

(mean_x = mean(x))
## [1] 0.5085885
(true_mean = (1+0)/2)
## [1] 0.5
(var_x = var(x))
## [1] 0.1006271
(true_var = 1/12*(1-0)^2)
## [1] 0.08333333

For uniform distributions with specific lower and upper bounds \(a\) and \(b\), we need to modify the call to runif().

a=-2
b=3
n=250
y<-runif(n,min=a,max=b) # generate a sample of size 250 from U(-2,3)
hist(y) # default histogram (wide bins)

brks = seq(a,b,0.1)
hist(y, breaks = brks)# histogram with more bins

What is the next block of code supposed to do?

(mean_y = mean(y))
## [1] 0.5166937
(true_mean = (a+b)/2)
## [1] 0.5
(var_y = var(y))
## [1] 1.927141
(true_var = 1/12*(b-a)^2)
## [1] 2.083333

2.2 Normal \(N(\mu,\sigma^2)\)

Normally distributed numbers can be generated by rnorm(), which accepts three parameters: n, mean, and sd. The default parameter values are mean=0 and sd=1.

rnorm(1) # generate a single number from N(0,1)
## [1] 0.3703478
z<-rnorm(500) # generate a sample of size 500 from N(0,1)
hist(z)

brks = seq(min(z),max(z),(max(z)-min(z))/20) # what does this line do?
hist(z, breaks = brks)

For normal distributions with specific mean \(\mu\) and standard deviation \(\sigma\), we need to modify the call to rnorm().

w<-rnorm(5000, sd=3, mean=-2)
mean(w)
## [1] -2.087508
sd(w)
## [1] 2.980652
brks = seq(min(w),max(w),(max(w)-min(w))/50) 
hist(w, breaks = brks)

2.3 Poisson \(P(\lambda)\)

You won’t be surprised to find out that things are pretty similar for the Poisson distribution \(P(\lambda)\), except that we use the function rpois(), with required parameters n and lambda.

rpois(1,lambda=13)
## [1] 19
u<-rpois(500,lambda=13)
head(u)
## [1]  8 11 15 11 11 11
mean(u)
## [1] 13.182
var(u)
## [1] 12.48986
hist(u)

2.4 Binomial \(B(n,p)\)

Would you believe that rbinom() does the trick? Parameters are n, size, and prob. You can probably guess what the next two blocks do.

rbinom(1, size=11, prob=0.2)
## [1] 1
v<- rbinom(1000,size=11, prob=0.2)
mean(v)
## [1] 2.144
var(v)
## [1] 1.654919
brks = min(v):max(v) 
hist(v, breaks = brks)

# let's try another one
v<- rbinom(1000,size=19, prob=0.7)
mean(v)
## [1] 13.348
var(v)
## [1] 3.966863
brks = min(v):max(v) 
hist(v, breaks = brks)

2.5 Log-Normal \(\Lambda(\mu,\sigma^2)\)

rlnorm()? rlnorm(). With parameters n, meanlog, and sdlog (whose default values are 0 and 1, respectively).

rlnorm(1)
## [1] 2.233227
s <- rlnorm(250,meanlog=0,sdlog=1)
hist(s)

2.6 Exponential \(E(r)\)

Exponentially distributed numbers are generated by rexp(), with required parameters n and rate.

rexp(1,100)
## [1] 0.004335029
q<-rexp(1000,100)
mean(q)
## [1] 0.009842063
var(q)
## [1] 0.0001046535
hist(q)

2.7 Gamma \(\Gamma(r,k)\)

Gamma distributed numbers are generated by rgamma(), with required parameters n, shape, and scale.

rgamma(1,shape=2,scale=1/3)
## [1] 0.2865575
q<-rgamma(1000,shape=2, scale=1/3)
mean(q)
## [1] 0.6807423
var(q)
## [1] 0.2262379
hist(q)

2.8 Joint Normal

We can generate a multivariate joint normal via MASS’s mvrnorm(), whose required parameters are n, a mean vector mu, and a covariance matrix Sigma.

# let's start with a standard bivariate joint normal
mu = rep(0,2)
Sigma = matrix(c(1,0,0,1),2,2)

We’ll sample 1000 observations from the joint normal \(N(\mu,\Sigma)\).

library(MASS)
a<-mvrnorm(1000,mu,Sigma)
a<-data.frame(a)
str(a)
## 'data.frame':    1000 obs. of  2 variables:
##  $ X1: num  -0.307 0.923 0.651 -2.111 -0.158 ...
##  $ X2: num  0.5828 -0.8599 -0.0108 -1.4353 -1.2544 ...
# let's plot the data... what would you expect to see here?
library(ggplot2)
library(hexbin)
qplot(X1, X2, data=a, geom="hex")

The covariance matrix was diagonal, so we expect the blob to be circular. What happens if we use a non-diagonal covariance matrix?

mu = c(-3,12)
Sigma = matrix(c(110,15,15,3),2,2)
a<-mvrnorm(1000,mu,Sigma)
a<-data.frame(a)
qplot(X1, X2, data=a, geom="hex") + ylim(-40,40) + xlim(-40,40)

2.9 Hospital Night Shift

A hospital needs to schedule night shifts in the maternity ward.

It is known that there are \(3000\) deliveries per year; if these happened randomly round the clock (is this a reasonable assumption?), we would expect \(1000\) deliveries between the hours of midnight and 8.00 a.m., a time when much of the staff is off-duty.

It is thus important to ensure that the night shift is sufficiently staffed to allow the maternity ward to cope with the workload on any particular night, or at least, on a high proportion of nights.

The average number of deliveries per night is \(\lambda = 1000/365.25\approx 2.74\).

  1. If the daily number \(X\) of night deliveries follows a Poisson process \(\mathcal{P}(\lambda)\), what is the probability of delivering \(x=0,1,2,\ldots\) babies on each night?

Solution: we are looking for \[f(x)=P(X=x)=\frac{\lambda^x\exp(-\lambda)}{x!},\] for \(x=0,1,2,\ldots\).

For a Poisson distribution, the probability mass values \(f(x)\) can be obtained using dpois() (for a general distribution, replace the r in the rxxxxx(...) random number generators by d: dxxxxx(...)).

lambda = 2.74 # Poisson distribution parameter
x=0:10 # in theory, goes up to infinity, but we've got to stop somewhere...
y=dpois(x,lambda) # pmf
z=ppois(x,lambda) # cmf

(prob.details=data.frame(x,y,z))
##     x            y          z
## 1   0 0.0645703469 0.06457035
## 2   1 0.1769227505 0.24149310
## 3   2 0.2423841682 0.48387727
## 4   3 0.2213775403 0.70525481
## 5   4 0.1516436151 0.85689842
## 6   5 0.0831007011 0.93999912
## 7   6 0.0379493202 0.97794844
## 8   7 0.0148544482 0.99280289
## 9   8 0.0050876485 0.99789054
## 10  9 0.0015489063 0.99943945
## 11 10 0.0004244003 0.99986385
# pmf plot
plot(x,y, type="h", col=2, main="Poisson PMF", xlab="x", ylab="f(x)=P(X=x)")
points(x,y, col=2)
abline(h=0, col=4)

# cmf  plot
plot(c(1,x),c(0,z), type="s", col=2, main="Poisson CMF", xlab="x", ylab="F(x)=P(X<=x)")
abline(h=0:1, col=4)

  2. If the maternity ward wants to prepare for the greatest possible traffic on \(80\%\) of the nights, how many deliveries should be expected?

Solution: we seek an \(x\) for which \[P(X\leq x-1)\leq 0.80\leq P(X\leq x);\] a visual display simplifies the task.

# cmf  plot
plot(c(1,x),c(0,z), type="s", col=2, main="Poisson CMF", xlab="x", ylab="F(x)=P(X<=x)")
abline(h=0:1, col=4)
abline(h=0.8, col=1)

The \(y=0.8\) line crosses the CMF at \(x=4\); let’s evaluate \(F(3)=P(X\leq 3)\) and \(F(4)=P(X\leq 4)\) to confirm that \(F(3)\leq 0.8 \leq F(4)\).

lambda=2.74
ppois(3,lambda)
## [1] 0.7052548
ppois(4,lambda)
## [1] 0.8568984

It is indeed the case. Thus, if the hospital prepares for \(4\) deliveries a night, they will be ready for the worst on at least \(80\%\) of the nights (closer to \(85.7\%\), actually).
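Incidentally, the Poisson quantile function returns this value directly, without any need to eyeball the plot:

qpois(0.80, lambda=2.74) # smallest x such that P(X <= x) >= 0.80; returns 4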

Note that this is different than asking how many deliveries are expected nightly (namely, \(E[X]=2.74\)).

  3. On how many nights in the year would \(5\) or more deliveries be expected?

Solution: we need to evaluate \[ 365.25\cdot P(X\ge 5)=365.25 (1-P(X\le 4)).\]

365.25*(1-ppois(4,2.74))
## [1] 52.26785

That is, roughly \(14\%\) of the nights.

  4. Over the course of one year, what is the greatest number of deliveries expected on any night?

Solution: we are looking for the largest value of \(x\) for which \(365.25\cdot P(X=x)\geq 1\) (if \(365.25\cdot P(X=x)<1\), then the probability of that number of deliveries is too low to expect that we would ever see it during the year).

lambda=2.74

# initializing vector
nights=c() 

# expected number of nights with each number of deliveries
for(j in 0:10){
  nights[j+1]=365.25*dpois(j,lambda)
}
rbind(0:10,nights)
##            [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]
##         0.00000  1.00000  2.00000  3.00000  4.00000  5.00000  6.00000
## nights 23.58432 64.62103 88.53082 80.85815 55.38783 30.35253 13.86099
##            [,8]     [,9]    [,10]      [,11]
##        7.000000 8.000000 9.000000 10.0000000
## nights 5.425587 1.858264 0.565738  0.1550122
# identify largest index
max(which(nights>1))-1
## [1] 8

Namely, \(x=8\). Indeed, for larger values of \(x\), \(365.25\cdot P(X=x)<1\).

lambda=2.74
365.25*dpois(8,lambda)
## [1] 1.858264
365.25*dpois(9,lambda)
## [1] 0.565738


3. CENTRAL LIMIT THEOREM

In this section, we illustrate how to use the Central Limit Theorem to answer questions.

3.1 Freight Elevators

A large freight elevator can transport a maximum of 9800 lbs. Suppose a load containing 49 boxes must be transported. From experience, the weight of boxes follows a distribution with mean \(\mu=205\) lbs and standard deviation \(\sigma=15\) lbs.

Estimate the probability that all 49 boxes can be safely loaded onto the freight elevator and transported.

Solution: we are given \(n=49\), \(\mu=205\), and \(\sigma=15\). Let’s assume further that the boxes all come from different sources (i.e., the boxes’ weights \(x_i\), \(i=1,\ldots,49\), are independent of one another).

To get a sense of the task’s feasibility, let’s simulate a few scenarios. You’ll notice that the problem makes no mention of the type of distribution followed by the weights.

For now, let us assume that the weights are normally distributed.

set.seed(0) # to ensure replicability; the seed only applies to the next call requiring the pseudo-random generator
x<-rnorm(49,mean=205,sd=15)

brks = seq(min(x),max(x),(max(x)-min(x))/10) 
hist(x, breaks = brks)

The elevator can transport up to 9800 lbs; the \(n=49\) boxes can be transported if their total weight \[T=49w=x_1+\cdots +x_{49},\] where \(w=\overline{x}\), is less than 9800 lbs. In mathematical terms, we’re interested in the probability \(P(T<9800)\).

For the sample x from above, we get:

(T<-sum(x))
## [1] 10066.36

and so that specific group of 49 boxes would be too heavy to carry in one trip. Perhaps we were simply unlucky – perhaps another group of boxes would have been light enough. Let’s try the same thing, but with a different group of boxes.

set.seed(999) 
(T=sum(rnorm(49,mean=205,sd=15)))
## [1] 9852.269

Close, but no cigar. However, two tries aren’t enough to establish a trend and to estimate \(P(T<9800)\).

We’ll write a little function to help us find an estimate. The idea is simple: if we were to try a large number of random combinations of 49 boxes, the proportion of the attempts for which the total weight \(T\) falls below 9800 is (hopefully?) going to approximate \(P(T<9800)\).

# see if you can figure out what kind of inputs these are meant to be, and what this code does
# running this cell will compile the function, but 
# it still needs to be called with appropriate inputs to provide an estimate for P(T<9800)

estimate_T.normal <- function(n, T.threshold, mean, sd, num.tries){
  a=0
  for(j in 1:num.tries){
    if(sum(rnorm(n,mean=mean,sd=sd))<T.threshold){
      a=a+1
    }
  }
  estimate_T.normal <- a/num.tries # return the proportion of tries below the threshold
}

We’ll try the experiment 10, 100, 1000, 10000, 100000, and 1000000 times.

(c(estimate_T.normal(49,9800,205,15,10),
estimate_T.normal(49,9800,205,15,100),
estimate_T.normal(49,9800,205,15,1000),
estimate_T.normal(49,9800,205,15,10000),
estimate_T.normal(49,9800,205,15,100000),
estimate_T.normal(49,9800,205,15,1000000)))
## [1] 0.00000 0.01000 0.00700 0.00990 0.00973 0.00975

We can’t say too much from such a simple set up, but it certainly seems as though we could expect success about \(1\%\) of the time.

That’s low, but perhaps that was because we assumed normality. What would happen if we used other distributions with the same characteristics, such as \(U(179.0192379,230.9807621)\) or \(\Lambda(5.320340142,0.005339673624)\)?

(How would you verify that these distributions indeed have the right characteristics? How would you determine the appropriate parameters?)
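One way to answer the parenthetical questions is to match the first two moments; a quick sketch using the standard moment formulas for the uniform and log-normal distributions:

# uniform U(a,b): mean (a+b)/2, standard deviation (b-a)/sqrt(12)
a=179.0192379; b=230.9807621
c((a+b)/2, (b-a)/sqrt(12))
# log-normal with parameters m and s2: mean exp(m+s2/2), variance (exp(s2)-1)*exp(2*m+s2)
m=5.320340142; s2=0.005339673624
c(exp(m+s2/2), sqrt((exp(s2)-1)*exp(2*m+s2)))
# both pairs should come out to (roughly) 205 and 15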

Let’s try with those distributions as well.

estimate_T.unif <- function(n, T.threshold, min, max, num.tries){
  a=0
  for(j in 1:num.tries){
    if(sum(runif(n,min=min,max=max))<T.threshold){
      a=a+1
    }
  }
  estimate_T.unif <- a/num.tries
}


(c(estimate_T.unif(49,9800,179.0192379,230.9807621,10), 
 estimate_T.unif(49,9800,179.0192379,230.9807621,100),
 estimate_T.unif(49,9800,179.0192379,230.9807621,1000),
 estimate_T.unif(49,9800,179.0192379,230.9807621,10000),
 estimate_T.unif(49,9800,179.0192379,230.9807621,100000),
 estimate_T.unif(49,9800,179.0192379,230.9807621,1000000)))
## [1] 0.000000 0.010000 0.008000 0.007900 0.010230 0.009613
estimate_T.lnorm <- function(n, T.threshold, meanlog, sdlog, num.tries){
  a=0
  for(j in 1:num.tries){
    if(sum(rlnorm(n,meanlog=meanlog,sdlog=sdlog))<T.threshold){
      a=a+1
    }
  }
  estimate_T.lnorm <- a/num.tries
}

(c(estimate_T.lnorm(49,9800,5.320340142,sqrt(0.005339673624),10), 
 estimate_T.lnorm(49,9800,5.320340142,sqrt(0.005339673624),100),
 estimate_T.lnorm(49,9800,5.320340142,sqrt(0.005339673624),1000),
 estimate_T.lnorm(49,9800,5.320340142,sqrt(0.005339673624),10000),
 estimate_T.lnorm(49,9800,5.320340142,sqrt(0.005339673624),100000),
 estimate_T.lnorm(49,9800,5.320340142,sqrt(0.005339673624),1000000)))
## [1] 0.000000 0.000000 0.006000 0.009500 0.009060 0.009184

Under all three distributions, it appears as though \(P(T<9800)\) converges to a value near \(1\%\), even though the three distributions are very different. That might be surprising at first glance, but it isn’t really, given the Central Limit Theorem (CLT).

In effect, we are interested in estimating \(P(T<9800)=P(w<9800/49)=P(w<200)\), where \(w\) is the mean weight of the boxes.

According to the CLT, the distribution of \(w\) is approximately normal with mean \(\mu=205\) and variance \(\sigma^2/n=15^2/49\), even if the weights themselves were not normally distributed.

By subtracting the mean of \(w\) and dividing by the standard deviation we obtain a new random variable \(z\) which is approximately the standard unit normal, i.e. \[P(w<200)\approx P\left(z<\frac{200-205}{15/7}\right).\]

(200-205)/(15/7)
## [1] -2.333333

Thus, \(P(w<200)\approx P(z<-2.33)\) and we need to find the probability that a standard normal random variable takes a value smaller than \(-2.33\).

This can be calculated with the pnorm() function:

pnorm(-2.33, mean=0, sd=1)
## [1] 0.009903076

Hence, \(P(T<9800)\approx 0.0099\), which means that it’s highly unlikely (\(\approx 1\%\) probability) that the 49 boxes can be transported in the elevator all at once.

What elevator threshold would be required to reach a probability of success of \(10\%\)? \(50\%\)? \(75\%\)?

The following routine approximates the probability in question without resorting to simulating the weights (that is, independently of the underlying distribution of weights) for given n, threshold, mean, and sd. Can you figure out what pnorm() is doing?

prob_T <- function(n,threshold,mean,sd){
  prob_T=pnorm((threshold/n - mean)/(sd/sqrt(n)),0,1)
}

max(which(prob_T(49,1:12000,205,15)<0.1))
## [1] 9910
max(which(prob_T(49,1:12000,205,15)<0.5))
## [1] 10044
max(which(prob_T(49,1:12000,205,15)<0.75))
## [1] 10115
plot((prob_T(49,1:12000,205,15)))
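Alternatively, the search over thresholds can be avoided altogether by inverting the normal approximation: \(P(T<t)=p\) yields \(t=n\mu+z_p\sigma\sqrt{n}\), where \(z_p\) is the standard normal quantile given by qnorm(). A sketch:

# solve for the threshold directly
threshold_T <- function(n,p,mean,sd){ n*mean + qnorm(p)*sd*sqrt(n) }
round(threshold_T(49,c(0.10,0.50,0.75),205,15)) # roughly 9910, 10045, 10116, matching the search above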


4. REGRESSION

In this section, we show how to run and interpret the results of linear regression models.

4.1 Auto Parts

A specific auto part is manufactured by a company once a month in lots that vary in size as demand fluctuates. The data below represent observations on lot size \(y\) and number of employee-hours of labor \(x\) for 10 recent production runs.

Fit a simple regression model \(y_i=\beta_0+\beta_1x_i+\varepsilon_i\), where \(E(\varepsilon_i)=0\), \(E(\varepsilon_i\varepsilon_j)=0\) for \(i\neq j\), and \(V(\varepsilon_i)=\sigma^2\), if the observations are: \[\begin{align*}\mathbf{Y}&=[73,50,128,170,87,108,135,69,148,132]^{\!\top} \\ \mathbf{x}&=[30,20,60,80,40,50,60,30,70,60]^{\!\top} \end{align*}\]

Solution: the line of best fit can be obtained via the design matrix \(\mathbf{X}=[\mathbf{1}\ \mathbf{x}]\), which allows us to compute the quantities required to solve the canonical equation: \((\mathbf{X}^{\!\top}\mathbf{X})^{-1}\), \(\mathbf{X}^{\!\top}\mathbf{Y}\), and \(\mathbf{\hat{\beta}}=(\mathbf{X}^{\!\top}\mathbf{X})^{-1}\mathbf{X}^{\!\top}\mathbf{Y}\).

X=cbind(matrix(rep(1,10),10,1),matrix(c(30,20,60,80,40,50,60,30,70,60),10,1))
Y=matrix(c(73,50,128,170,87,108,135,69,148,132),10,1)
t(X)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    1    1    1    1    1    1    1    1     1
## [2,]   30   20   60   80   40   50   60   30   70    60
t(Y)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]   73   50  128  170   87  108  135   69  148   132
beta=solve(t(X)%*%X)%*%t(X)%*%Y
beta
##      [,1]
## [1,]   10
## [2,]    2

The line of best fit (in the ordinary least square sense) is thus \(y=10+2x\).

data.set = data.frame(cbind(X[,2],Y))
colnames(data.set)<-c("X","Y")
str(data.set)
## 'data.frame':    10 obs. of  2 variables:
##  $ X: num  30 20 60 80 40 50 60 30 70 60
##  $ Y: num  73 50 128 170 87 108 135 69 148 132
plot(data.set)
abline(a=beta[1], b=beta[2], col="red")

Of course, no one computes the line of best fit explicitly in practice.

Once you’ve convinced yourself that the OLS framework is valid for your data, it makes much more sense to use the built-in functionality, such as is offered by lm().

It’s not the only such method; we’re presenting it mostly to hammer home the point that the R syntax for predictive models usually follows the same format:

  • a method produces an object (the model),

  • the method call requires the user to pass a model structure (y ~ predictors),

  • to specify the dataset, and

  • to provide any number of optional parameters.

best.fit = lm(Y~X, data=data.set)
best.fit
## 
## Call:
## lm(formula = Y ~ X, data = data.set)
## 
## Coefficients:
## (Intercept)            X  
##          10            2

We see quickly that the answer that was computed directly above is the same as the one provided by lm(), which is good news, but we can get more information via the summary() function.

summary(best.fit)
## 
## Call:
## lm(formula = Y ~ X, data = data.set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -3.0   -2.0   -0.5    1.5    5.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.00000    2.50294   3.995  0.00398 ** 
## X            2.00000    0.04697  42.583 1.02e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.739 on 8 degrees of freedom
## Multiple R-squared:  0.9956, Adjusted R-squared:  0.9951 
## F-statistic:  1813 on 1 and 8 DF,  p-value: 1.02e-10
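While we are at it, interval estimates for the coefficients come essentially for free: confint() extracts \(95\%\) C.I.s from the fitted object (a preview of Section 6).

confint(best.fit, level=0.95) # 95% C.I.s for the intercept and the slope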

4.2 Mileage

(Question 4.7.11 in An Introduction to Statistical Learning, with Applications in R, pp. 171-2.)

This problem involves the ISLR’s Auto dataset.

library(ISLR)
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
summary(Auto)
##       mpg          cylinders      displacement     horsepower   
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0  
##                                                                 
##      weight      acceleration        year           origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2978   Mean   :15.54   Mean   :75.98   Mean   :1.577  
##  3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                                
##                  name    
##  amc matador       :  5  
##  ford pinto        :  5  
##  toyota corolla    :  5  
##  amc gremlin       :  4  
##  amc hornet        :  4  
##  chevrolet chevette:  4  
##  (Other)           :365
  1. Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median.

Solution: there are many ways to do this; one such way is to use the tidyverse’s pipe operator %>%.

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✓ tibble  3.0.1     ✓ purrr   0.3.4
## ✓ tidyr   1.0.2     ✓ dplyr   0.8.5
## ✓ readr   1.3.1     ✓ stringr 1.4.0
## ✓ tibble  3.0.1     ✓ forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x psych::%+%()    masks ggplot2::%+%()
## x psych::alpha()  masks ggplot2::alpha()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::select() masks MASS::select()
Auto.mpg01 <- Auto %>% 
mutate(mpg01 = ifelse(mpg > median(mpg), 1, 0))

Auto.mpg01$mpg01 <- factor(Auto.mpg01$mpg01)
Auto.mpg01$origin <- factor(Auto.mpg01$origin)
str(Auto.mpg01)
## 'data.frame':    392 obs. of  10 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  $ mpg01       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
  2. Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01?

Solution: here are a number of charts for the Auto.mpg01 dataset.

We start with a scatterplot matrix that uses GGally’s ggscatmat().

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
ggscatmat(as.data.frame(Auto.mpg01), columns = 1:8, 
    color = "mpg01")
## Warning in ggscatmat(as.data.frame(Auto.mpg01), columns = 1:8, color =
## "mpg01"): Factor variables are omitted in plot

The correlations are easier to spot using the corrplot() function.

library(corrplot)
## corrplot 0.84 loaded
corrplot(cor(Auto.mpg01[,c(1:7)]),type="upper")

We could produce many more plots, but let’s stick to boxplots for now.

Auto.mpg01 <- Auto.mpg01[,-1] # drop the original mpg column, so it cannot be used as a predictor

Auto.mpg01 %>% 
ggplot(aes(x=mpg01, displacement)) + 
geom_boxplot() 

Auto.mpg01 %>% 
ggplot(aes(mpg01, horsepower)) + 
geom_boxplot()

Auto.mpg01 %>% 
ggplot(aes(mpg01, weight)) + 
geom_boxplot()

Auto.mpg01 %>% 
ggplot(aes(mpg01, acceleration)) + 
geom_boxplot()

ggplot(Auto.mpg01) +
geom_count(aes(mpg01, cylinders))

ggplot(Auto.mpg01) +
geom_boxplot(aes(mpg01, cylinders))

The correlation of mpg01 with acceleration and year is positive, whereas its correlation with cylinders, displacement, horsepower, and weight is negative.

(Note that the origin is treated as a numeric variable due to the way it was encoded, but it should really be treated as a categorical variable.)

  3. Split the data into a training set and a test set, and perform logistic regression on the training data in order to predict mpg01 for the test observations using the variables that seemed most associated with mpg01 in the previous question. What is the test error of the model obtained?

Solution: we will use a \(70\%/30\%\) split.

# 70%/30% split 
set.seed(31415) # for replicability

tmp <- Auto.mpg01 %>% mutate(id = row_number()) # we create a new variable id corresponding to the row number
Auto.train <- tmp %>% sample_frac(.70)
Auto.test  <- anti_join(tmp, Auto.train, by = 'id')

nrow(Auto.train)
## [1] 274
nrow(Auto.test)
## [1] 118

We start by running one logistic regression analysis (using glm(..., family = binomial)) with the variable year included in the set of predictors (cylinders, displacement, horsepower, weight, acceleration, year), and one without (strictly speaking, we should probably create a categorical variable with levels Old and New instead of the numerical year values, but that’s a problem for another day).

Auto.log <- glm(mpg01 ~ cylinders + displacement + 
    horsepower + weight + acceleration + year,
    data = Auto.train,
    family = binomial)
summary(Auto.log)
## 
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower + 
##     weight + acceleration + year, family = binomial, data = Auto.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3502  -0.1216  -0.0009   0.2216   3.1910  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -10.427487   6.616579  -1.576   0.1150    
## cylinders     -0.532987   0.482273  -1.105   0.2691    
## displacement   0.004309   0.011357   0.379   0.7044    
## horsepower    -0.039762   0.028604  -1.390   0.1645    
## weight        -0.003979   0.001281  -3.105   0.0019 ** 
## acceleration  -0.013896   0.173644  -0.080   0.9362    
## year           0.364917   0.085721   4.257 2.07e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 379.83  on 273  degrees of freedom
## Residual deviance: 112.38  on 267  degrees of freedom
## AIC: 126.38
## 
## Number of Fisher Scoring iterations: 8

Only the variables weight and year are significant at the \(\alpha=0.1\) level (we’re still going to use all of the variables to make the predictions, however).

The “probability” predictions on the test data are obtained using the predict() function:

# the predicted probability that mpg01 = 1
Auto.log.probs <- predict(Auto.log, Auto.test, 
    type = "response")

Auto.log.probs
##            1            2            3            4            5 
## 8.223165e-04 5.175335e-04 1.533244e-06 2.095788e-06 2.065226e-04 
##            6            7            8            9           10 
## 2.468985e-01 2.940733e-01 7.595980e-01 6.530110e-06 4.233838e-05 
##           11           12           13           14           15 
## 5.248310e-05 4.402751e-07 4.965142e-02 7.763576e-01 8.499461e-01 
##           16           17           18           19           20 
## 2.386803e-05 1.475026e-06 1.613606e-04 1.562219e-04 1.101385e-01 
##           21           22           23           24           25 
## 9.186736e-01 8.944352e-01 8.857737e-01 2.068811e-04 3.997363e-02 
##           26           27           28           29           30 
## 1.600116e-01 6.396976e-06 7.681503e-06 2.194514e-01 7.524787e-01 
##           31           32           33           34           35 
## 8.553042e-01 3.278123e-06 6.186946e-01 1.854775e-01 7.433616e-02 
##           36           37           38           39           40 
## 2.174472e-04 1.073646e-01 2.036722e-01 4.605751e-02 3.703533e-05 
##           41           42           43           44           45 
## 9.836101e-01 9.605762e-01 9.625620e-01 9.443120e-01 9.327093e-02 
##           46           47           48           49           50 
## 2.933971e-01 1.522843e-05 4.468975e-05 3.004570e-05 7.982652e-01 
##           51           52           53           54           55 
## 9.644078e-01 3.809157e-01 8.998975e-01 9.672360e-01 1.216884e-01 
##           56           57           58           59           60 
## 7.360486e-02 9.952170e-01 9.782763e-01 9.254961e-01 1.997725e-01 
##           61           62           63           64           65 
## 3.706235e-05 1.269055e-04 9.760979e-01 9.936897e-01 9.943298e-01 
##           66           67           68           69           70 
## 9.810295e-01 9.979466e-01 6.451700e-02 3.804970e-02 1.108955e-02 
##           71           72           73           74           75 
## 9.105604e-01 9.609822e-01 9.235237e-01 1.078530e-01 9.952231e-01 
##           76           77           78           79           80 
## 6.306253e-01 8.366398e-01 9.022938e-03 4.911004e-03 1.785491e-03 
##           81           82           83           84           85 
## 9.974656e-01 9.966332e-01 9.639892e-02 6.114174e-01 9.519968e-01 
##           86           87           88           89           90 
## 9.950282e-01 9.985572e-01 9.963954e-01 9.934700e-01 9.783528e-01 
##           91           92           93           94           95 
## 9.985216e-01 9.959005e-01 7.964433e-01 9.988975e-01 3.331308e-01 
##           96           97           98           99          100 
## 9.707782e-01 9.924896e-01 9.673216e-01 7.835725e-01 9.897617e-01 
##          101          102          103          104          105 
## 9.973371e-01 9.925931e-01 9.788452e-01 7.840992e-01 5.817979e-01 
##          106          107          108          109          110 
## 2.221020e-01 3.786351e-01 9.729914e-01 9.819340e-01 9.645347e-01 
##          111          112          113          114          115 
## 9.988681e-01 9.988207e-01 9.981796e-01 9.972075e-01 9.991459e-01 
##          116          117          118 
## 9.619199e-01 9.989901e-01 9.713495e-01

We use a probability threshold of \(0.5\) for the class predictions (i.e. if probability is smaller than 0.5, predict that mpg01 is 0, else predict that it is 1).

# setting up the categorical response
Auto.log.preds <- rep(0,length(Auto.test$mpg01))
Auto.log.preds[Auto.log.probs > 0.5] <- 1

table(Auto.log.preds)
## Auto.log.preds
##  0  1 
## 53 65

We can now compare how often the prediction matches the reality and build a confusion matrix

table(Auto.log.preds, Auto.test$mpg01)
##               
## Auto.log.preds  0  1
##              0 50  3
##              1  8 57

with accuracy \(\frac{50+57}{118}\approx 90.7\%\), error rate \(\frac{3+8}{118}\approx 9.3\%\), true negative rate \(\frac{50}{50+8}\approx 86.2\%\), and true positive rate \(\frac{57}{57+3}=95\%\).

mean(Auto.log.preds == Auto.test$mpg01)
## [1] 0.9067797
1 - mean(Auto.log.preds == Auto.test$mpg01)
## [1] 0.09322034
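The rates computed by hand above can also be read off the confusion matrix programmatically; a short sketch:

cm <- table(Auto.log.preds, Auto.test$mpg01) # rows: predictions, columns: truth
sum(diag(cm))/sum(cm) # accuracy
cm[1,1]/sum(cm[,1])   # true negative rate
cm[2,2]/sum(cm[,2])   # true positive rate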

The second model doesn’t use the variable year.

# without the variable year
Auto.log <- glm(mpg01 ~ cylinders + displacement 
    + horsepower + weight + acceleration,
    data = Auto.train,
    family = binomial)
summary(Auto.log)
## 
## Call:
## glm(formula = mpg01 ~ cylinders + displacement + horsepower + 
##     weight + acceleration, family = binomial, data = Auto.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5472  -0.1501  -0.0036   0.3566   3.3249  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  14.264874   3.568420   3.998  6.4e-05 ***
## cylinders    -0.546774   0.433090  -1.262   0.2068    
## displacement -0.001175   0.010203  -0.115   0.9083    
## horsepower   -0.047321   0.025437  -1.860   0.0628 .  
## weight       -0.002187   0.001093  -2.000   0.0455 *  
## acceleration -0.032208   0.156340  -0.206   0.8368    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 379.83  on 273  degrees of freedom
## Residual deviance: 137.49  on 268  degrees of freedom
## AIC: 149.49
## 
## Number of Fisher Scoring iterations: 7

Only the variables weight and horsepower are significant at the \(\alpha=0.1\) level (we’re still going to use all of the variables to make the predictions, however).

The “probability” predictions on the test data are:

Auto.log.probs <- predict(Auto.log, Auto.test, 
    type = "response")
Auto.log.preds <- rep(0,length(Auto.test$mpg01))
Auto.log.preds[Auto.log.probs > 0.5] <- 1

table(Auto.log.preds)
## Auto.log.preds
##  0  1 
## 54 64

The confusion matrix is similar, but not identical, to the one from the first model (how close the two matrices are depends, among other things, on the seed used to split the training and testing data):

table(Auto.log.preds, Auto.test$mpg01)
##               
## Auto.log.preds  0  1
##              0 49  5
##              1  9 55

This yields accuracy \(\frac{49+55}{118}\approx 88.1\%\), error rate \(\frac{5+9}{118}\approx 11.9\%\), true negative rate \(\frac{49}{49+9}\approx 84.5\%\), and true positive rate \(\frac{55}{55+5}\approx 91.7\%\).

mean(Auto.log.preds == Auto.test$mpg01)
## [1] 0.8813559
1 - mean(Auto.log.preds == Auto.test$mpg01)
## [1] 0.1186441

Even though the model with the year predictor performs slightly better on the test set than the one without, it is difficult to justify the use of year as a numerical variable in the model.

Consequently, we select the second model for any future prediction endeavour.

(I suppose that we could also create a model with a categorical age variable, but my interest in this dataset is waning even as I type this).

4.3 Boston

(Question 3.7.15 in An Introduction to Statistical Learning, with Applications in R, p. 126.)

This problem involves the MASS’s Boston dataset.

library(MASS)
Boston$chas <- factor(Boston$chas, labels = c("N","Y"))
attach(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    N 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    N 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    N 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    N 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    N 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    N 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7

We are interested in the relationship between the per capita crime rate (response \(Y\)) and the other variables (predictors \(X_j\)).

  1. For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.

Solution: we build simple linear regression models \[\hat{Y}=\beta_{0,j}+\beta_jX_j\] for each predictor \(X_j\) in the dataset, except for \(j=1\) (crim itself, the response), for which we build the intercept-only model \(\hat{Y}=\beta_{0,1}\).

One of the ways in which we can determine if we reject the null hypothesis \(H^j_0:\beta_{j}=0\) for the \(j\)th model is to see if the associated \(p-\)value is below some significance threshold, say \(\alpha=0.05\).

To do so, we fit each of the 14 models (the intercept-only model, plus one simple regression model per predictor), then we extract the \(p-\)value corresponding to the null hypothesis for each model; the values that land above \(\alpha=0.05\) flag predictors for which we cannot reject the possibility that there is no relation between the predictor and the response crim.

Since some of the \(p-\)values are quite small, we plot their logarithms instead, and we look for predictors for which \(\ln(p-\text{value})>-3\), which correspond to a \(p-\)value greater than \(\approx 0.05\).

# initialize model, pval, and coeff objects
model=list()
pval=c()
coeff=c()

# build the first model (constant response), and extract the corresponding p-value and coefficient
model[[1]] <- lm(crim ~ 1)
pval[1] <- summary(model[[1]])$coefficients[1,4]
coeff[1] <- coefficients(model[[1]])[1]

model
## [[1]]
## 
## Call:
## lm(formula = crim ~ 1)
## 
## Coefficients:
## (Intercept)  
##       3.614
pval
## [1] 1.262593e-19
coeff
## [1] 3.613524
# repeat the procedure for each of the other predictors
for(i in 1:(ncol(Boston)-1)){
  model[[i+1]] <- lm(crim ~., data=Boston[,c(1,i+1)])
  pval[i+1] <- summary(model[[i+1]])$coefficients[2,4]
  coeff[i+1] <- coefficients(model[[i+1]])[2]
}

# combine results
results<-data.frame(pval)
rownames(results)<-colnames(Boston)

# plot p-values
plot(log(pval), pch=20, ylim=c(-140, 0), 
     xaxt="n", xlab="Predictor", ylab="log(pval)")

# Plot the axis separately
axis(1, at=1:14, labels=rownames(results))
abline(h=-3,col=2)

The only predictor for which we cannot conclude that there is a statistically significant association with the response is chas (\(X_3\)), with \(p-\)value \(\approx 0.21\).

# p-value of predictor X3
pval[4]
## [1] 0.2094345
  2. Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis \(H_0:\beta_j=0\)?

Solution: the output for the full regression model \[\hat{Y}=\beta_1+\sum_{j=2}^{14}\beta_jX_j\] is shown below.

# building the full multiple regression model
lm.full = lm(crim~., data=Boston)
summary(lm.full)
## 
## Call:
## lm(formula = crim ~ ., data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.924 -2.120 -0.353  1.019 75.051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.033228   7.234903   2.354 0.018949 *  
## zn            0.044855   0.018734   2.394 0.017025 *  
## indus        -0.063855   0.083407  -0.766 0.444294    
## chasY        -0.749134   1.180147  -0.635 0.525867    
## nox         -10.313535   5.275536  -1.955 0.051152 .  
## rm            0.430131   0.612830   0.702 0.483089    
## age           0.001452   0.017925   0.081 0.935488    
## dis          -0.987176   0.281817  -3.503 0.000502 ***
## rad           0.588209   0.088049   6.680 6.46e-11 ***
## tax          -0.003780   0.005156  -0.733 0.463793    
## ptratio      -0.271081   0.186450  -1.454 0.146611    
## black        -0.007538   0.003673  -2.052 0.040702 *  
## lstat         0.126211   0.075725   1.667 0.096208 .  
## medv         -0.198887   0.060516  -3.287 0.001087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4396 
## F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16
# diagnostic plots
par(mfrow=c(2,2))
plot(lm.full)

In the full model, we can only reject the null hypothesis \(H_0:\beta_j=0\) at \(\alpha=0.05\) for zn, dis, rad, black, and medv.

The diagnostic plots for the full model are… well, does it look to you as though the various assumptions required to use a linear model are met?

  3. How do the results of the simple linear regressions compare to those of the full linear regression?

Solution: we create a plot showing the relationship between the simple regression coefficients and the multiple regression coefficients (each predictor is displayed as a single point on the plot).

What would you expect to see here?

# multiple regression coefficient
y=coefficients(lm.full)[2:14]

# recall that the simple regression coefficients were stored in coeff
par(mfrow=c(1,1))
plot(coeff[2:14], y, xlab="Simple Regression Coefficients", ylab="Multiple Regression Coefficients")

In most cases, the coefficient from the simple regression is fairly close to the corresponding coefficient from the multiple regression, except for one predictor (bottom right corner). Which predictor is this?

colnames(Boston)[which(coeff[2:14]> 30)+1]
## [1] "nox"

Let’s zoom in on the upper left corner to get a better sense of the relationship among the non-nox predictors.

plot(coeff[-c(1,5)],y[-c(4)], xlab="Simple Regression Coefficients", ylab="Multiple Regression Coefficients")

We can see that there is still a fair amount of variation, save for a small group of predictors for which the simple regression coefficients lie between \(-0.5\) and \(0.75\) (eyeballing things, here).

colnames(Boston)[(which(coeff[2:14]>-1 & coeff[2:14]<= 0.55))+1]
## [1] "zn"    "indus" "age"   "tax"   "black" "lstat" "medv"

The predictors for which both coefficients are “mostly” the same are thus zn, indus, age, tax, black, lstat, and medv.

  4. Is there evidence of non-linear association between any of the predictors and the response?

Solution: to answer this question, we will fit, for every predictor \(X_j\), a cubic polynomial model

\[\hat{Y}=\beta_{0,j}+\beta_{1,j}X_j+\beta_{2,j}X_j^2+\beta_{3,j}X_j^3, \] except for chas, as polynomial regression is nonsensical for categorical features.

We are interested in finding associated \(p-\)values below \(0.05\) for the higher order terms of each model \(j\), in order to reject the null hypotheses \[H_{0,j}: \beta_{2,j}=\beta_{3,j}=0 \quad \text{against}\quad H_{1,j}: \beta_{2,j}^2+\beta_{3,j}^2\neq 0\] in the \(j\)th model.

To do so, we repeat the procedure of the first question.

# remove chas from the dataset
Boston2 <-Boston[,-c(4)]
detach(Boston)
attach(Boston2)

# initialize model, pval, and p.array (higher degree) objects
model=list()
pval=list()
p.array=matrix(rep(NA,26),nrow=13,ncol=2)

# build the first model (constant response), and extract the corresponding p-value 
model[[1]] <- lm(crim ~ 1)
pval[[1]] <- summary(model[[1]])$coefficients[1,4]

# repeat the procedure for each of the other predictors
for(i in 1:(ncol(Boston2)-1)){
  model[[i+1]] <- lm(crim ~ poly(Boston2[,i+1], 3), data=Boston2)
  pval[[i+1]] <- summary(model[[i+1]])$coefficients[3:4,4]
  p.array[i+1,1] <- as.numeric(pval[[i+1]][1]) # quadratic term
  p.array[i+1,2] <- as.numeric(pval[[i+1]][2]) # cubic term
}

rownames(p.array)<-c("crim","zn","indus","nox","rm",
                     "age","dis","rad","tax","ptratio",
                     "black","lstat","medv")
p.array<-p.array[2:13,]

We find evidence of a non-linear association between crim and all predictors, except for black (and chas, of course).

rowSums(p.array< 0.05)
##      zn   indus     nox      rm     age     dis     rad     tax ptratio 
##       1       2       2       1       2       2       1       1       2 
##   black   lstat    medv 
##       0       1       2
detach(Boston2)
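
Note that the two \(p-\)values recorded per model are the individual \(t-\)tests on the quadratic and cubic coefficients; the joint null hypothesis \(H_{0,j}\) can also be tested directly by comparing nested models with an \(F-\)test. Here is a minimal sketch for a single predictor (we use dis for illustration, assuming MASS::Boston is still loaded):

# joint F-test of H0: beta_2 = beta_3 = 0 for the predictor dis
fit.linear <- lm(crim ~ poly(dis, 1), data=Boston)
fit.cubic  <- lm(crim ~ poly(dis, 3), data=Boston)
anova(fit.linear, fit.cubic)   # a small p-value supports a non-linear association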

4.4 ANOVA

Suppose that a researcher wants to determine if, as she believes, a new teaching method enables students to understand elementary statistical concepts better than the traditional lectures given in a university setting.

She recruits \(N=80\) second-year students to test her claim. The students are randomly assigned to one of two groups:

  • students in group \(A\) are given the traditional lectures,

  • whereas students in group \(B\) are taught using the new teaching method.

After three weeks, a short quiz is administered to the students in order to assess their understanding of statistical concepts.

The results are found in the teaching.csv dataset.

teaching <- read.csv("Data/teaching.csv", header = TRUE)
str(teaching)
## 'data.frame':    80 obs. of  3 variables:
##  $ ID   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: Factor w/ 2 levels "A","B": 2 2 1 1 2 1 2 2 2 2 ...
##  $ Grade: num  75.5 77.5 73.5 75 77 79 79 75.5 80 79.5 ...
summary(teaching)
##        ID        Group      Grade      
##  Min.   : 1.00   A:40   Min.   :68.50  
##  1st Qu.:20.75   B:40   1st Qu.:75.00  
##  Median :40.50          Median :77.25  
##  Mean   :40.50          Mean   :77.06  
##  3rd Qu.:60.25          3rd Qu.:79.12  
##  Max.   :80.00          Max.   :84.00
head(teaching)
##   ID Group Grade
## 1  1     B  75.5
## 2  2     B  77.5
## 3  3     A  73.5
## 4  4     A  75.0
## 5  5     B  77.0
## 6  6     A  79.0

ANOVA is a statistical method that partitions a dataset’s variability into explainable variability (model-based) and unexplained variability (error) using various statistical models, to determine whether (multiple) treatment groups have significantly different group means.

The total sample variability of a feature \(y\) in a dataset is defined as \[ \text{SS}_{\textrm{tot}}=\sum_{k=1}^{N}(y_{k}-\bar{y})^{2}, \] where \(\bar{y}\) is the overall mean of the data.

  1. Plot the students’ scores, ordered by student ID, and the class mean \(\mu\), without reference to the assigned treatment group. Is there a pattern?

Solution: since the assignment of student ID is arbitrary when it comes to performance (at least, in theory), we should not observe any pattern.

# class mean
(mu = mean(teaching$Grade))
## [1] 77.0625
# scatterplot
plot(teaching$ID,teaching$Grade)

# add a horizontal line at the mean level
abline(h = mu)

If we had to guess someone’s score with no knowledge except for their participant ID, then picking the sample mean is as good a guess as any other.


Statistically speaking, this means that the null model is \[ y_{i,j}=\mu+\varepsilon_{i,j}, \] where \(\mu\) is the class mean, \(i\in\{A,B\}\) refers to the teaching method, and \(j=1,\ldots,40\) indexes the students in each group.

This model does not explain any of the variability in the student scores other than through a noise term.

But the students DID NOT all receive the same treatment: \(40\) randomly selected students were assigned to group \(A\), and the other \(40\) to group \(B\), and each group was taught using a different method.

  1. Plot the students’ scores, ordered by student ID, and the class mean \(\mu\), but identify the students that received each of the treatments. Is there a pattern?

Solution: we mark each group with its own colour and plotting symbol, and add the group means to the plot.
# compute the means in each treatment group
means.by.group = aggregate(teaching[, 3], list(teaching$Group), mean)

# plot students from the two groups using different markers
plot(teaching$ID,teaching$Grade, 
     col=c("red","blue")[as.numeric(teaching$Group)], 
     pch=c(15,17)[as.numeric(teaching$Group)])

# add the class mean and the mean in each group to the plot
abline(h = mu)
abline(h = means.by.group[2,2], col="blue", lwd=2, lty=2)  # group B mean
abline(h = means.by.group[1,2], col="red", lwd=2, lty=2)   # group A mean

# add legend
legend("topleft", legend=c("Group A", "Group B"),
       col=c("red", "blue"), pch=c(15,17), cex=0.8)

We clearly see that the two study groups show different characteristics in terms of their average scores.

With the group assignment information, we can refine our null model into the treatment-based model \[ y_{i,j}=\mu_{i}+\varepsilon_{i,j}, \] where the \(\mu_i\), \(i\in\{A,B\}\), represent the group means.


Using this model, we can decompose \(\text{SS}_{\textrm{tot}}\) into between-treatment sum of squares and error (within-treatment) sum of squares as \[ \text{SS}_{\textrm{tot}}=\sum_{i,j}(y_{i,j}-\bar{y})^{2}=\sum_{i,j}(y_{i,j}-\bar{y}_{i}+\bar{y}_{i}-\bar{y})^{2} =\sum_{i}N_{i}(\bar{y}_{i}-\bar{y})^{2}+\sum_{i,j}(y_{i,j}-\bar{y}_{i})^2=\text{SS}_{\textrm{treat}}+\text{SS}_{\textrm{e}}. \] The \(\text{SS}_{\textrm{treat}}\) component looks at the difference between each of the treatment means and the overall mean, which we consider to be explainable; the \(\text{SS}_{\textrm{e}}\) component, on the other hand, looks at the difference between each observation and its own group mean, and is considered to be random.

The spread about each group mean is fairly large (relatively-speaking), so the researcher would be right to suggest that the treatment-based model does not capture all the variability in the data on its own, but we will not pursue this further.

In general, \(\text{SS}_{\textrm{treat}}/\text{SS}_{\textrm{tot}}\times 100\%\) of the total variability can be explained using a treatment-based model. This ratio is the coefficient of determination, denoted by \(R^{2}\).
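
We can verify the decomposition (and the ratio) numerically on the teaching data; a quick sketch:

# overall mean, group means, and group sizes
y.bar <- mean(teaching$Grade)
group.means <- tapply(teaching$Grade, teaching$Group, mean)
group.sizes <- tapply(teaching$Grade, teaching$Group, length)

# the three sums of squares
SS.tot   <- sum((teaching$Grade - y.bar)^2)
SS.treat <- sum(group.sizes*(group.means - y.bar)^2)
SS.e     <- sum((teaching$Grade - group.means[teaching$Group])^2)

# SS.tot should equal SS.treat + SS.e; the last entry is R^2
c(SS.tot, SS.treat + SS.e, SS.treat/SS.tot)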

  1. Find the ANOVA table for the treatment-based model. Does the evidence suggest that the \(2-\)treatment model is significant?

Solution: we can recover the ANOVA table in R by first running a linear regression on the data.

# run the linear regression
model.lm <- lm(Grade ~ Group, data = teaching)

# compute the ANOVA table
SS.Table <- anova(model.lm)
SS.Table
## Analysis of Variance Table
## 
## Response: Grade
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Group      1 300.31 300.312  49.276 7.227e-10 ***
## Residuals 78 475.38   6.095                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test statistic \(F_{0}\) follows an \(F\)-distribution with \[(\text{df}_{\textrm{treat}}, \text{df}_{\textrm{e}})=(1,n_1+n_2-2)=(1,78)\] degrees of freedom.

At a significance level of \(\alpha=0.05\), the critical value \(F^*=F_{0.95, 1, 78}=3.96\) is substantially smaller than the test statistic \(F_{0}=49.28\), implying that the two-treatment model is statistically significant.
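
The critical value quoted above can be obtained directly from the quantile function of the \(F\) distribution:

# critical value for alpha = 0.05 with (1, 78) degrees of freedom
qf(0.95, df1=1, df2=78)   # approximately 3.96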

This, in turn, means that the model recognises a statistically significant difference between the students’ scores, based on the teaching methods.

The coefficient of determination \(R^2\) measures the proportion of the total variability explained by the model. We can compute it directly from the ANOVA table produced by lm(): \[R^{2}=\frac{\text{SS}_{\textrm{treat}}}{\text{SS}_{\textrm{tot}}}=\frac{300.31}{775.69}\approx 0.39,\] or we can extract it from model.lm:

# provide a summary of the linear regression model
summary(model.lm)
## 
## Call:
## lm(formula = Grade ~ Group, data = teaching)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.625 -1.500  0.000  1.906  5.000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  75.1250     0.3903  192.46  < 2e-16 ***
## GroupB        3.8750     0.5520    7.02 7.23e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 78 degrees of freedom
## Multiple R-squared:  0.3872, Adjusted R-squared:  0.3793 
## F-statistic: 49.28 on 1 and 78 DF,  p-value: 7.227e-10
# list the attributes of the summary object, to locate r.squared
attributes(summary(model.lm))
## $names
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled" 
## 
## $class
## [1] "summary.lm"
# extract R2
(R2 = summary(model.lm)$r.squared)
## [1] 0.3871566

Either way, we find \(R^2\approx 0.39\), which means that \(39\%\) of the total variation in the data can be explained by our two-treatment model.

Is this good enough? That depends on the specifics of the situation (in particular, on the researchers’ needs).

  1. As with most statistical procedures, ANOVA relies on certain assumptions for its results to be valid. The main assumption is that the error terms \(\varepsilon_{i,j}\) are independent and identically distributed (i.i.d.) normal random variables; for now, we take the independence part as given, without proof.

Test the three remaining assumptions:

  • normality of the error terms;

  • constant variance (within treatment groups), and

  • equal variances (across treatment groups).

Solution: we can get an idea as to whether these are valid by looking at the diagnostic plots associated with the model.

Normality of the errors can be tested visually with the help of a normal-QQ plot, which compares the standardized residuals quantiles against the theoretical quantiles of the standard normal distribution \(N(0,1)\) (a straight line indicates normality… some departures in the tails are allowed).

To test the assumptions of constant and equal variance, we can visually inspect the plot of residuals vs. fitted values, and/or of residuals vs. order/time.

plot(model.lm)

The visual evidence seems to support the use of the two-treatment model.
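
If a visual inspection feels too subjective, base R also offers formal tests; a short sketch (these were not part of the original analysis):

# Shapiro-Wilk test for normality of the residuals
shapiro.test(residuals(model.lm))

# Bartlett test for equality of variances across the treatment groups
bartlett.test(Grade ~ Group, data=teaching)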

4.5 Hydrocarbons

Consider the following data, consisting of \(n=20\) paired measurements \((x_i,y_i)\) of hydrocarbon levels (\(x\)) and pure oxygen levels (\(y\)) in fuels:

x=c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
      1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95)    
y=c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65, 
      93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33)
  1. Describe the relationship between \(x\) and \(y\). What is the strength of association between \(x\) and \(y\)?

Solution: we can compute the strength of association using the correlation \[\rho=\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}=\frac{\sum_i (x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum_i (x_i-\overline{x})^2\sum_i (y_i-\overline{y})^2}}.\]

First we plot the data:

# produce the scatterplot of x,y
plot(x,y)

It seems that the points lie around a hidden line!

The correlation can be computed directly with cor():

# compute the correlation rho
cor(x,y) 
## [1] 0.9367154

or computed using the formula:

Sxy=sum((x-mean(x))*(y-mean(y)))
Sxx=sum((x-mean(x))^2)
Syy=sum((y-mean(y))^2)
(rho=Sxy/(sqrt(Sxx*Syy)))
## [1] 0.9367154

Either way, the correlation is pretty strong, which is aligned with our observation that the points are scattered around a hidden line.

  1. Find and plot the line of best fit for the data.

Solution: we use the lm() function.

# build a data frame with the variables 
fuels=data.frame(x,y)

# run the linear regression
model <- lm(y ~ x, data=fuels)   

# display the model summary
summary(model) 
## 
## Call:
## lm(formula = y ~ x, data = fuels)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.83029 -0.73334  0.04497  0.69969  1.96809 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   74.283      1.593   46.62  < 2e-16 ***
## x             14.947      1.317   11.35 1.23e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.087 on 18 degrees of freedom
## Multiple R-squared:  0.8774, Adjusted R-squared:  0.8706 
## F-statistic: 128.9 on 1 and 18 DF,  p-value: 1.227e-09

It seems that the line of best fit is \(y=b_0+b_1x= 74.283+14.947x\); the predicted values are \(\hat{y}_i=74.283+14.947x_i\), and the residuals are \(e_i=y_i-\hat{y}_i\).

We can plot the line of best fit and the residuals directly:

# for line of best fit, residual plots
library(ggplot2)   

# line of best fit
ggplot(model) + geom_point(aes(x=x, y=y)) +   
    geom_line(aes(x=x, y=.fitted), color="blue" ) +
    theme_bw()

# residuals
ggplot(model) + geom_point(aes(x=x, y=y)) +   ### plotting residuals
    geom_line(aes(x=x, y=.fitted), color="blue" ) + 
    geom_linerange(aes(x=x, ymin=.fitted, ymax=y), color="red") +
    theme_bw()

  1. What is the estimated variance of the noise (the residuals)?

Solution: if the linear regression assumptions are satisfied, we expect the residuals to follow a \(N(0,\sigma^2)\) distribution. The mean squared error (MSE) is \[ \hat{\sigma}^2 =\frac{\sum_ie_i^2}{n-2}=\frac{\sum_i(y_i-\hat{y}_i)^2}{n-2}=\frac{S_{yy}-b_1S_{xy}}{n-2},\] where \(n\) represents the number of observations.

To compute the variance in a full population, we would use a denominator of \(n\). For a population sample, we would use \(n-1\). For the regression error, we use \(n-2\) because \(2\) parameters had to be estimated in order to obtain \(\hat{y}_i\): \(b_0\) and \(b_1\).

We can compute it directly using the formula:

# number of observations 
n=length(x)

# extracting b1 as the 2nd entry in model$coefficients (verify!)
sigma2 = (Syy-as.numeric(model$coefficients[2])*Sxy)/(n-2)   
sigma2
## [1] 1.180545

But we can also extract it directly from the model summary:

# to find the estimated sigma (note that attributes(model) does not list sigma)
attributes(summary(model))
## $names
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled" 
## 
## $class
## [1] "summary.lm"
# getting the MSE from the summary
summary(model)$sigma^2  
## [1] 1.180545

The estimate is \(\hat{\sigma}^2\approx 1.18\).


5. DESCRIPTIVE STATISTICS

We can use R and ggplot2 to present information about univariate datasets.

5.1 Midterm Results

The grades for a midterm are shown below. Discuss the results.

grades<-c(80,73,83,60,49,96,87,87,60,53,66,83,32,80,66,90,72,55,76,46,48,69,45,48,77,52,59,97,76,89,73,73,48,59,55,76,87,55,80,90,83,66,80,97,80,55,94,73,49,32,76,57,42,94,80,90,90,62,85,87,97,50,73,77,66,35,66,76,90,73,80,70,73,94,59,52,81,90,55,73,76,90,46,66,76,69,76,80,42,66,83,80,46,55,80,76,94,69,57,55,66,46,87,83,49,82,93,47,59,68,65,66,69,76,38,99,61,46,73,90,66,100,83,48,97,69,62,80,66,55,28,83,59,48,61,87,72,46,94,48,59,69,97,83,80,66,76,25,55,69,76,38,21,87,52,90,62,73,73,89,25,94,27,66,66,76,90,83,52,52,83,66,48,62,80,35,59,72,97,69,62,90,48,83,55,58,66,100,82,78,62,73,55,84,83,66,49,76,73,54,55,87,50,73,54,52,62,36,87,80,80)

Solution: we can start by building a histogram of the grades.

hist(grades, breaks = seq(20,100,10))

We can find the mode of the data with this function:

# mode = most frequent value: tabulate, sort by decreasing frequency, keep the first
fun.mode<-function(x){as.numeric(names(sort(-table(x)))[1])}
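
For instance, the most frequent grade is given by:

fun.mode(grades)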

Let’s spruce up the histogram a bit by adding some details through ggplot2:

library(ggplot2)
ggplot(data=data.frame(grades), aes(grades)) + 
    geom_histogram(aes(y =..density..),           # approximated pdf
                 breaks=seq(20, 100, by = 10),    # 8 bins from 20 to 100 
                 col="black",                     # colour of outline
                 fill="blue",                     # fill colour of bars
                 alpha=.2) +                      # transparency
    geom_density(col=2) +                         # colour of pdf curve
    geom_rug(aes(grades)) +                       # adding a rug on x-axis
    geom_vline(aes(xintercept = mean(grades)),col='red',size=2) + # vertical line for mean
    geom_vline(aes(xintercept = median(grades)),col='darkblue',size=2) + # vertical line for median
    geom_vline(aes(xintercept = fun.mode(grades)),col='black',size=2) # vertical line for mode

We can also visualize the data using a boxplot:

boxplot(grades)

And get summary statistics using summary() and psych’s describe():

summary(grades)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   55.00   70.00   68.74   82.50  100.00
library(psych)
describe(grades)
##    vars   n  mean    sd median trimmed   mad min max range  skew kurtosis
## X1    1 211 68.74 17.37     70   69.43 19.27  21 100    79 -0.37    -0.46
##     se
## X1 1.2

How is the class doing?

6. CONFIDENCE INTERVALS

What can we say about confidence intervals?

6.1 Interpretation

A sample of size \(n=17\) is selected from a normal population with mean \(\mu=-3\) (which is unknown) and standard deviation \(\sigma=2\), which is known.

  1. Build a \(95\%\) confidence interval for \(\mu\).

Solution: we sample using rnorm().

set.seed(0)  # for replicability
n=17
mu=-3
sigma=2
x=rnorm(n,mu,sigma)

The histogram of the sample looks like:

hist(x)

The sample mean is given by:

mean(x)
## [1] -2.804917

The \(95\%\) confidence interval is

alpha = 0.05
lower.bound = mean(x) + qnorm(alpha/2)*sigma/sqrt(n)   # read the documentation for qnorm(), to be certain to use the right form
upper.bound = mean(x) + qnorm(1-alpha/2)*sigma/sqrt(n)
c(lower.bound,upper.bound)
## [1] -3.755640 -1.854195

We notice that \(\mu=-3\) is indeed found in the confidence interval:

lower.bound<mu & mu<upper.bound
## [1] TRUE
  1. Repeat the process \(N=1000\) times. How often does \(\mu\) fall in the confidence interval?

Solution: we just put all the code together in one routine:

set.seed(0)  # for replicability

# parameters
n=17
mu=-3
sigma=2
alpha=0.05
N=1000

# initializing the vector which determines if mu is in the C.I.
is.mu.in <- c() 

# initializing the vector of sample means
sample.means <- c() 

for(j  in 1:N){
  x=rnorm(n,mu,sigma)
  sample.means[j] = mean(x)
  lower.bound = sample.means[j] + qnorm(alpha/2)*sigma/sqrt(n)
  upper.bound = sample.means[j] + qnorm(1-alpha/2)*sigma/sqrt(n)
  is.mu.in[j] = lower.bound<mu & mu<upper.bound
}

# proportion of the times when mu was in the C.I.
table(is.mu.in)/N
## is.mu.in
## FALSE  TRUE 
## 0.055 0.945
# histogram of the sample means (compare with histogram of sample)
hist(sample.means)

What happens if the parameters n, alpha, and/or N are modified?
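
One way to explore this is to wrap the simulation in a function and vary its parameters; a small sketch (the function name coverage is ours):

# empirical coverage of the z-based confidence interval
coverage <- function(n, alpha, N, mu=-3, sigma=2){
  hits <- replicate(N, {
    x <- rnorm(n, mu, sigma)
    lower <- mean(x) + qnorm(alpha/2)*sigma/sqrt(n)
    upper <- mean(x) + qnorm(1-alpha/2)*sigma/sqrt(n)
    lower < mu & mu < upper
  })
  mean(hits)   # proportion of intervals that contain mu
}

coverage(n=17, alpha=0.05, N=1000)    # should be near 0.95
coverage(n=100, alpha=0.01, N=2000)   # should be near 0.99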

  1. Build a \(95\%\) confidence interval assuming that the sample of size \(n=17\) is drawn from a normal population where neither the mean \(\mu\) nor the variance \(\sigma^2\) is known.

Solution: we’ll use the same sample as in the first question, to make the comparison more meaningful.

set.seed(0)  # for replicability
n=17
mu=-3
sigma=2
x=rnorm(n,mu,sigma)

The sample mean and the sample variance are given by:

mean(x)
## [1] -2.804917
var(x)
## [1] 4.286731

The \(95\%\) confidence interval in this case is

# read the documentation for qt(), to make sure you are using it properly
alpha = 0.05
lower.bound = mean(x) + qt(alpha/2, df = n-1)*sqrt(var(x))/sqrt(n)   
upper.bound = mean(x) + qt(1-alpha/2, df = n-1)*sqrt(var(x))/sqrt(n)
c(lower.bound,upper.bound)
## [1] -3.869441 -1.740394

We notice that \(\mu=-3\) is indeed found in the confidence interval:

lower.bound<mu & mu<upper.bound
## [1] TRUE
  1. Repeat the process \(N=1000\) times. How often does \(\mu\) fall in the confidence interval?

Solution: we just put all the code together in one routine:

set.seed(0)  # for replicability

# parameters
n=17
mu=-3
sigma=2
alpha=0.05
N=1000

# initializing the vector which determines if mu is in the C.I.
is.mu.in <- c() 

# initializing the vector of sample means
sample.means <- c() 

for(j  in 1:N){
  x=rnorm(n,mu,sigma)
  sample.means[j] = mean(x)
  lower.bound = sample.means[j] + qt(alpha/2, df = n-1)*sqrt(var(x))/sqrt(n)
  upper.bound = sample.means[j] + qt(1-alpha/2, df = n-1)*sqrt(var(x))/sqrt(n)
  is.mu.in[j] = lower.bound<mu & mu<upper.bound
}

# proportion of the times when mu was in the C.I.
table(is.mu.in)/N
## is.mu.in
## FALSE  TRUE 
##  0.05  0.95
# histogram of the sample means (compare with histogram of sample)
hist(sample.means)

Any surprises?

6.2 Bootstrapping

This is Exercise 5.4.9 in JWHT (James, Witten, Hastie & Tibshirani, An Introduction to Statistical Learning), p. 201: we use the bootstrap to estimate the sampling variability of various statistics of medv in the Boston dataset.

library(MASS)

set.seed(1234)
attach(Boston)

# a) point estimate of the population mean of medv
medv.mean = mean(medv)
medv.mean
## [1] 22.53281
# b) estimated standard error of the sample mean
medv.err = sd(medv)/sqrt(length(medv))
medv.err
## [1] 0.4088611
# c) bootstrap estimate of the standard error of the mean
boot.fn = function(data, index){return(mean(data[index]))}
library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
## 
##     logit
bstrap = boot(medv, boot.fn, 1000)
bstrap
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1* 22.53281 0.0199747   0.4173862
sd.medv = sd(bstrap$t)   # standard error of the mean, estimated from the bootstrap replicates

# d) 95% CI for the mean: t.test() vs. a bootstrap-based interval
t.test(medv)
## 
##  One Sample t-test
## 
## data:  medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281
c(bstrap$t0 - 2*sd.medv, bstrap$t0 + 2*sd.medv)
## [1] 21.69803 23.36758
# e) point estimate of the population median of medv
medv.med = median(medv)
medv.med
## [1] 21.2
# f) bootstrap estimate of the standard error of the median
boot.fn = function(data, index){
    return(median(data[index]))}
boot(medv, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 -0.0243   0.3736497
# g) point estimate of the tenth percentile of medv
medv.tenth = quantile(medv, c(0.1))
medv.tenth
##   10% 
## 12.75
# h) bootstrap estimate of the standard error of the tenth percentile
boot.fn = function(data, index){
    return(quantile(data[index], c(0.1)))}
boot(medv, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0389   0.5016258
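
Incidentally, boot.ci() (also from the boot package) can compute confidence intervals from the bootstrap replicates directly; a sketch (the object name bstrap.q10 is ours, re-using the last boot.fn above):

# percentile-type 95% interval for the tenth percentile of medv
bstrap.q10 = boot(medv, boot.fn, 1000)
boot.ci(bstrap.q10, type = "perc")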


7. HYPOTHESIS TESTING

There are built-in functions in R that allow for hypothesis testing: for instance, t.test() for one- and two-sample tests of the mean, prop.test() for proportions, var.test() for comparing two variances, and chisq.test() for independence.

For all these tests, we reject the null hypothesis \(H_0\) at significance level \(\alpha\) if the \(p-\)value of the test is below \(\alpha\) (which means that the probability of wrongly rejecting \(H_0\) when \(H_0\) is in fact true is below \(\alpha\), usually taken to be \(0.05\) or \(0.01\)).

If the \(p-\)value of the test is greater than the significance level \(\alpha\), then we fail to reject the null hypothesis \(H_0\) at significance level \(\alpha\), WHICH IS NOT THE SAME AS ACCEPTING THE NULL HYPOTHESIS!!

Note that the \(p-\)value for the test will appear in the output, but it can also be computed directly using the appropriate formula. The corresponding \(95\%\) confidence intervals also appear in the output.

7.1 Artificial Examples

  1. Let’s say that we have a small dataset with \(n=7\) observations:
x=c(4,5,4,6,4,4,5)

Let \(\mu_x\) be the true mean of whatever distribution the sample came from. Is it conceivable that \(\mu_x=5\)?

Solution: we can test for \(H_0: \mu_x=5\) against \(H_1:\mu_x\neq 5\) simply by calling

t.test(x,mu=5)
## 
##  One Sample t-test
## 
## data:  x
## t = -1.4412, df = 6, p-value = 0.1996
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
##  3.843764 5.299093
## sample estimates:
## mean of x 
##  4.571429

All the important information is in the output: the observed test statistic \(t_0=-1.4412\), to be compared against Student’s \(T-\)distribution with \(n-1=6\) degrees of freedom, the probability of wrongly rejecting \(H_0\) if it were in fact true (\(p-\)value \(=0.1996\)), and the \(95\%\) confidence interval \((3.843764, 5.299093)\) for \(\mu_x\), whose point estimate is \(\overline{x}=4.571429\).

Since the \(p-\)value is greater than \(\alpha=0.05\), we fail to reject the null hypothesis that \(\mu_x=5\); there is not enough evidence in the data to categorically state that \(\mu_x\neq 5\).

(is it problematic that the sample size \(n=7\) is small?)
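
The reported \(p-\)value can also be recovered from first principles; a short check:

# two-sided p-value, computed by hand
t0 = (mean(x) - 5)/(sd(x)/sqrt(length(x)))
2*pt(-abs(t0), df=length(x)-1)   # should match 0.1996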

  1. Let’s say that now we have a small dataset with \(n=9\) observations:
y=c(1,2,1,4,3,2,4,3,2)

Let \(\mu_y\) be the true mean of whatever distribution the sample came from. Is it conceivable that \(\mu_y=5\)?

Solution: we can test for \(H_0: \mu_y=5\) against \(H_1:\mu_y\neq 5\) simply by calling

t.test(y,mu=5)
## 
##  One Sample t-test
## 
## data:  y
## t = -6.7823, df = 8, p-value = 0.0001403
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
##  1.575551 3.313338
## sample estimates:
## mean of x 
##  2.444444

The \(p-\)value is \(0.0001403\), which is substantially smaller than \(\alpha=0.05\), and we reject the null hypothesis that the true mean is \(5\). The test provides no information about what the true mean could be, but the \(95\%\) confidence interval \((1.575551, 3.313338)\) does: we would expect \(\mu_y\approx 2.5\).

  1. Is it conceivable that \(\mu_y=2.5\)?

Solution: let’s run

t.test(y,mu=2.5)
## 
##  One Sample t-test
## 
## data:  y
## t = -0.14744, df = 8, p-value = 0.8864
## alternative hypothesis: true mean is not equal to 2.5
## 95 percent confidence interval:
##  1.575551 3.313338
## sample estimates:
## mean of x 
##  2.444444

Large \(p-\)value, much above \(\alpha=0.05\)… we’d like to say that this clinches it: we accept the null hypothesis! But no, no. We can’t do that. All we can say, sadly, is that we do not have enough evidence to reject the null hypothesis that \(\mu_y=2.5\).

7.2 Teaching

Suppose that a researcher wants to determine if, as she believes, a new teaching method enables students to understand elementary statistical concepts better than the traditional lectures given in a university setting.

She recruits \(N=80\) second-year students to test her claim. The students are randomly assigned to one of two groups:

  • students in group \(A\) are given the traditional lectures,

  • whereas students in group \(B\) are taught using the new teaching method.

After three weeks, a short quiz is administered to the students in order to assess their understanding of statistical concepts.

The results are found in the teaching.csv dataset.

teaching <- read.csv("Data/teaching.csv", header = TRUE)
head(teaching)
##   ID Group Grade
## 1  1     B  75.5
## 2  2     B  77.5
## 3  3     A  73.5
## 4  4     A  75.0
## 5  5     B  77.0
## 6  6     A  79.0

Is there enough evidence to suggest that the new teaching method is more effective (as measured by test performance)?

Solution: we can summarize the results (sample size, sample mean, sample variance) as follows:

counts.by.group = aggregate(x = teaching$Grade, 
                            by = list(teaching$Group),
                            FUN = length)

means.by.group = aggregate(x = teaching$Grade, 
  by = list(teaching$Group),
  FUN = mean)

variances.by.group = aggregate(x = teaching$Grade, 
                           by = list(teaching$Group),
                           FUN = var)


# full_join() and the pipe operator %>% are provided by dplyr
library(dplyr)

teaching.summary <- counts.by.group %>% 
                      full_join(means.by.group,
                                by="Group.1" ) %>%
                                  full_join(variances.by.group,
                                            by="Group.1" )
  
colnames(teaching.summary) <- c("Group", 
                                "Sample Size", 
                                "Sample Mean", 
                                "Sample Variance")

teaching.summary
##   Group Sample Size Sample Mean Sample Variance
## 1     A          40      75.125        6.650641
## 2     B          40      79.000        5.538462

If the researcher assumes that both groups have similar background knowledge prior to being taught (which she attempts to enforce by randomising the group assignment), then the effectiveness of the teaching methods may be compared using two hypotheses: the null hypothesis \(H_0\) and the alternative \(H_{1}\).

Let \(\mu_i\) represent the true performance of method \(i\).

Since the researcher wants to claim that the new method is more effective than the traditional ones, it is most appropriate for her to use one-sided hypothesis testing with \[H_{0}: \mu_{A} \geq \mu_{B} \quad\mbox{against}\quad H_{1}: \mu_{A} < \mu_{B}.\]

The testing procedure is simple:

  1. calculate an appropriate test statistic under \(H_0\);

  2. reject \(H_0\) in favour of \(H_1\) if the test statistic falls in the critical region (also called the rejection region) of an associated distribution, and

  3. fail to reject \(H_0\) otherwise (WHICH I REMIND YOU IS NOT THE SAME AS ACCEPTING \(H_0\)!!)

In this case, she wants to use a two-sample \(t-\)test. Assuming that the variability in the two groups is roughly the same, the test statistic is given by:

\[ t_{0}=\frac{\bar{y}_{B}-\bar{y}_{A}}{S_{p}\sqrt{\frac{1}{N_{A}}+\frac{1}{N_{B}}}}, \] where the pooled variance \(S^{2}_{p}\) is \[ S^{2}_{p}=\frac{(N_{A}-1)S^{2}_{A}+(N_{B}-1)S^{2}_{B}}{N_{A}+N_{B}-2}. \]

With her data, she obtains the following \(t-\)statistic:

# number of observations in each group
N.A = teaching.summary[1,2]
N.B = teaching.summary[2,2]
N=N.A+N.B

# sample mean score in each group
y.bar.A = teaching.summary[1,3]
y.bar.B = teaching.summary[2,3]

# sample variance of scores in each group
S2.A = teaching.summary[1,4]
S2.B = teaching.summary[2,4]

# sample pooled variance of scores
S2.P = ((N.A-1)*S2.A+(N.B-1)*S2.B)/(N.A+N.B-2)

# t-statistic
t0 = (y.bar.B - y.bar.A) / sqrt(S2.P*(1/N.A+1/N.B))
t0 
## [1] 7.019656

The test statistic value is \(t_{0} = 7.02\).

In order to reject or fail to reject the null hypothesis, she needs to compare it against the critical value of the Student \(T\) distribution with \(N-2=78\) degrees of freedom at significance level \(\alpha=0.05\), say.

# significance level
alpha=0.05

# this will give you a critical value on the wrong side of the distribution's mean
t.star = qt(alpha,N-2)
t.star
## [1] -1.664625
# this gives the correct critical value 
t.star = qt(alpha,N-2, lower.tail=FALSE)
t.star
## [1] 1.664625

The appropriate critical value is \[t^*= t_{1-\alpha, N-2}=t_{0.95, 78}=1.665.\]

Since \(t_{0} > t^*\) at \(\alpha=0.05\), she happily rejects the null hypothesis \(H_0:\mu_A\geq \mu_B\), which is to say that she has enough evidence to support the claim that the new teaching method is more effective than the traditional methods, at \(\alpha=0.05\).
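
For the record, the built-in t.test() reproduces this analysis in a single call; a sketch (note that R computes \(\bar{y}_A-\bar{y}_B\), so the reported statistic is \(-t_0\)):

# one-sided two-sample t-test with pooled variance
t.test(Grade ~ Group, data=teaching, alternative="less", var.equal=TRUE)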

7.3 Hydrocarbons (reprise)

Consider the following data, consisting of \(n=20\) paired measurements \((x_i,y_i)\) of hydrocarbon levels (\(x\)) and pure oxygen levels (\(y\)) in fuels:

x=c(0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
      1.19, 1.15, 0.98, 1.01, 1.11, 1.20, 1.26, 1.32, 1.43, 0.95)    
y=c(90.01, 89.05, 91.43, 93.74, 96.73, 94.45, 87.59, 91.77, 99.42, 93.65, 
      93.54, 92.52, 90.56, 89.54, 89.85, 90.39, 93.25, 93.41, 94.98, 87.33)

The line of best fit is given by the lm() function.

# build a data frame with the variables 
fuels=data.frame(x,y)

# run the linear regression
model <- lm(y ~ x, data=fuels)   

# display the model summary
summary(model) 
## 
## Call:
## lm(formula = y ~ x, data = fuels)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.83029 -0.73334  0.04497  0.69969  1.96809 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   74.283      1.593   46.62  < 2e-16 ***
## x             14.947      1.317   11.35 1.23e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.087 on 18 degrees of freedom
## Multiple R-squared:  0.8774, Adjusted R-squared:  0.8706 
## F-statistic: 128.9 on 1 and 18 DF,  p-value: 1.227e-09
# some quantities that will be useful
n= length(x)
Sxx=sum((x-mean(x))^2)
Syy=sum((y-mean(y))^2)
Sxy=sum((x-mean(x))*(y-mean(y)))
sigma2 = (Syy-as.numeric(model$coefficients[2])*Sxy)/(n-2)  

b0 = as.numeric(model$coefficients[1])  ### intercept
b1 = as.numeric(model$coefficients[2])  ### slope
beta00 = 75  ### for 1.
beta10 = 10  ### for 2.

We might be interested in testing whether the true intercept \(\beta_0\) is equal to some candidate value \(\beta_{0,0}\), i.e. \[H_0: \beta_0=\beta_{0,0}~~\mbox{against}~~ H_1:\beta_0\not= \beta_{0,0}.\]

The test statistic for this situation \[ t_0=\frac{b_0-\beta_{0,0}}{\sqrt{\hat{\sigma}^2\frac{\sum x_i^2}{nS_{xx}}}} \] follows a Student \(t-\)distribution with \(n-2\) degrees of freedom.

The critical/rejection region in this case is:

  • alternative hypothesis: \(H_1:\beta_0>\beta_{0,0}\); critical/rejection region: \(t_0>t_\alpha(n-2)\)

  • alternative hypothesis: \(H_1:\beta_0<\beta_{0,0}\); critical/rejection region: \(t_0<-t_\alpha(n-2)\)

  • alternative hypothesis: \(H_1:\beta_0\neq \beta_{0,0}\); critical/rejection region: \(|t_0|>t_{\alpha/2}(n-2)\)

The decision rule is: Reject \(H_0\) if \(t_0\) is in the critical region.


If instead we are interested in testing whether the true slope \(\beta_1\) is equal to some candidate value \(\beta_{1,0}\), i.e. \[H_0: \beta_1=\beta_{1,0}~~\mbox{against}~~ H_1:\beta_1\not= \beta_{1,0},\]

the test statistic for this situation is \[ t_0=\frac{b_1-\beta_{1,0}}{\sqrt{\hat{\sigma}^2/S_{xx}}} \] which follows a Student \(t-\)distribution with \(n-2\) degrees of freedom.

The critical/rejection region in this case is:

  • alternative hypothesis: \(H_1:\beta_1>\beta_{1,0}\); critical/rejection region: \(t_0>t_\alpha(n-2)\)

  • alternative hypothesis: \(H_1:\beta_1<\beta_{1,0}\); critical/rejection region: \(t_0<-t_\alpha(n-2)\)

  • alternative hypothesis: \(H_1:\beta_1\neq \beta_{1,0}\); critical/rejection region: \(|t_0|>t_{\alpha/2}(n-2)\)

The decision rule is: Reject \(H_0\) if \(t_0\) is in the critical region.

  1. Test for \(H_0:\beta_0=75\) against \(H_1:\beta_0<75\) at \(\alpha=0.05\).

Solution:

# test statistic
(t0a = (b0-beta00)/sqrt(sigma2*sum(x^2)/n/Sxx))  
## [1] -0.4497632
# critical value at alpha=0.05
(crit_t005_18a = qt(0.05,n-2))                   
## [1] -1.734064
# test for critical region
t0a < crit_t005_18a                            
## [1] FALSE

As the test statistic does not fall in the critical region (it is larger than the critical value \(-t_{0.05}(18)\)), we fail to reject \(H_0\).

  1. Test for \(H_0:\beta_1=10\) against \(H_1:\beta_1> 10\) at \(\alpha=0.05\).

Solution:

# test statistic
(t0b = (b1-beta10)/sqrt(sigma2/Sxx))
## [1] 3.757318
# critical value at alpha=0.05
(crit_t005_18b = - qt(0.05,n-2))
## [1] 1.734064
# test for critical region
t0b > crit_t005_18b
## [1] TRUE

As the test statistic is larger than the critical value, it falls in the critical region, and we reject \(H_0\).

  1. Test for \(H_0:\beta_1=0\) against \(H_1:\beta_1\neq 0\) at \(\alpha=0.05\).

Solution:

# test statistic
(t0c = b1/sqrt(sigma2/Sxx))
## [1] 11.35173
# critical value at alpha=0.05
(crit_t0025_18c = - qt(0.025,18))
## [1] 2.100922
# test for critical region
abs(t0c) > crit_t0025_18c
## [1] TRUE

As the test statistic is larger than the critical value, we reject \(H_0\); this matches the \(t\) value and \(p-\)value reported for x in the model summary above.
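
Equivalently, we could compute the \(p-\)values directly and compare them to \(\alpha=0.05\); a quick sketch re-using the quantities defined above:

# p-values for the three tests
pt(t0a, df=n-2)                     # H1: beta0 < 75 (left tail)
pt(t0b, df=n-2, lower.tail=FALSE)   # H1: beta1 > 10 (right tail)
2*pt(-abs(t0c), df=n-2)             # H1: beta1 != 0 (two-sided)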


8. MISC.

Here we give a pot-pourri of miscellaneous results.

8.1 Monty Hall Problem

Suppose you are on a game show, and you are given the choice of three doors. Behind one door is a prize; behind the others, dirty and smelly rubbish bins.

You pick a door, say No. 1, and the host, who knows what is behind the doors, opens another door, say No. 3, behind which is a bin. She then says to you, “Do you want to switch from door No. 1 to No. 2?” Is it to your advantage to do so?

Solution: this classic problem has left even the greatest minds gasping for air.

In what follows, let \(S\) and \(D\) represent the events that switching to another door is a successful strategy and that the prize is behind the original door, respectively.

We start by tackling a different scenario, in which the host DOES NOT open one of the two other doors after you have selected a door. What is the probability that switching to another door would prove to be a successful strategy?

  • If the prize is behind the original door, switching would succeed \(0\%\) of the time: \(P(S\mid D)=0\). Note that the prior is \(P(D)=1/3\).

  • If the prize is not behind the original door, switching would succeed \(50\%\) of the time: \(P(S \mid D^c)=1/2\). Note that the prior is \(P(D^c)=2/3\).

Thus, \[\begin{align*}P(S) = P(S\mid D) P(D) +P(S\mid D^c)P(D^c) =0\cdot \frac{1}{3}+\frac{1}{2}\cdot \frac{2}{3}\approx 33\%. \end{align*}\]

Does this make sense to you? There is no sense in switching because, in this scenario, switching is equivalent to having changed your mind about which door to select: the prize is behind one of the three doors, and whether you select the door originally or after having been offered the choice to switch, the probability that the prize is behind any door is just 1/3.

Now let’s return to the original scenario, in which the host opens one of the other two doors to show a rubbish bin. What is the probability that switching to another door in this scenario would prove to be a successful strategy?

  • If the prize is behind the original door, switching would still succeed \(0\%\) of the time: \(P(S\mid D)=0\). Note that the prior is \(P(D)=1/3\).

  • If the prize is not behind the original door, switching would succeed \(100\%\) of the time: \(P(S\mid D^c)=1\). Note that the prior is \(P(D^c)=2/3\).

Thus, \[\begin{align*}P(S) = P(S\mid D) P(D) +P(S\mid D^c)P(D^c) =0\cdot \frac{1}{3}+1\cdot \frac{2}{3}\approx 67\%. \end{align*}\]

Evidently, it makes sense to switch in the original scenario. This is counter-intuitive to a lot of people: the prize has a probability 1/3 of being behind any of the 3 doors, so why is the probability not 1/3, as described above? Isn’t this just the same problem?

Well, something has changed: the information in your possession changes once the host opens one of the two non-prize doors. You now know the location of one of the bins with \(100\%\) certainty.

Do you find this explanation satisfying?

Perhaps you would rather see what happens in practice: if you could pit two players against one another (one who never switches, and one who always does) in a series of Monty Hall games, which one would come out on top in the long run?

# number of games (if N is too small, we don't have long-run behaviour)
N=500 

# fixing the seed for replicability purposes
set.seed(1234) 

# placing the prize behind one of the 3 doors for each game
locations = sample(c("A","B","C"), N, replace = TRUE) 

# verification that the prize gets placed behind each door roughly 33% of the time
table(locations)/N 
## locations
##     A     B     C 
## 0.302 0.344 0.354
# getting the player to select a door for each of the N games (in practice, this choice is independent of where the prize actually is)
player.guesses = sample(c("A","B","C"), N, replace = TRUE) 

# create a data frame telling the analyst where the prize is, and what door the player has selected
games = data.frame(locations, player.guesses) 
head(games)
##   locations player.guesses
## 1         B              B
## 2         B              B
## 3         A              B
## 4         C              C
## 5         A              C
## 6         A              A
# how often does the player guess correctly, before a door is opened (should be about 33%)
table(games$locations==games$player.guesses) 
## 
## FALSE  TRUE 
##   333   167
# initialize the process to find out which door the host will open
games$open.door <- NA

# for each game, the host opens a door which is not the one selected by the player, nor the one behind which the prize is found 
# the union() call enumerates the doors that the host cannot open
# the setdiff() call finds the complement of the doors that the host cannot open (i.e.: the doors that she can open)
# the sample() call picks one of those doors
for(j in 1:N){
  games$open.door[j] <- sample(setdiff(c("A","B","C"), union(games$locations[j], games$player.guesses[j])), 1)
}

head(games)
##   locations player.guesses open.door
## 1         B              B         C
## 2         B              B         C
## 3         A              B         C
## 4         C              C         A
## 5         A              C         B
## 6         A              A         B
# if the player never switches, they win whenever they had originally guessed the location of the prize correctly
games$no.switch.win <- games$player.guesses==games$locations

# let's find which door the player would have selected if they always switched (the door that is neither the one the host opened nor the one they had originally selected)
games$switch.door <- NA

for(j in 1:N){
  games$switch.door[j] <- sample(setdiff(c("A","B","C"), union(games$open.door[j], games$player.guesses[j])), 1)
}

# if the player always switches, they win whenever their switched guess matches where the prize is located
games$switch.win <- games$switch.door==games$locations

head(games)
##   locations player.guesses open.door no.switch.win switch.door switch.win
## 1         B              B         C          TRUE           A      FALSE
## 2         B              B         C          TRUE           A      FALSE
## 3         A              B         C         FALSE           A       TRUE
## 4         C              C         A          TRUE           B      FALSE
## 5         A              C         B         FALSE           A       TRUE
## 6         A              A         B          TRUE           C      FALSE
# chance of winning by not switching
table(games$no.switch.win)/N
## 
## FALSE  TRUE 
## 0.666 0.334
# chance of winning by switching
table(games$switch.win)/N
## 
## FALSE  TRUE 
## 0.334 0.666

Pretty wild, eh?