In this notebook, we show how to produce more sophisticated charts in R using lattice, ggplot2, and the tidyverse.

As always, do not hesitate to consult online resources for more examples (StackOverflow, R-bloggers, etc.).

TUTORIAL OUTLINE

  1. Titanic Dataset
  2. Census PUMS Data
  3. Gapminder
  4. Minard’s March to Moscow
  5. Snow’s Cholera Outbreak Map

1. TITANIC DATASET

We will explore a dataset related to the Titanic. It can be found online (see URL below) as a comma-separated text file. We use read.table() to read the dataset into a data frame.

titanic.url <- "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt"
titanic <- read.table(titanic.url,  sep=",", header=TRUE)
str(titanic)
## 'data.frame':    1313 obs. of  11 variables:
##  $ row.names: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ pclass   : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
##  $ survived : int  1 0 0 0 1 1 1 0 1 0 ...
##  $ name     : Factor w/ 1310 levels "Abbing, Mr Anthony",..: 22 25 26 27 24 31 45 46 50 54 ...
##  $ age      : num  29 2 30 25 0.917 ...
##  $ embarked : Factor w/ 4 levels "","Cherbourg",..: 4 4 4 4 4 4 4 4 4 2 ...
##  $ home.dest: Factor w/ 372 levels "","?Havana, Cuba",..: 312 233 233 233 233 239 164 26 24 231 ...
##  $ room     : Factor w/ 54 levels "","2131","A-11",..: 13 42 42 42 41 51 48 6 19 1 ...
##  $ ticket   : Factor w/ 42 levels ""," ","111361 L57 19s 7d",..: 31 1 1 1 1 1 9 1 1 1 ...
##  $ boat     : Factor w/ 100 levels "","(101)","(103)",..: 88 1 13 1 80 89 79 1 88 35 ...
##  $ sex      : Factor w/ 2 levels "female","male": 1 1 2 1 2 2 1 2 1 2 ...

It contains 1313 observations and 11 variables (the first of which is a row identifier). We see that most variables are factors (categorical variables).

Judging by their names, we might expect some of these variables (name and home.dest, for instance) to be character variables instead.

head(titanic)
##   row.names pclass survived
## 1         1    1st        1
## 2         2    1st        0
## 3         3    1st        0
## 4         4    1st        0
## 5         5    1st        1
## 6         6    1st        1
##                                              name     age    embarked
## 1                    Allen, Miss Elisabeth Walton 29.0000 Southampton
## 2                     Allison, Miss Helen Loraine  2.0000 Southampton
## 3             Allison, Mr Hudson Joshua Creighton 30.0000 Southampton
## 4 Allison, Mrs Hudson J.C. (Bessie Waldo Daniels) 25.0000 Southampton
## 5                   Allison, Master Hudson Trevor  0.9167 Southampton
## 6                              Anderson, Mr Harry 47.0000 Southampton
##                         home.dest room     ticket  boat    sex
## 1                    St Louis, MO  B-5 24160 L221     2 female
## 2 Montreal, PQ / Chesterville, ON  C26                  female
## 3 Montreal, PQ / Chesterville, ON  C26            (135)   male
## 4 Montreal, PQ / Chesterville, ON  C26                  female
## 5 Montreal, PQ / Chesterville, ON  C22               11   male
## 6                    New York, NY E-12                3   male

We can remedy the situation by specifying that the first column corresponds to the row number, and by making sure that strings do not get imported as factors.

titanic <- read.table( file=titanic.url, sep=",", header=TRUE, stringsAsFactors=FALSE, row.names=1)
# stringsAsFactors=FALSE to keep strings as strings.
# row.names=1 tells read.table that column 1 contains the row names.
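The effect of these two arguments can be checked in isolation, without downloading anything. The following is a minimal sketch that assumes a hypothetical three-row extract (csv.text and toy are illustrative names, not part of the original notebook) laid out like the first few columns of the file.

```r
# Hypothetical three-row extract in the same comma-separated layout as the file.
csv.text <- 'row.names,pclass,survived,name,age
1,1st,1,"Allen, Miss Elisabeth Walton",29
2,1st,0,"Allison, Miss Helen Loraine",2
3,1st,0,"Allison, Mr Hudson Joshua Creighton",30'

# text= makes read.table() parse the string instead of a file or URL.
toy <- read.table(text=csv.text, sep=",", header=TRUE,
                  stringsAsFactors=FALSE, row.names=1)

sapply(toy, class)   # name and pclass stay character; no factors are created
rownames(toy)        # the first column is used as row names, not as a variable
```

The quoted names contain commas, which read.table() handles because double quotes are among its default quoting characters.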

We do not need to call str() to see the type for each of the variables; we can use sapply() instead.

sapply(titanic, class)
##      pclass    survived        name         age    embarked   home.dest 
## "character"   "integer" "character"   "numeric" "character" "character" 
##        room      ticket        boat         sex 
## "character" "character" "character" "character"

Now the problem is that some of the variables that should have been (ordered) factors or logicals are now character strings.

We will fix that issue by using the transform() function.

titanic <- transform(titanic, 
                     pclass = ordered(pclass, levels=c("3rd", "2nd", "1st")),
                     survived = as.logical(survived), 
                     embarked = as.factor(embarked), 
                     sex = as.factor(sex))

sapply(titanic, class)
## $pclass
## [1] "ordered" "factor" 
## 
## $survived
## [1] "logical"
## 
## $name
## [1] "character"
## 
## $age
## [1] "numeric"
## 
## $embarked
## [1] "factor"
## 
## $home.dest
## [1] "character"
## 
## $room
## [1] "character"
## 
## $ticket
## [1] "character"
## 
## $boat
## [1] "character"
## 
## $sex
## [1] "factor"
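To see what these conversions buy us, here is a small sketch on a hypothetical toy vector (pclass.chr and pclass.ord are illustrative names; the values are not taken from the dataset). An ordered factor carries the level order we specify, so comparisons and tabulations respect it.

```r
# Hypothetical toy vector of passenger classes (not the real data).
pclass.chr <- c("1st", "3rd", "2nd", "3rd", "1st")
pclass.ord <- ordered(pclass.chr, levels=c("3rd", "2nd", "1st"))

levels(pclass.ord)       # levels appear in the order we specified
pclass.ord > "3rd"       # comparisons follow the level order, not the alphabet
table(pclass.ord)        # counts are reported in level order

as.logical(c(1, 0, 0))   # 1/0 survival codes become TRUE/FALSE
```

Listing the levels from "3rd" to "1st" is a choice; it simply means that "1st" is treated as the highest class in comparisons and appears last in tables and charts.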

The dataset can be summarized using the summary() function.

summary(titanic)
##  pclass     survived           name                age         
##  3rd:711   Mode :logical   Length:1313        Min.   : 0.1667  
##  2nd:280   FALSE:864       Class :character   1st Qu.:21.0000  
##  1st:322   TRUE :449       Mode  :character   Median :30.0000  
##                                               Mean   :31.1942  
##                                               3rd Qu.:41.0000  
##                                               Max.   :71.0000  
##                                               NA's   :680      
##         embarked    home.dest             room          
##             :492   Length:1313        Length:1313       
##  Cherbourg  :203   Class :character   Class :character  
##  Queenstown : 45   Mode  :character   Mode  :character  
##  Southampton:573                                        
##                                                         
##                                                         
##                                                         
##     ticket              boat               sex     
##  Length:1313        Length:1313        female:463  
##  Class :character   Class :character   male  :850  
##  Mode  :character   Mode  :character               
##                                                    
##                                                    
##                                                    
## 

We notice, among other things, that a number of age observations are missing (680 NA's), and that a fair number of passengers (492) do not have a port of embarkation.
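The two kinds of gaps are stored differently: missing ages are NA, whereas missing ports are empty strings. The sketch below counts both on a hypothetical toy data frame (toy and its values are made up for illustration); the same calls could be applied to titanic.

```r
# Hypothetical toy data frame mimicking two columns of titanic.
toy <- data.frame(age      = c(29, NA, 30, NA, 0.9167),
                  embarked = c("Southampton", "", "Cherbourg", "", ""),
                  stringsAsFactors = FALSE)

colSums(is.na(toy))        # NA count per column; blanks are not counted
sum(toy$embarked == "")    # empty strings need their own check

# Optionally recode blanks as NA so that both kinds of gaps are treated alike.
toy$embarked[toy$embarked == ""] <- NA
colSums(is.na(toy))
```

Whether to recode blanks as NA is a judgment call; it changes how functions such as table() and summary() report the column.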

What can we say about this dataset as a whole?

First, build a contingency table of survival by passenger class.

(pclass.survived <- with(titanic, table(pclass, survived)))
##       survived
## pclass FALSE TRUE
##    3rd   574  137
##    2nd   161  119
##    1st   129  193

It certainly seems as though the passenger class was linked to survival (at least, at first glance). A barplot can help illustrate the relationship.

barplot(pclass.survived, beside=TRUE, legend.text=levels(titanic$pclass))

Splitting along the survived (TRUE) / did not survive (FALSE) axis can muddle the situation to some extent. Of course there were more 3rd class passengers who didn’t survive: there were more 3rd class passengers, period.

Compare with the following chart.

library(lattice)
barchart(pclass.survived, stack=FALSE, auto.key=TRUE)

The difference in the proportion of survivors from one passenger class to the next is striking.
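The proportions can also be computed directly. This sketch rebuilds the contingency table from the counts printed earlier (so that it runs on its own) and uses prop.table() to convert each row to rates; survival.rates is an illustrative name.

```r
# Counts copied from the contingency table printed above
# (filled column by column: all FALSE counts, then all TRUE counts).
pclass.survived <- as.table(matrix(
  c(574, 161, 129, 137, 119, 193), nrow=3,
  dimnames=list(pclass=c("3rd", "2nd", "1st"),
                survived=c("FALSE", "TRUE"))))

# margin=1 divides each row by its total, giving rates within each class.
survival.rates <- prop.table(pclass.survived, margin=1)
round(survival.rates, 3)
```

The TRUE column works out to roughly 19% for 3rd class, 42.5% for 2nd class, and 60% for 1st class.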

Here is another way to display the same information.

barchart(t(pclass.survived), groups=FALSE, layout=c(3,1), horizontal=FALSE)

Visualization experts claim that some visual dimensions are more efficient than others for comparing quantities, with length usually ranked ahead of angle.

Let’s see if that seems reasonable, by comparing how length and angles do the job.

(pass.class <- table(titanic$pclass))
## 
## 3rd 2nd 1st 
## 711 280 322
par(mfrow=c(1,2))   # two charts side by side
pie(pass.class)     # angle
barplot(pass.class) # length

What do you think? Is the much-maligned pie chart THAT bad?

Let’s see what things look like when we combine 3 variables instead of 2.

(sv.pc.sx <- with(titanic, table(survived, pclass, sex)))
## , , sex = female
## 
##         pclass
## survived 3rd 2nd 1st
##    FALSE 134  13   9
##    TRUE   79  94 134
## 
## , , sex = male
## 
##         pclass
## survived 3rd 2nd 1st
##    FALSE 440 148 120
##    TRUE   58  25  59

As a start, the table of frequencies is a tad harder to read/interpret.

But we can get a good sense of the underlying data distributions with a bar chart.

barchart(sv.pc.sx, horizontal=FALSE, stack=FALSE, layout=c(3,1), auto.key=list(columns=2))
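The three-way table itself can be made easier to read by flattening it. The sketch below rebuilds the table from the counts printed above (so that it runs on its own), flattens it with ftable(), and computes the survival proportion within each class and sex combination; rates is an illustrative name.

```r
# Counts copied from the three-way table printed above.
sv.pc.sx <- as.table(array(
  c(134, 79, 13, 94, 9, 134,      # female: (FALSE, TRUE) for 3rd, 2nd, 1st
    440, 58, 148, 25, 120, 59),   # male:   (FALSE, TRUE) for 3rd, 2nd, 1st
  dim=c(2, 3, 2),
  dimnames=list(survived=c("FALSE", "TRUE"),
                pclass=c("3rd", "2nd", "1st"),
                sex=c("female", "male"))))

# One row per (sex, pclass) combination, one column per survival status.
ftable(sv.pc.sx, row.vars=c("sex", "pclass"))

# Survival proportions within each (pclass, sex) cell.
rates <- prop.table(sv.pc.sx, margin=c(2, 3))["TRUE", , ]
round(rates, 2)
```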

Histograms, density plots, and dot plots can also provide an idea of the underlying distributions; lattice implements them as histogram(), densityplot(), and dotplot().

Here are histograms of age by survival status:

histogram(~age | survived, data=titanic)
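The same formula interface works for density plots. The sketch below uses simulated ages and survival flags (sim is a made-up data frame, for illustration only, and not the Titanic data) so that it runs without the download; replacing data=sim with data=titanic should give the corresponding plots for the real dataset.

```r
library(lattice)

# Simulated ages and survival flags, for illustration only.
set.seed(1)
sim <- data.frame(age      = round(runif(200, min=1, max=70)),
                  survived = sample(c(FALSE, TRUE), 200, replace=TRUE))

# One panel per survival status, as in the histogram call.
p.panels <- densityplot(~age | survived, data=sim, plot.points=FALSE)
print(p.panels)

# Alternatively, overlay both curves in a single panel for direct comparison.
p.overlay <- densityplot(~age, groups=survived, data=sim,
                         plot.points=FALSE, auto.key=TRUE)
print(p.overlay)
```

Conditioning with | gives one panel per group, while groups= draws the curves on a common scale, which tends to make differences between the two distributions easier to spot.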