In this notebook, we will show how you can use tidy data and various tidyverse functions to simplify data manipulation in R.

Do not hesitate to consult online resources for more examples (StackOverflow, R-bloggers, etc.). In particular, you should strongly consider bookmarking Wickham and Grolemund’s R for Data Science.

TUTORIAL OUTLINE

  1. Initialization
  2. Pipeline Operator %>%
  3. Tidy Data
  4. The dplyr Package
  5. Examples

1. INITIALIZATION

This section is based on Data Wrangling with R: How to work with the structures of your data by G. Grolemund (slides available at: bit.ly/wrangling-webinar).

We load the tidyr and dplyr packages, which are fundamental building blocks of the tidyverse.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Some of the datasets we will use can be found in the EDAWR library (they’ve been added to the Data folder in case the current version of the library is not compatible with your installed version of R).

# uncomment as needed
# library(EDAWR) 

In particular, we will work with the following datasets:

str(storms)
## tibble [10,010 × 13] (S3: tbl_df/tbl/data.frame)
##  $ name       : chr [1:10010] "Amy" "Amy" "Amy" "Amy" ...
##  $ year       : num [1:10010] 1975 1975 1975 1975 1975 ...
##  $ month      : num [1:10010] 6 6 6 6 6 6 6 6 6 6 ...
##  $ day        : int [1:10010] 27 27 27 27 28 28 28 28 29 29 ...
##  $ hour       : num [1:10010] 0 6 12 18 0 6 12 18 0 6 ...
##  $ lat        : num [1:10010] 27.5 28.5 29.5 30.5 31.5 32.4 33.3 34 34.4 34 ...
##  $ long       : num [1:10010] -79 -79 -79 -79 -78.8 -78.7 -78 -77 -75.8 -74.8 ...
##  $ status     : chr [1:10010] "tropical depression" "tropical depression" "tropical depression" "tropical depression" ...
##  $ category   : Ord.factor w/ 7 levels "-1"<"0"<"1"<"2"<..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ wind       : int [1:10010] 25 25 25 25 25 25 25 30 35 40 ...
##  $ pressure   : int [1:10010] 1013 1013 1013 1013 1012 1012 1011 1006 1004 1002 ...
##  $ ts_diameter: num [1:10010] NA NA NA NA NA NA NA NA NA NA ...
##  $ hu_diameter: num [1:10010] NA NA NA NA NA NA NA NA NA NA ...
cases <- read.csv("Data/cases.csv")
cases
##   country X2011 X2012 X2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000
pollution <- read.csv("Data/pollution.csv")
pollution
##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56
tb <- read.csv("Data/tb.csv")
str(tb)
## 'data.frame':    3800 obs. of  6 variables:
##  $ country: Factor w/ 100 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year   : int  1995 1995 1996 1996 1997 1997 1998 1998 1999 1999 ...
##  $ sex    : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ child  : int  NA NA NA NA 5 0 45 30 25 8 ...
##  $ adult  : int  NA NA NA NA 96 26 1142 500 484 212 ...
##  $ elderly: int  NA NA NA NA 1 0 20 41 8 8 ...
head(tb)
##       country year    sex child adult elderly
## 1 Afghanistan 1995 female    NA    NA      NA
## 2 Afghanistan 1995   male    NA    NA      NA
## 3 Afghanistan 1996 female    NA    NA      NA
## 4 Afghanistan 1996   male    NA    NA      NA
## 5 Afghanistan 1997 female     5    96       1
## 6 Afghanistan 1997   male     0    26       0


2. PIPELINE OPERATOR %>%

R is a functional language, which means function calls often end up deeply nested in parentheses, making code difficult to read. The pipeline operator %>% and the package dplyr can be used to remedy the situation.

Hadley Wickham provided an example in 2014 to illustrate how it works (don’t run the next bit of code, it won’t execute):

# hourly_delay <- filter(
#   summarise(
#     group_by( 
#       filter(
#         flights, 
#         !is.na(dep_delay)
#       ), 
#       date, hour
#     ), 
#     delay = mean(dep_delay), 
#     n = n()
#   ), 
#   n > 10 
# )

Take some time to figure out what is supposed to be happening here.

The pipeline operator %>% eschews nesting function calls in favor of passing data from one function to the next (also won’t run):

# hourly_delay <- flights %>% 
#    filter(!is.na(dep_delay)) %>% 
#    group_by(date, hour) %>% 
#    summarise(delay = mean(dep_delay),n = n()) %>% 
#    filter(n > 10)

The beauty of this approach is that it can be ‘read’ aloud to discover what the block of code is meant to do.

The flights data frame is

1. filtered (to remove missing values of the dep_delay variable),
2. grouped by hour within date,
3. summarised within groups (the mean delay and the number of flights n), and
4. filtered again so that only hours with more than 10 flights are returned.

The pipeline rules are simple: the object on the left hand side is passed as the first argument to the function on the right hand side.

  • data %>% function is the same as function(data)
  • data %>% function(arg=value) is the same as function(data, arg=value)
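
These rules can be verified on a toy vector (a small sketch; it assumes dplyr, which re-exports %>% from magrittr, has been loaded):

```r
library(dplyr)  # provides %>%

x <- c(1, 4, 9, 16)

# data %>% function is the same as function(data)
sqrt(x)
x %>% sqrt()       # identical result

# data %>% function(arg=value) is the same as function(data, arg=value)
round(mean(x), digits = 1)
x %>% mean() %>% round(digits = 1)
```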

Reference: https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html


3. TIDY DATA

Tidy data has a specific structure:

  • each variable is a column
  • each observation is a row
  • each type of observational unit is a table

There are two functions in package tidyr to reshape tables to a tidy format: gather() and spread().

gather() requires:

  • a data frame to reshape,
  • a key column (against which to reshape),
  • a value column (which will contain the new variable of interest), and
  • the indices of the columns that need to be collapsed.

For instance, in a tidy format, the cases dataset would look like:

gather(cases,"year","n",2:4)
##   country  year     n
## 1      FR X2011  7000
## 2      DE X2011  5800
## 3      US X2011 15000
## 4      FR X2012  6900
## 5      DE X2012  6000
## 6      US X2012 14000
## 7      FR X2013  7000
## 8      DE X2013  6200
## 9      US X2013 13000

spread(), on the other hand, generates multiple columns from two columns; it requires:

  • a data frame to reshape,
  • a key column (whose values become the new column names), and
  • a value column (whose values fill in the new columns).

For instance, in a tidy format, the pollution dataset would look like:

spread(pollution,size,amount)
##       city large small
## 1  Beijing   121    56
## 2   London    22    16
## 3 New York    23    14

gather() and spread() are inverses of one another. Other useful wrangling functions include separate() and unite(). What do you think these do?
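
As a hint, here is a small sketch on a made-up data frame (storm_info is illustrative, not one of the tutorial datasets): separate() splits one column into several, and unite() glues columns back together.

```r
library(tidyr)

storm_info <- data.frame(storm = c("Amy-1975", "Bonnie-1986"))

# separate() splits a column into several along a separator
(split <- separate(storm_info, storm, into = c("name", "year"), sep = "-"))

# unite() is its inverse: it pastes columns back together
unite(split, "storm", name, year, sep = "-")
```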


4. THE dplyr PACKAGE

The dplyr package is useful when it comes to transforming tabular data, and is compatible with the pipeline operator %>%. Its most useful functions are:

  • select(): to extract a subset of variables from the data frame
  • filter(): to extract a subset of observations from the data frame
  • arrange(): to sort the data frame
  • mutate(): to create new variables from existing variables
  • summarise(): to create so-called pivot tables
  • group_by(): … self-evident?

We will showcase these functions with the help of various examples. Try to guess what the outputs would be before looking at them.

storms.2 <- select(storms, name,pressure)
storms.3 <- select(storms, -name)
storms.4 <- select(storms, lat:pressure)
head(storms)
## # A tibble: 6 x 13
##   name   year month   day  hour   lat  long status category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropi… -1          25     1013
## 2 Amy    1975     6    27     6  28.5 -79   tropi… -1          25     1013
## 3 Amy    1975     6    27    12  29.5 -79   tropi… -1          25     1013
## 4 Amy    1975     6    27    18  30.5 -79   tropi… -1          25     1013
## 5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1          25     1012
## 6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1          25     1012
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
head(storms.2)
## # A tibble: 6 x 2
##   name  pressure
##   <chr>    <int>
## 1 Amy       1013
## 2 Amy       1013
## 3 Amy       1013
## 4 Amy       1013
## 5 Amy       1012
## 6 Amy       1012
head(storms.3)
## # A tibble: 6 x 12
##    year month   day  hour   lat  long status category  wind pressure
##   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
## 1  1975     6    27     0  27.5 -79   tropi… -1          25     1013
## 2  1975     6    27     6  28.5 -79   tropi… -1          25     1013
## 3  1975     6    27    12  29.5 -79   tropi… -1          25     1013
## 4  1975     6    27    18  30.5 -79   tropi… -1          25     1013
## 5  1975     6    28     0  31.5 -78.8 tropi… -1          25     1012
## 6  1975     6    28     6  32.4 -78.7 tropi… -1          25     1012
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
head(storms.4)
## # A tibble: 6 x 6
##     lat  long status              category  wind pressure
##   <dbl> <dbl> <chr>               <ord>    <int>    <int>
## 1  27.5 -79   tropical depression -1          25     1013
## 2  28.5 -79   tropical depression -1          25     1013
## 3  29.5 -79   tropical depression -1          25     1013
## 4  30.5 -79   tropical depression -1          25     1013
## 5  31.5 -78.8 tropical depression -1          25     1012
## 6  32.4 -78.7 tropical depression -1          25     1012
storms.5 <- filter(storms, wind>40)
storms.6 <- filter(storms, wind>40, name %in% c("Amy", "Alberto", "Alexis", "Allison"))
nrow(storms)
## [1] 10010
nrow(storms.5)
## [1] 5738
nrow(storms.6)
## [1] 123
storms.7 <- mutate(storms, quotient = wind/pressure)
storms.8 <- mutate(storms, quotient = wind/pressure, inv = 1/quotient)
head(storms.7)
## # A tibble: 6 x 14
##   name   year month   day  hour   lat  long status category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropi… -1          25     1013
## 2 Amy    1975     6    27     6  28.5 -79   tropi… -1          25     1013
## 3 Amy    1975     6    27    12  29.5 -79   tropi… -1          25     1013
## 4 Amy    1975     6    27    18  30.5 -79   tropi… -1          25     1013
## 5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1          25     1012
## 6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1          25     1012
## # … with 3 more variables: ts_diameter <dbl>, hu_diameter <dbl>,
## #   quotient <dbl>
head(storms.8)
## # A tibble: 6 x 15
##   name   year month   day  hour   lat  long status category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropi… -1          25     1013
## 2 Amy    1975     6    27     6  28.5 -79   tropi… -1          25     1013
## 3 Amy    1975     6    27    12  29.5 -79   tropi… -1          25     1013
## 4 Amy    1975     6    27    18  30.5 -79   tropi… -1          25     1013
## 5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1          25     1012
## 6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1          25     1012
## # … with 4 more variables: ts_diameter <dbl>, hu_diameter <dbl>,
## #   quotient <dbl>, inv <dbl>
pollution %>% summarise(median=median(amount), variance=var(amount))
##   median variance
## 1   22.5   1731.6
pollution %>% summarise(mean=mean(amount), sum=sum(amount), n=n()) # n() doesn't take arguments
##   mean sum n
## 1   42 252 6
storms.9 <- arrange(storms, wind)
storms.10 <- arrange(storms, desc(wind))
storms.11 <- arrange(storms, wind, desc(year),month,day)
head(storms.9)
## # A tibble: 6 x 13
##   name   year month   day  hour   lat  long status category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
## 1 Bonn…  1986     6    28     6  36.5 -91.3 tropi… -1          10     1013
## 2 Bonn…  1986     6    28    12  37.2 -90   tropi… -1          10     1012
## 3 AL03…  1987     8    16    18  30.9 -83.2 tropi… -1          10     1014
## 4 AL03…  1987     8    17     0  31.4 -82.9 tropi… -1          10     1015
## 5 AL03…  1987     8    17     6  31.8 -82.3 tropi… -1          10     1015
## 6 Albe…  1994     7     7     0  32.7 -86.3 tropi… -1          10     1012
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
head(storms.10)
## # A tibble: 6 x 13
##   name   year month   day  hour   lat  long status category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
## 1 Gilb…  1988     9    14     0  19.7 -83.8 hurri… 5          160      888
## 2 Wilma  2005    10    19    12  17.3 -82.8 hurri… 5          160      882
## 3 Gilb…  1988     9    14     6  19.9 -85.3 hurri… 5          155      889
## 4 Mitch  1998    10    26    18  16.9 -83.1 hurri… 5          155      905
## 5 Mitch  1998    10    27     0  17.2 -83.8 hurri… 5          155      910
## 6 Rita   2005     9    22     3  24.7 -87.3 hurri… 5          155      895
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
head(storms.11)
## # A tibble: 6 x 13
##   name   year month   day  hour   lat  long status category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
## 1 Albe…  1994     7     7     0  32.7 -86.3 tropi… -1          10     1012
## 2 Albe…  1994     7     7     6  32.7 -86.6 tropi… -1          10     1012
## 3 Albe…  1994     7     7    12  32.8 -86.8 tropi… -1          10     1012
## 4 Albe…  1994     7     7    18  33   -87   tropi… -1          10     1013
## 5 AL03…  1987     8    16    18  30.9 -83.2 tropi… -1          10     1014
## 6 AL03…  1987     8    17     0  31.4 -82.9 tropi… -1          10     1015
## # … with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
storms %>% select(name, pressure)
## # A tibble: 10,010 x 2
##    name  pressure
##    <chr>    <int>
##  1 Amy       1013
##  2 Amy       1013
##  3 Amy       1013
##  4 Amy       1013
##  5 Amy       1012
##  6 Amy       1012
##  7 Amy       1011
##  8 Amy       1006
##  9 Amy       1004
## 10 Amy       1002
## # … with 10,000 more rows
storms %>% filter(wind>40)
## # A tibble: 5,738 x 13
##    name   year month   day  hour   lat  long status category  wind pressure
##    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
##  1 Amy    1975     6    29    12  33.8 -73.8 tropi… 0           45     1000
##  2 Amy    1975     6    29    18  33.8 -72.8 tropi… 0           50      998
##  3 Amy    1975     6    30     0  34.3 -71.6 tropi… 0           50      998
##  4 Amy    1975     6    30     6  35.6 -70.8 tropi… 0           55      998
##  5 Amy    1975     6    30    12  35.9 -70.5 tropi… 0           60      987
##  6 Amy    1975     6    30    18  36.2 -70.2 tropi… 0           60      987
##  7 Amy    1975     7     1     0  36.2 -69.8 tropi… 0           60      984
##  8 Amy    1975     7     1     6  36.2 -69.4 tropi… 0           60      984
##  9 Amy    1975     7     1    12  36.2 -68.3 tropi… 0           60      984
## 10 Amy    1975     7     1    18  36.7 -67.2 tropi… 0           60      984
## # … with 5,728 more rows, and 2 more variables: ts_diameter <dbl>,
## #   hu_diameter <dbl>
storms %>% filter(wind>40) %>% select(name,pressure)
## # A tibble: 5,738 x 2
##    name  pressure
##    <chr>    <int>
##  1 Amy       1000
##  2 Amy        998
##  3 Amy        998
##  4 Amy        998
##  5 Amy        987
##  6 Amy        987
##  7 Amy        984
##  8 Amy        984
##  9 Amy        984
## 10 Amy        984
## # … with 5,728 more rows
tb %>% group_by(country, year) %>% summarise(cases = sum(child,adult,elderly, na.rm=TRUE))
## # A tibble: 1,900 x 3
## # Groups:   country [100]
##    country      year cases
##    <fct>       <int> <int>
##  1 Afghanistan  1995     0
##  2 Afghanistan  1996     0
##  3 Afghanistan  1997   128
##  4 Afghanistan  1998  1778
##  5 Afghanistan  1999   745
##  6 Afghanistan  2000  2666
##  7 Afghanistan  2001  4639
##  8 Afghanistan  2002  6509
##  9 Afghanistan  2003  6528
## 10 Afghanistan  2004  8245
## # … with 1,890 more rows
storms %>% mutate(quotient = wind / pressure) %>% select(name, quotient) 
## # A tibble: 10,010 x 2
##    name  quotient
##    <chr>    <dbl>
##  1 Amy     0.0247
##  2 Amy     0.0247
##  3 Amy     0.0247
##  4 Amy     0.0247
##  5 Amy     0.0247
##  6 Amy     0.0247
##  7 Amy     0.0247
##  8 Amy     0.0298
##  9 Amy     0.0349
## 10 Amy     0.0399
## # … with 10,000 more rows
storms %>% group_by(name) %>% summarise(year = max(year), mean_wind = mean(wind, na.rm=TRUE), mean_pressure = mean(pressure, na.rm=TRUE), mean_ts_diameter = mean(ts_diameter, na.rm=TRUE), mean_hu_diameter = mean(hu_diameter, na.rm=TRUE)) %>% arrange(year,name)
## # A tibble: 198 x 6
##    name      year mean_wind mean_pressure mean_ts_diameter mean_hu_diameter
##    <chr>    <dbl>     <dbl>         <dbl>            <dbl>            <dbl>
##  1 Amy       1975      46.5          995.              NaN              NaN
##  2 Caroline  1975      38.9         1002.              NaN              NaN
##  3 Doris     1975      73.7          983.              NaN              NaN
##  4 Belle     1976      68.1          982.              NaN              NaN
##  5 Anita     1977      72            982.              NaN              NaN
##  6 Clara     1977      40           1005.              NaN              NaN
##  7 Evelyn    1977      51.1         1001.              NaN              NaN
##  8 Amelia    1978      34.2         1008.              NaN              NaN
##  9 Bess      1978      33.5         1009.              NaN              NaN
## 10 Cora      1978      47.6         1002.              NaN              NaN
## # … with 188 more rows
tb %>% group_by(country, year) %>% summarise(cases = sum(child,adult,elderly, na.rm = TRUE)) %>% summarise(cases = sum(cases))
## # A tibble: 100 x 2
##    country                            cases
##    <fct>                              <int>
##  1 Afghanistan                       140225
##  2 Algeria                           128119
##  3 Angola                            308365
##  4 Argentina                         117156
##  5 Azerbaijan                         29965
##  6 Bangladesh                       1524034
##  7 Belarus                            37185
##  8 Benin                              48821
##  9 Bolivia (Plurinational State of)  122555
## 10 Botswana                           71470
## # … with 90 more rows
tb %>% group_by(country, year) %>% summarise(cases = sum(child,adult,elderly, na.rm = TRUE)) %>% summarise(cases = sum(cases)) %>% summarise(cases = sum(cases))
## # A tibble: 1 x 1
##      cases
##      <int>
## 1 42718969

dplyr also comes with “database” functionality (bind_cols(), bind_rows(), union(), intersect(), setdiff(), left_join(), inner_join(), semi_join(), anti_join(), etc.). Consult Grolemund’s slides or the dplyr documentation for more details (a cheatsheet is also available).
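
To give a flavour of the join functions, here is a small sketch on two made-up tables (storms_a and storms_b are illustrative, not tutorial datasets):

```r
library(dplyr)

storms_a <- data.frame(name = c("Amy", "Bonnie", "Doris"),
                       year = c(1975, 1986, 1975))
storms_b <- data.frame(name = c("Amy", "Doris", "Gilbert"),
                       max_wind = c(60, 105, 160))

left_join(storms_a, storms_b, by = "name")   # all rows of storms_a; NA where unmatched
inner_join(storms_a, storms_b, by = "name")  # only names present in both tables
anti_join(storms_a, storms_b, by = "name")   # rows of storms_a with no match in storms_b
```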


5. EXAMPLES

  1. What would the following dataset look like in a tidy format?
data = data.frame(cbind(c("Alex","Alex","Allison","Allison","Bobbie","Bobbie"),
                        c("wind","pressure","wind","pressure","wind","pressure")))

data = cbind(data,c(68,130,55,121,72,118))

colnames(data)=c("storm","stat","value")
data
##     storm     stat value
## 1    Alex     wind    68
## 2    Alex pressure   130
## 3 Allison     wind    55
## 4 Allison pressure   121
## 5  Bobbie     wind    72
## 6  Bobbie pressure   118
str(data)
## 'data.frame':    6 obs. of  3 variables:
##  $ storm: Factor w/ 3 levels "Alex","Allison",..: 1 1 2 2 3 3
##  $ stat : Factor w/ 2 levels "pressure","wind": 2 1 2 1 2 1
##  $ value: num  68 130 55 121 72 118

Solution: Tidy data is a structure for which each observation is in a row, and each variable in its own column. This can best be achieved using the tidyr function spread().

library(tidyr)
(data_tidy <- spread(data,stat,value))
##     storm pressure wind
## 1    Alex      130   68
## 2 Allison      121   55
## 3  Bobbie      118   72
  2. Provide a pivot table for the mean and standard deviation of the wind and the pressure variables.

Solution: this can be achieved using the dplyr function summarise().

library(dplyr)

wind = summarise(data_tidy, mean=mean(as.numeric(wind)), sd=sd(as.numeric(wind)))
pressure = summarise(data_tidy, mean=mean(as.numeric(pressure)), sd=sd(as.numeric(pressure)))

pivot = rbind(wind,pressure)
pivot = cbind(c("wind","pressure"),pivot)

colnames(pivot)[1] = "stat"

pivot
##       stat mean       sd
## 1     wind   65 8.888194
## 2 pressure  123 6.244998
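
Alternatively, the same pivot table can be obtained directly from the long-format data with group_by() and summarise(), without the manual rbind()/cbind() assembly (a sketch, using the data object defined above):

```r
library(dplyr)

data %>%
  group_by(stat) %>%
  summarise(mean = mean(as.numeric(value)), sd = sd(as.numeric(value)))
```
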
  3. Turn the data found in cities.csv into a tidy dataset.

Solution: the cities.csv data is not in a tidy format since population values are found in two different columns (one per year); tidyr’s gather() (whose syntax is a little bit daunting) can help.

# read in the  data
cities <- read.csv('Data/cities.csv')

# add a column of identifiers for the cities
cities = cbind(1:37,cities) 

# this will improve readability
colnames(cities) <- c("city","category","1999","2019") 

# let's take one last look at the data in non-tidy format
str(cities) 
## 'data.frame':    37 obs. of  4 variables:
##  $ city    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ category: Factor w/ 3 levels "large","medium",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ 1999    : int  4590 5210 6030 6580 7300 7940 7960 8430 8750 9080 ...
##  $ 2019    : int  5090 7090 7710 8120 7320 6780 10050 11060 12140 12480 ...
library(tidyr)

# now we gather the information from the city dataset found in the 3rd and 4th column (1999, 2019)
(cities_tidy <- gather(cities,key="year", value="population", 3:4)) 
##    city category year population
## 1     1    small 1999       4590
## 2     2    small 1999       5210
## 3     3    small 1999       6030
## 4     4    small 1999       6580
## 5     5    small 1999       7300
## 6     6    small 1999       7940
## 7     7    small 1999       7960
## 8     8    small 1999       8430
## 9     9    small 1999       8750
## 10   10    small 1999       9080
## 11   11    small 1999      17330
## 12   12    small 1999      17470
## 13   13    small 1999      17660
## 14   14    small 1999      18990
## 15   15    small 1999      20520
## 16   16    small 1999      21130
## 17   17    small 1999      21220
## 18   18    small 1999      21640
## 19   19    small 1999      29500
## 20   20    small 1999      29830
## 21   21    small 1999      29930
## 22   22    small 1999      30900
## 23   23    small 1999      35520
## 24   24    small 1999      35520
## 25   25    small 1999      39980
## 26   26    small 1999      40940
## 27   27    small 1999      42880
## 28   28    small 1999      43360
## 29   29    small 1999      58080
## 30   30   medium 1999      80720
## 31   31   medium 1999      90480
## 32   32   medium 1999     103190
## 33   33   medium 1999     193340
## 34   34   medium 1999     196320
## 35   35   medium 1999     253200
## 36   36    large 1999    2869350
## 37   37    large 1999    3009490
## 38    1    small 2019       5090
## 39    2    small 2019       7090
## 40    3    small 2019       7710
## 41    4    small 2019       8120
## 42    5    small 2019       7320
## 43    6    small 2019       6780
## 44    7    small 2019      10050
## 45    8    small 2019      11060
## 46    9    small 2019      12140
## 47   10    small 2019      12480
## 48   11    small 2019      21740
## 49   12    small 2019      22640
## 50   13    small 2019      25000
## 51   14    small 2019      23170
## 52   15    small 2019      27710
## 53   16    small 2019      27840
## 54   17    small 2019      22650
## 55   18    small 2019      25000
## 56   19    small 2019      37120
## 57   20    small 2019      39120
## 58   21    small 2019      31160
## 59   22    small 2019      34100
## 60   23    small 2019      45800
## 61   24    small 2019      48280
## 62   25    small 2019      37120
## 63   26    small 2019      49040
## 64   27    small 2019      52250
## 65   28    small 2019      46480
## 66   29    small 2019      63730
## 67   30   medium 2019      81750
## 68   31   medium 2019     106190
## 69   32   medium 2019     110110
## 70   33   medium 2019     243390
## 71   34   medium 2019     293480
## 72   35   medium 2019     244800
## 73   36    large 2019    3652020
## 74   37    large 2019    3742940
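
Once in a tidy format, further wrangling is easy; for instance, the population change of each city between the two years can be sketched as follows (assuming cities_tidy as built above):

```r
library(dplyr)
library(tidyr)

cities_tidy %>%
  spread(year, population) %>%           # one column per year
  mutate(growth = `2019` - `1999`) %>%   # population change per city
  arrange(desc(growth)) %>%
  head()
```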