Data manipulation with tidyr

Introduction
Get Data
tidyr Functions

Introduction

tidyr is a package by Hadley Wickham that makes it easy to tidy your data. It is often used in conjunction with dplyr. Data is tidy when each column represents a variable and each row represents an observation.

tidyr is part of the tidyverse. tidyverse is a set of packages that work in harmony because they share common data representations and API design. tidyverse is designed to make it easy to install and load core packages from the tidyverse in a single command. I recommend you become intimately familiar with these tools.

Explore tidyrfunctions:

gather – converts wide data to longer format. It is analogous to melt in reshape2.
spread – converts long data to wider format. It is analogous to the cast in reshape2.
unite – combines two or more columns into a single column.
separate – splits one column into two or more columns.

reshape2 is another popular R package. Like many function in R, there are many ways to accomplish any given task. With all due respect to reshape2, I prefer to use the tools in tidyverse.

Get Data

Use the mtcars dataset:

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Include the names of the cars in a column called car.

mtcars$car <- rownames(mtcars)
mtcars <- mtcars[, c(12, 1:11)]
head(mtcars)

##                                 car  mpg cyl disp  hp drat    wt  qsec vs
## Mazda RX4                 Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0
## Mazda RX4 Wag         Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0
## Datsun 710               Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1
## Hornet 4 Drive       Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1
## Hornet Sportabout Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0
## Valiant                     Valiant 18.1   6  225 105 2.76 3.460 20.22  1
##                   am gear carb
## Mazda RX4          1    4    4
## Mazda RX4 Wag      1    4    4
## Datsun 710         1    4    1
## Hornet 4 Drive     0    3    1
## Hornet Sportabout  0    3    2
## Valiant            0    3    1

tidyr Functions

tidyr - gather

gather takes the form: gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)

where … is the specification of the columns to gather.

mtcarsNew <- mtcars %>% gather(attribute, value, -car)
head(mtcarsNew)

##                 car attribute value
## 1         Mazda RX4       mpg  21.0
## 2     Mazda RX4 Wag       mpg  21.0
## 3        Datsun 710       mpg  22.8
## 4    Hornet 4 Drive       mpg  21.4
## 5 Hornet Sportabout       mpg  18.7
## 6           Valiant       mpg  18.1

It gathers all the columns except car and places their name and value into the attritube and value column.

The great thing about tidyr is that you can gather only certain columns and leave the others alone. If we want to gather all the columns from mpg to gear and leave the carb and car columns as they are we do this:

mtcarsNew <- mtcars %>% gather(attribute, value, mpg:gear)
head(mtcarsNew)

##                 car carb attribute value
## 1         Mazda RX4    4       mpg  21.0
## 2     Mazda RX4 Wag    4       mpg  21.0
## 3        Datsun 710    1       mpg  22.8
## 4    Hornet 4 Drive    1       mpg  21.4
## 5 Hornet Sportabout    2       mpg  18.7
## 6           Valiant    1       mpg  18.1

tidyr - spread

spread takes the form: spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE)

Here is an example:

mtcarsSpread <- mtcarsNew %>% spread(attribute, value)
head(mtcarsSpread)

##                  car carb am cyl disp drat gear  hp  mpg  qsec vs    wt
## 1        AMC Javelin    2  0   8  304 3.15    3 150 15.2 17.30  0 3.435
## 2 Cadillac Fleetwood    4  0   8  472 2.93    3 205 10.4 17.98  0 5.250
## 3         Camaro Z28    4  0   8  350 3.73    3 245 13.3 15.41  0 3.840
## 4  Chrysler Imperial    4  0   8  440 3.23    3 230 14.7 17.42  0 5.345
## 5         Datsun 710    1  1   4  108 3.85    4  93 22.8 18.61  1 2.320
## 6   Dodge Challenger    2  0   8  318 2.76    3 150 15.5 16.87  0 3.520

tidyr - unite

unite takes the form: unite(data, col, ..., sep = "_", remove = TRUE)

where … represents the columns to unite and col represents the column to add.

We need new data to demo this::

date <- as.Date('2017-07-01') + 0:14
hour <- sample(1:24, 15)
min <- sample(1:60, 15)
second <- sample(1:60, 15)
event <- sample(letters, 15)
data <- data.frame(date, hour, min, second, event)
head(data,5)

##         date hour min second event
## 1 2017-07-01    1  42     22     y
## 2 2017-07-02    7  43     29     f
## 3 2017-07-03    2  53     48     q
## 4 2017-07-04    8  31     53     e
## 5 2017-07-05   18  56     17     m

Combine the date, hour, min, and second columns into a new column called datetime. Datetime in R is of the form Year-Month-Day Hour:Min:Second.

dataNew <- data %>% unite(datehour, date, hour, sep = ' ') %>% unite(datetime, datehour, min, second, sep = ':')
head(dataNew, 5)

##              datetime event
## 1  2017-07-01 1:42:22     y
## 2  2017-07-02 7:43:29     f
## 3  2017-07-03 2:53:48     q
## 4  2017-07-04 8:31:53     e
## 5 2017-07-05 18:56:17     m

tidyr - separate

separate takes the form: separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)

Revert back to the original data using separate:

data1 <- dataNew %>% separate(datetime, c('date', 'time'), sep = ' ') %>% separate(time, c('hour', 'min', 'second'), sep = ':')
head(data1,5)

##         date hour min second event
## 1 2017-07-01    1  42     22     y
## 2 2017-07-02    7  43     29     f
## 3 2017-07-03    2  53     48     q
## 4 2017-07-04    8  31     53     e
## 5 2017-07-05   18  56     17     m

It first splits the datetime column into date and time, and then splits time into hour, min, and second.