# Fetch the data with getData.R if it is not already on disk
if (!file.exists("./data/data.csv")) source("getData.R")
ext_tracks <- read.csv("./data/data.csv")

When cleaning up data, you will need to be able to create subsets of the data, by selecting certain columns or filtering down to certain rows. These actions can be done using the dplyr functions select and filter.

The select function subsets certain columns of a data frame. The most basic way to use select is to pick columns by specifying their full column names. For example, to select the storm name, date, time, latitude, longitude, and maximum wind speed from the ext_tracks dataset, you can run:

library(dplyr)

ext_tracks %>%
  select(storm_name, month, day, hour, year, latitude, longitude, max_wind) %>%
  head()
##   storm_name month day hour year latitude longitude max_wind
## 1    ALBERTO     8   5   18 1988     32.0      77.5       20
## 2    ALBERTO     8   6    0 1988     32.8      76.2       20
## 3    ALBERTO     8   6    6 1988     34.0      75.2       20
## 4    ALBERTO     8   6   12 1988     35.2      74.6       25
## 5    ALBERTO     8   6   18 1988     37.0      73.5       25
## 6    ALBERTO     8   7    0 1988     38.7      72.4       25

There are several functions you can use with select that give you more flexibility, and so allow you to select columns without specifying the full name of each column. For example, the starts_with function can be used within a select function to pick out all the columns that start with a certain text string. As an example of using starts_with in conjunction with select, in the ext_tracks hurricane data, there are a number of columns that say how far from the storm center winds of certain speeds extend. Tropical storms often have asymmetrical wind fields, so these wind radii are given for each quadrant of the storm (northeast, southeast, southwest, and northwest of the storm’s center). All of the columns with the radius to which winds of 34 knots or more extend start with “radius_34”. To get a dataset with storm names, location, and radii of winds of 34 knots, you could run:

ext_tracks %>%
  select(storm_name, latitude, longitude, starts_with("radius_34")) %>%
  head()
##   storm_name latitude longitude radius_34_ne radius_34_se radius_34_sw
## 1    ALBERTO     32.0      77.5            0            0            0
## 2    ALBERTO     32.8      76.2            0            0            0
## 3    ALBERTO     34.0      75.2            0            0            0
## 4    ALBERTO     35.2      74.6            0            0            0
## 5    ALBERTO     37.0      73.5            0            0            0
## 6    ALBERTO     38.7      72.4            0            0            0
##   radius_34_nw
## 1            0
## 2            0
## 3            0
## 4            0
## 5            0
## 6            0

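Other functions that can be used with select in a similar way include ends_with (columns whose names end with a certain string), contains (columns whose names contain a certain string), and matches (columns whose names match a regular expression).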
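As a rough sketch of how a few of dplyr's other selection helpers behave, the example below applies ends_with, contains, and matches to a small made-up data frame. The data frame toy_tracks and its values are invented for illustration; only the column naming scheme mimics ext_tracks.

```r
library(dplyr)

# Hypothetical toy data; only the column naming scheme follows ext_tracks
toy_tracks <- data.frame(
  storm_name   = c("ALBERTO", "BERYL"),
  radius_34_ne = c(0, 50),
  radius_34_se = c(0, 40),
  radius_50_ne = c(0, 25),
  stringsAsFactors = FALSE
)

# Columns whose names end with "_ne"
ne_cols <- toy_tracks %>% select(ends_with("_ne")) %>% names()

# Columns whose names contain "34"
cols_34 <- toy_tracks %>% select(contains("34")) %>% names()

# Columns whose names match a regular expression
# (an underscore, two digits, then another underscore)
regex_cols <- toy_tracks %>% select(matches("_[0-9][0-9]_")) %>% names()
```

Here ne_cols holds the two northeast columns, cols_34 holds the two 34-knot columns, and regex_cols holds all three radius columns.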

While select picks out certain columns of the data frame, filter picks out certain rows. With filter, you can specify certain conditions using R’s logical operators, and the function will return rows that meet those conditions.

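R’s logical operators include == (equal to), != (not equal to), > (greater than), >= (greater than or equal to), < (less than), <= (less than or equal to), %in% (included in a set of values), and the is.na function (is a missing value).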

If you are ever unsure of how to write a logical statement, but know how to write its opposite, you can use the ! operator to negate the whole statement. For example, if you wanted to get all storms except those named “KATRINA” and “ANDREW”, you could use !(storm_name %in% c("KATRINA", "ANDREW")). A common use of this is to identify observations with non-missing data (e.g., !(is.na(radius_34_ne))).
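The two negated statements above can be tried out on a small stand-in data frame. This is only a sketch: toy_tracks and its values are made up for illustration and are not rows from ext_tracks.

```r
library(dplyr)

# Hypothetical toy data standing in for ext_tracks
toy_tracks <- data.frame(
  storm_name   = c("KATRINA", "ANDREW", "WILMA", "GILBERT"),
  radius_34_ne = c(100, NA, 75, NA),
  stringsAsFactors = FALSE
)

# Keep every storm except KATRINA and ANDREW
others <- toy_tracks %>%
  filter(!(storm_name %in% c("KATRINA", "ANDREW")))

# Keep only the rows where radius_34_ne is not missing
non_missing <- toy_tracks %>%
  filter(!(is.na(radius_34_ne)))
```

In this toy example, others keeps the WILMA and GILBERT rows, while non_missing keeps the KATRINA and WILMA rows.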

A logical statement, run by itself on a vector, will return a vector of the same length with TRUE every time the condition is met and FALSE every time it is not.

head(ext_tracks$hour)
## [1] 18  0  6 12 18  0
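To see the logical vector itself, the hour values printed above can be compared against 0. In this sketch the six values are typed in by hand (as a plain vector named hour) so that the snippet runs without the data file.

```r
# The six hour values shown in the output above, typed in by hand
hour <- c(18, 0, 6, 12, 18, 0)

# TRUE where the observation was taken at hour 0, FALSE elsewhere
is_midnight <- hour == 0
is_midnight
## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE
```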

When you use a logical statement within filter, it will return just the rows where the logical statement is true:

ext_tracks %>% select(storm_name, hour, max_wind) %>% head(9)
##   storm_name hour max_wind
## 1    ALBERTO   18       20
## 2    ALBERTO    0       20
## 3    ALBERTO    6       20
## 4    ALBERTO   12       25
## 5    ALBERTO   18       25
## 6    ALBERTO    0       25
## 7    ALBERTO    6       30
## 8    ALBERTO   12       35
## 9    ALBERTO   18       35
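A minimal sketch of the filtering step itself, assuming hour is stored as a number as in the output above: the nine rows just shown are rebuilt by hand as first_nine (a name made up here) and then filtered down to the observations taken at hour 0.

```r
library(dplyr)

# The nine rows printed above, rebuilt by hand (first_nine is a made-up name)
first_nine <- data.frame(
  storm_name = rep("ALBERTO", 9),
  hour       = c(18, 0, 6, 12, 18, 0, 6, 12, 18),
  max_wind   = c(20, 20, 20, 25, 25, 25, 30, 35, 35),
  stringsAsFactors = FALSE
)

# Keep only the rows where the logical statement hour == 0 is TRUE
midnight_rows <- first_nine %>%
  filter(hour == 0)
midnight_rows
##   storm_name hour max_wind
## 1    ALBERTO    0       20
## 2    ALBERTO    0       25
```

On the full dataset, the same condition would go inside filter in a pipe that starts from ext_tracks.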

Filtering can also be done after summarizing data. For example, to determine which storms had maximum wind speed equal to or above 160 knots, run:

ext_tracks %>%
  group_by(storm_name, year) %>%
  summarize(worst_wind = max(max_wind)) %>%
  filter(worst_wind >= 160)
## Source: local data frame [2 x 3]
## Groups: storm_name [2]
## 
##   storm_name  year worst_wind
##       <fctr> <int>      <int>
## 1    GILBERT  1988        160
## 2      WILMA  2005        160

If you would like to string several logical conditions together and select rows where all or any of the conditions are true, you can use the “and” (&) or “or” (|) operators. For example, to pull out observations for Hurricane Andrew when it was at or above Category 5 strength (137 knots or higher), you could run:

ext_tracks %>%
  select(storm_name, month, day, hour, latitude, longitude, max_wind) %>%
  filter(storm_name == "ANDREW" & max_wind >= 137)
##   storm_name month day hour latitude longitude max_wind
## 1     ANDREW     8  23   12     25.4      74.2      145
## 2     ANDREW     8  23   18     25.4      75.8      150
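The “or” operator is used the same way. As a sketch on made-up data (toy_tracks and its wind speeds are invented, not taken from ext_tracks), the first filter below keeps rows that meet both conditions, while the second keeps rows that meet at least one of them.

```r
library(dplyr)

# Hypothetical toy data; the wind speeds are invented for illustration
toy_tracks <- data.frame(
  storm_name = c("ANDREW", "ANDREW", "BONNIE", "BONNIE"),
  max_wind   = c(150, 60, 140, 30),
  stringsAsFactors = FALSE
)

# "and": the row must be for ANDREW *and* have winds of 137 knots or more
both <- toy_tracks %>%
  filter(storm_name == "ANDREW" & max_wind >= 137)

# "or": the row must be for ANDREW *or* have winds of 137 knots or more
either <- toy_tracks %>%
  filter(storm_name == "ANDREW" | max_wind >= 137)
```

With these values, both contains a single row, while either contains three: both ANDREW rows plus the 140-knot BONNIE row.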

Some common errors that come up when using logical operators in R are: