BANL 6100: Business Analytics

class: center, middle, inverse, title-slide

.title[
# BANL 6100: Business Analytics
]
.subtitle[
## Data Transformation
]
.author[
### Mehmet Balcilar </br> <a href="mailto:mbalcilar@newhaven.edu" class="email">mbalcilar@newhaven.edu</a>
]
.institute[
### Univeristy of New Haven
]
.date[
### 2023-08-28 (updated: 2024-09-26)
]

---

class: center, middle, sydney-blue

# Data Transformation in R

---

## Core **dplyr** functions

- `filter()`: select rows by their values
- `arrange()`: order rows
- `select()`: select columns by their names
- `mutate()`: create new variables
- `summarize()`: collapse many values down to a single summary
- `group_by()`: operate on it group-by-group
- `rename()`: rename columns
- `distinct()`: find distinct rows

Command structure (for all `dplyr` verbs):

- first argument is a data frame
- return value is a data frame
- nothing is modified in place

---

## Introduction

### Data transformation helps you get the data in exactly the right form you need.

- create new variables
- create summaries
- rename variables
- reorder observations
- ...and more!

### This chapter uses the `nycflights13` 📦

- Data set includes 336,776 flights that departed from New York City in 2013 
- US Bureau of Transportation Statistics

``` r
library(tidyverse)
# install.packages('nycflights13')   # install nycflights13 package, if needed
library(nycflights13)
?flights
```
---

## Introduction

``` r
flights %>%
  glimpse()
```

```
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 558, 558, 559, 559, 559, 600, 600, 601, 602, 602, 606, 606, 607, 608, 611, 613, 615, 615, 622, 623, 623, 624, 624, 627, 628, 628, 629, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 600, 600, 600, 559, 600, 600, 600, 600, 610, 605, 610, 610, 607, 600, 600, 610, 615, 615, 630, 610, 627, 630, 630, 630, 630, 630, 630, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -1, 0, 0, 1, -8, -3, -4, -4, 0, 8, 11, 3, 0, 0, -8, 13, -4, -6, -6, -3, -2, -2, -1, -1, -1, 24, 0, -8, -1, -3, -2, 8, -2, 1, 1, -4, -3,…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, 924, 923, 941, 702, 854, 851, 837, 844, 812, 821, 858, 837, 858, 807, 945, 925, 1039, 833, 1017, 920, 933, 909, 840, 1018, 1137, 1016,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, 917, 937, 910, 706, 902, 858, 825, 850, 820, 805, 910, 845, 915, 735, 931, 921, 1100, 842, 1014, 915, 932, 840, 830, 1018, 1140, 947, …
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31, -4, -8, -7, 12, -6, -8, 16, -12, -8, -17, 32, 14, 4, -21, -9, 3, 5, 1, 29, 10, 0, -3, 29, 14, -19, -9, 12, 48, -5, -10, -18, -11, -9,…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B6", "B6", "UA", "UA", "AA", "B6", "UA", "B6", "MQ", "B6", "DL", "MQ", "AA", "DL", "UA", "MQ", "UA", "B6", "B6", "DL", "US", "AA", "UA…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, 194, 1124, 707, 1806, 1187, 371, 4650, 343, 1919, 4401, 1895, 1743, 1077, 3768, 303, 135, 709, 575, 245, 1837, 496, 4626, 4599, 27, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N516JB", "N829AS", "N593JB", "N3ALAA", "N793JB", "N657JB", "N29129", "N53441", "N3DUAA", "N708JB", "N76515", "N595JB", "N542MQ", "N644…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK", "LGA", "JFK", "JFK", "JFK", "EWR", "LGA", "JFK", "EWR", "LGA", "LGA", "EWR", "LGA", "LGA", "EWR", "JFK", "EWR", "EWR", "JFK", "JFK",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO", "ORD", "PBI", "TPA", "LAX", "SFO", "DFW", "BOS", "LAS", "FLL", "ATL", "PBI", "MSP", "DTW", "MIA", "ATL", "MIA", "ORD", "SFO", "RSW",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 345, 361, 257, 44, 337, 152, 134, 147, 170, 105, 152, 128, 157, 139, 366, 175, 182, 120, 342, 153, 229, 190, 166, 330, 192, 366, 140, 40…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, 1005, 2475, 2565, 1389, 187, 2227, 1076, 762, 1023, 1020, 502, 1085, 760, 1085, 719, 2586, 1074, 1598, 746, 2133, 1096, 1416, 1008, 10…
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 6, 7, 7, 7, 6, 7, 7, 7, 7, 7, 7, …
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0, 10, 5, 10, 10, 7, 0, 0, 10, 15, 15, 30, 10, 27, 30, 30, 30, 30, 30, 30, 30, 30, 8, 35, 45, 40, 46, 45, 36, 47, 45, 45, 55, 55, 0,…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 06:00:00, 2013-01-01 05:00:00, 2013-01-01 06:00:00, 2013-01-01 06:00:00, 2013-01-01 06:00:00, 2013-…
```

---

## Introduction

``` r
head(flights, 4)
```

```
# A tibble: 4 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute time_hour          
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl> <dttm>             
1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5     15 2013-01-01 05:00:00
2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5     29 2013-01-01 05:00:00
3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40 2013-01-01 05:00:00
4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45 2013-01-01 05:00:00
```

.pull-left[
- `int`: integers 
- `dbl`: doubles, or real numbers
- `chr`: character vectors, or strings 
- `dttm`: date-times (a date + a time)

]

.pull-right[
- `lgl`: logical, vectors with only TRUE or FALSE
- `fctr`: factors, categorical variables
- `date`: dates
]

---

## Introduction

### **dplyr** basics

- Pick observations by their values (`filter()`).
- Reorder the rows (`arrange()`).
- Pick variables by their names (`select()`).
- Create new variables with functions of existing variables (`mutate()`).
- Collapse many values down to a single summary (`summarise()`).
--

- **+** `group_by()` which gets the above functions to operate group-by-group rather than on the entire dataset

### How do they work?

1. First argument is data frame
2. Other arguments describe what to do with the data frame
3. Result is a new data frame!

---

## Filter rows with `filter()`

Lets you subset observations based on their values.

``` r
filter(flights, month == 1, day == 1)
```

```
# A tibble: 842 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute time_hour          
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl> <dttm>             
 1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5     15 2013-01-01 05:00:00
 2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5     29 2013-01-01 05:00:00
 3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40 2013-01-01 05:00:00
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45 2013-01-01 05:00:00
 5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116      762     6      0 2013-01-01 06:00:00
 6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD        150      719     5     58 2013-01-01 05:00:00
 7  2013     1     1      555            600        -5      913            854        19 B6         507 N516JB  EWR    FLL        158     1065     6      0 2013-01-01 06:00:00
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA    IAD         53      229     6      0 2013-01-01 06:00:00
 9  2013     1     1      557            600        -3      838            846        -8 B6          79 N593JB  JFK    MCO        140      944     6      0 2013-01-01 06:00:00
10  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA  LGA    ORD        138      733     6      0 2013-01-01 06:00:00
# ℹ 832 more rows
```

---

### Comparisons

R provides the standard suite of comparison operators:
- `>`
- `>=`
- `<`
- `<=`
- `!=` (not equal)
- `==` (equal)

---

### Logical operators

- `&` is "and"
- `|` is "or"
- `!` is "not"

.center[
<img src="https://d33wubrfki0l68.cloudfront.net/01f4b6d39d2be8269740a3ad7946faa79f7243cf/8369a/diagrams/transform-logical.png" height="400" style="display: block; margin: auto;" />
]

---

### Missing values

- Missing values a.k.a. `NA`s a.k.a. "not availables" represent an unknown value

- They are contagious! 🦠 ...meaning almost any operation involving an unknown value will also be unknown

### Also important because...

`filter()` only includes rows where the condition is `TRUE` (i.e. it does not include both `FALSE` and `NA` values)

If you want to preserve missing values you must ask for them explicitly:

``` r
filter(df, is.na(x) | x > 1)
```

---

### Ex: Find all flights that were operated by United, American, or Delta

```
# A tibble: 2 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute time_hour          
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl> <dttm>             
1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5     15 2013-01-01 05:00:00
2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5     29 2013-01-01 05:00:00
```

```
 [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV" "OO"
```

``` r
# option 1
flights %>% filter(carrier = "UA" | "AA" | "DL")

# option 2
flights %>% filter(carrier == "UA" | "AA" | "DL")

# option 3
flights %>% filter(carrier == "UA" | carrier == "AA" | carrier == "DL")

# option 4
flights %>% filter(carrier %in% c("UA", "AA", "DL"))
```

---

#### Ex: Find all flights that were operated by United, American, or Delta

``` r
flights %>% filter(carrier = "UA" | "AA" | "DL")
```
> Error: Problem with `filter()` input `..1`. x Input `..1` is named. ℹ This usually means that you've used `=` instead of `==`. ℹ Did you mean `carrier == "UA" | "AA" | "DL"`?

---
#### Ex: Find all flights that were operated by United, American, or Delta

``` r
flights %>% filter(carrier == "UA" | "AA" | "DL")
```
> Error: Problem with `filter()` input `..1`. x operations are possible only for numeric, logical or complex types ℹ Input `..1` is `carrier == "UA" | "AA" | "DL"`.

---
#### Ex: Find all flights that were operated by United, American, or Delta

``` r
flights %>% filter(carrier == "UA" | carrier == "AA" | carrier == "DL")
```

```
# A tibble: 139,504 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute time_hour          
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl> <dttm>             
 1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5     15 2013-01-01 05:00:00
 2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5     29 2013-01-01 05:00:00
 3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40 2013-01-01 05:00:00
 4  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116      762     6      0 2013-01-01 06:00:00
 5  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD        150      719     5     58 2013-01-01 05:00:00
 6  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA  LGA    ORD        138      733     6      0 2013-01-01 06:00:00
 7  2013     1     1      558            600        -2      924            917         7 UA         194 N29129  JFK    LAX        345     2475     6      0 2013-01-01 06:00:00
 8  2013     1     1      558            600        -2      923            937       -14 UA        1124 N53441  EWR    SFO        361     2565     6      0 2013-01-01 06:00:00
 9  2013     1     1      559            600        -1      941            910        31 AA         707 N3DUAA  LGA    DFW        257     1389     6      0 2013-01-01 06:00:00
10  2013     1     1      559            600        -1      854            902        -8 UA        1187 N76515  EWR    LAS        337     2227     6      0 2013-01-01 06:00:00
# ℹ 139,494 more rows
```

---
#### Ex: Find all flights that were operated by United, American, or Delta

``` r
flights %>% filter(carrier %in% c("UA", "AA", "DL"))
```

---

## `filter()` exercises

Find all flights that

a.  Had an arrival delay of two or more hours.  
  b.  Arrived more than two hours late, but didn't leave late.  
  c.  Flew to Houston (`IAH` or `HOU`).  
  d.  Were operated by United, American, or Delta.  
  e.  Departed in summer (July, August, and September).

---

## `filter()` exercises solutions

Arrival delay of two or more hours

``` r
flights %>% 
  filter(arr_delay > 120)
```

Arrived more than two hours late, but didn't leave late

``` r
flights %>% 
filter(!is.na(dep_delay), dep_delay <= 0, arr_delay > 120)
```

Flew to Houston (`IAH` or `HOU`)

``` r
flights %>%
  filter(dest %in% c("IAH", "HOU"))
```

---

## `filter()` exercises solutions

Were operated by United, American, or Delta

``` r
flights %>% 
filter(carrier %in% c("AA", "DL", "UA"))
```

Departed in summer (July, August, and September)

``` r
flights %>% 
filter(between(month, 7, 9))
```

---

## Arrange rows with `arrange()`

### Changes the order of the rows
Takes a data frame and orders it by a set of column names (or more complicated expressions)

- orders a column in ascending order by default
- `desc()` can be used to reorder a column in `desc`ending order
--

- **Note:** missing values (`NA`s) are always sorted at the end

---

### Ex: Sort flights to find the most delayed flights. Find the flights that left earliest.

``` r
flights
```

```
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute time_hour          
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl> <dttm>             
 1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227     1400     5     15 2013-01-01 05:00:00
 2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227     1416     5     29 2013-01-01 05:00:00
 3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40 2013-01-01 05:00:00
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45 2013-01-01 05:00:00
 5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116      762     6      0 2013-01-01 06:00:00
 6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD        150      719     5     58 2013-01-01 05:00:00
 7  2013     1     1      555            600        -5      913            854        19 B6         507 N516JB  EWR    FLL        158     1065     6      0 2013-01-01 06:00:00
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA    IAD         53      229     6      0 2013-01-01 06:00:00
 9  2013     1     1      557            600        -3      838            846        -8 B6          79 N593JB  JFK    MCO        140      944     6      0 2013-01-01 06:00:00
10  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA  LGA    ORD        138      733     6      0 2013-01-01 06:00:00
# ℹ 336,766 more rows
```

---

#### Sort flights to find the most delayed flights

``` r
arrange(flights, desc(dep_delay))
```

```
# A tibble: 3 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute time_hour          
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl> <dttm>             
1  2013     1     9      641            900      1301     1242           1530      1272 HA          51 N384HA  JFK    HNL        640     4983     9      0 2013-01-09 09:00:00
2  2013     6    15     1432           1935      1137     1607           2120      1127 MQ        3535 N504MQ  JFK    CMH         74      483    19     35 2013-06-15 19:00:00
3  2013     1    10     1121           1635      1126     1239           1810      1109 MQ        3695 N517MQ  EWR    ORD        111      719    16     35 2013-01-10 16:00:00
```
--
#### Find the flights that left earliest

``` r
arrange(flights, dep_delay)
```

```
# A tibble: 3 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time distance  hour minute time_hour          
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl> <dttm>             
1  2013    12     7     2040           2123       -43       40           2352        48 B6          97 N592JB  JFK    DEN        265     1626    21     23 2013-12-07 21:00:00
2  2013     2     3     2022           2055       -33     2240           2338       -58 DL        1715 N612DL  LGA    MSY        162     1183    20     55 2013-02-03 20:00:00
3  2013    11    10     1408           1440       -32     1549           1559       -10 EV        5713 N825AS  LGA    IAD         52      229    14     40 2013-11-10 14:00:00
```

---

## `arrange()` exercises

Sort `flights` to

a.  find the flight that departed the earliest (earlier than scheduled). 
  b.  find the most delayed flight.  
  c.  find the flight that travelled the longest and that travelled the shortest distance.

---

## `arrange()` exercises solutions

Flight B6 97 left the earliest (43 minutes early).

``` r
flights %>% 
arrange(dep_delay)
```

Flight HA 51 is the most delayed flight (1301 minutes delay).

``` r
flights %>% 
arrange(desc(dep_delay))
```

Flight HA 51 is the longest (4,983 miles), flight US 1632 is the shortest (17 miles).

``` r
flights %>% 
arrange(desc(distance))
```

---

## Select columns with `select()`

Sometimes we get data sets that have hundreds or thousands of variables 😬

`select()` gives us a way to narrow down to only those variables we're interested in! 🎉

---

## Select columns with `select()`

### Helper functions include
- `starts_with("abc")`: matches names that begin with “abc”.
- `ends_with("xyz")`: matches names that end with “xyz”.
- `contains("ijk")`: matches names that contain “ijk”.
- `matches("(.)\\1")`: selects variables that match a regular expression. This one matches any variables that contain repeated characters (see **Chapter 14: Strings** in `R4DS`)
- `num_range("x", 1:3)`: matches x1, x2 and x3.

- ⭐ `everything()`: useful when you want to move a few variables to the start of the data frame

---

### Ex: Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights.

``` r
flights
```

---
#### Ex: Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights.

``` r
select(flights, dep_time, dep_delay, arr_time, arr_delay)
```

```
# A tibble: 336,776 × 4
   dep_time dep_delay arr_time arr_delay
      <int>     <dbl>    <int>     <dbl>
 1      517         2      830        11
 2      533         4      850        20
 3      542         2      923        33
 4      544        -1     1004       -18
 5      554        -6      812       -25
 6      554        -4      740        12
 7      555        -5      913        19
 8      557        -3      709       -14
 9      557        -3      838        -8
10      558        -2      753         8
# ℹ 336,766 more rows
```

---
#### Ex: Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights.

``` r
select(flights, starts_with("dep") | starts_with("arr"))
```

---
#### Ex: Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights.

``` r
select(flights, ends_with("time") | ends_with("delay"))
```

```
# A tibble: 336,776 × 7
   dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay arr_delay
      <int>          <int>    <int>          <int>    <dbl>     <dbl>     <dbl>
 1      517            515      830            819      227         2        11
 2      533            529      850            830      227         4        20
 3      542            540      923            850      160         2        33
 4      544            545     1004           1022      183        -1       -18
 5      554            600      812            837      116        -6       -25
 6      554            558      740            728      150        -4        12
 7      555            600      913            854      158        -5        19
 8      557            600      709            723       53        -3       -14
 9      557            600      838            846      140        -3        -8
10      558            600      753            745      138        -2         8
# ℹ 336,766 more rows
```

## Not quite! Why?

---

## `select()` exercises

1.  What are three distinct ways to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
  2.  What does the `one_of()` function do? Why might it be helpful in conjunction with this vector?

``` r
vars <- c("year", "month", "day", "dep_delay",
          "arr_delay")
```
    
---

## `select()` exercise solutions

Three distinct ways to select variables.

``` r
select(flights, dep_time, dep_delay, arr_time, arr_delay)
```

``` r
select(flights, starts_with("dep_"), starts_with("arr_"))
```

``` r
select(flights, matches("^(dep|arr)_(time|delay)$"))
```

What does the `one_of()` function do?

``` r
vars <- c("year", "month", "day", "dep_delay",
          "arr_delay")
select(flights, one_of(vars))
```

---

## Add new variables with `mutate()`

Helps you add new columns that are functions of existing columns

- always adds columns at the end of your dataset
--

- if you want to only keep the new variables, you can use `transmute()`

---

### Add new variables with `mutate()`

.pull-left[
Can be used with _arithmetic operators_
- `+`
- `-`
- `*`
- `/`
- `^`
- `sum()`
]

.pull-right[

_modular arithmetic_
- `%/%` (integer division)
- `%%` (remainder)
- `log()`
- `log2()`
- `log10()`
]

.pull-left[

- _logical comparisons_: `<`, `<=`, `>`, `>=`, `!=`
- _offsets_: `lead()` and `lag()`
- _ranking_: `min_rank()`, `row_number()`, `dense_rank()`, and more!
]

---

## `mutate()`

``` r
flights %>% 
  select(ends_with("delay"), distance, air_time) %>% 
  mutate(gain = arr_delay - dep_delay,
         speed = distance / air_time * 60
)
```

| dep_delay| arr_delay| distance| air_time| gain|    speed|
|---------:|---------:|--------:|--------:|----:|--------:|
|         2|        11|     1400|      227|    9| 370.0441|
|         4|        20|     1416|      227|   16| 374.2731|
|         2|        33|     1089|      160|   31| 408.3750|
|        -1|       -18|     1576|      183|  -17| 516.7213|
|        -6|       -25|      762|      116|  -19| 394.1379|
|        -4|        12|      719|      150|   16| 287.6000|

---

## `transmute()`

Use `transmute()` to only keep the new variables:

``` r
transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
```

---
## `mutate()` exercises

Use `mutate()` to

1.  Create new variables for `dep_time` and `sched_dep_time` that measure time in the number of minutes after midnight. 
  2.  Care `air_time` with `arr_time - dep_time`. What do you see? What do you need to do to fix it?  
  3.  Find the 10 most delayed flights using a ranking function. How do you want to handle ties?

---

## `mutate()` exercise solutions

New variables for `dep_time` and `sched_dep_time`

``` r
flights %>% 
  mutate(dep_time2 = dep_time %/% 100 * 60 + 
           dep_time %% 100,
         sched_dep_time2 = sched_dep_time %/% 100 *
           60 + sched_dep_time %% 100) %>%
  select(dep_time, dep_time2, sched_dep_time,
         sched_dep_time2)
```

---

## `mutate()` exercise solutions

`arr_time` and `dep_time` may be in different time zones.

``` r
flights %>% 
  mutate(air_time2 = arr_time - dep_time,
       air_time_diff = air_time2 - air_time) %>%
  filter(air_time_diff != 0) %>%
  select(air_time, air_time2, dep_time, arr_time, dest)
```

10 most delayed flights

``` r
flights %>% 
  mutate(dep_delay_rank = min_rank(-dep_delay)) %>%
  arrange(dep_delay_rank) %>% 
  filter(dep_delay_rank <= 10)
```

---

## `summarize()`

`summarize()` collapses a data frame to a single row.

``` r
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
```

```
# A tibble: 1 × 1
  delay
  <dbl>
1  12.6
```

---

## `summarize()` with `group_by()`
`summarize()` is most effectively used with `group_by()`, which changes the unit of analysis from the complete dataset to individual groups.

Grouping is most useful in conjunction with `summarise()`, but you can also do convenient operations with `mutate()` and `filter()`.

---

## `summarize()` with `group_by()`
For example, to get the average delay per date

``` r
flights %>% 
  group_by(year, month, day) %>% 
  summarise(delay = mean(dep_delay, na.rm = TRUE))
```

---

## `summarize()` count
For aggregations it is generally a good idea to include a count `n()`. For example, let’s look at the (not cancelled) planes that have the highest average delays:

``` r
flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))
  group_by(tailnum) %>% 
  summarise(delay = mean(arr_delay)) %>% 
  arrange(delay)
```

---

## `summarize()` useful functions

There are a number of useful summary functions:

* Measures of location: `mean(x)`, `sum(x)`, `median(x)`.
* Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`.
* Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`.
* Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`.
* Counts: `n()`, `sum(!is.na(x))`, `n_distinct(x)`.
* Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`.

---

## `summarize()` exercises

Use `summarize()` to.  
  1.  Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?  
  2.  Find the carrier with the worst delays.

---

## `summarize()` exercises solutions

Number of cancelled flights per day

``` r
cancelled_delayed <- 
  flights %>%
  mutate(cancelled = (is.na(arr_delay) | 
                        is.na(dep_delay))) %>%
  group_by(year, month, day) %>%
  summarise(prop_cancelled = mean(cancelled),
            avg_dep_delay = mean(dep_delay, na.rm = TRUE))
```

---

## `summarize()` exercises solutions

Frontier Airlines (FL) has the worst delays.

``` r
flights %>%
  group_by(carrier) %>%
  summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(arr_delay))
```

---

## Grouped summaries with `summarize()`

Collapses a data frame into a single row and more useful when combined with `group_by()`

Arguably **most** useful when used with the pipe `%>%` operator

Go from this:

``` r
by_dest <- group_by(flights, dest)
delay <- summarize(by_dest, count = n(),
                   dis = mean(distance, na.rm = TRUE),
                   delay = mean(arr_del, na.rm = TRUE))
```

--
To this! 🎉

``` r
flights %>% 
  group_by(dest) %>% 
  summarize(by_dest, count = n(),
            dis = mean(distance, na.rm = TRUE),
            delay = mean(arr_del, na.rm = TRUE))
```

#### Spot the differences!

---

## Grouped mutate (and filters)

Most useful in combination with `summarize()`

For example:

``` r
popular_dests <- flights %>%
  group_by(dest) %>%
  filter(n() > 365)
```

### What does it do?