class: center, middle, inverse, title-slide .title[ # BANL 6100: Business Analytics ] .subtitle[ ## Data Transformation ] .author[ ### Mehmet Balcilar
mbalcilar@newhaven.edu
] .institute[ ### Univeristy of New Haven ] .date[ ### 2023-08-28 (updated: 2024-09-26) ] --- class: center, middle, sydney-blue # Data Transformation in R --- ## Core **dplyr** functions - `filter()`: select rows by their values - `arrange()`: order rows - `select()`: select columns by their names - `mutate()`: create new variables - `summarize()`: collapse many values down to a single summary - `group_by()`: operate on it group-by-group - `rename()`: rename columns - `distinct()`: find distinct rows Command structure (for all `dplyr` verbs): - first argument is a data frame - return value is a data frame - nothing is modified in place --- ## Introduction ### Data transformation helps you get the data in exactly the right form you need. -- - create new variables - create summaries - rename variables - reorder observations - ...and more! -- ### This chapter uses the `nycflights13` 📦 - Data set includes 336,776 flights that departed from New York City in 2013 - US Bureau of Transportation Statistics ``` r library(tidyverse) # install.packages('nycflights13') # install nycflights13 package, if needed library(nycflights13) ?flights ``` --- ## Introduction ``` r flights %>% glimpse() ``` ``` Rows: 336,776 Columns: 19 $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201… $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, … $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, 558, 558, 558, 559, 559, 559, 600, 600, 601, 602, 602, 606, 606, 607, 608, 611, 613, 615, 615, 622, 623, 623, 624, 624, 627, 628, 628, 629, … $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, 600, 600, 600, 600, 559, 600, 600, 600, 600, 610, 605, 610, 610, 607, 600, 600, 610, 615, 615, 630, 610, 627, 630, 630, 630, 630, 630, 630, … $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1, 0, -1, 0, 0, 1, -8, -3, -4, -4, 0, 8, 11, 3, 0, 0, -8, 13, -4, -6, -6, -3, -2, -2, -1, -1, -1, 24, 0, -8, -1, -3, -2, 8, -2, 1, 1, -4, -3,… $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849, 853, 924, 923, 941, 702, 854, 851, 837, 844, 812, 821, 858, 837, 858, 807, 945, 925, 1039, 833, 1017, 920, 933, 909, 840, 1018, 1137, 1016,… $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851, 856, 917, 937, 910, 706, 902, 858, 825, 850, 820, 805, 910, 845, 915, 735, 931, 921, 1100, 842, 1014, 915, 932, 840, 830, 1018, 1140, 947, … $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -14, 31, -4, -8, -7, 12, -6, -8, 16, -12, -8, -17, 32, 14, 4, -21, -9, 3, 5, 1, 29, 10, 0, -3, 29, 14, -19, -9, 12, 48, -5, -10, -18, -11, -9,… $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "AA", "B6", "B6", "UA", "UA", "AA", "B6", "UA", "B6", "MQ", "B6", "DL", "MQ", "AA", "DL", "UA", "MQ", "UA", "B6", "B6", "DL", "US", "AA", "UA… $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 49, 71, 194, 1124, 707, 1806, 1187, 371, 4650, 343, 1919, 4401, 1895, 1743, 1077, 3768, 303, 135, 709, 575, 245, 1837, 496, 4626, 4599, 27, 4… $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39463", "N516JB", "N829AS", "N593JB", "N3ALAA", "N793JB", "N657JB", "N29129", "N53441", "N3DUAA", "N708JB", "N76515", "N595JB", "N542MQ", "N644… $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA", "JFK", "LGA", "JFK", "JFK", "JFK", "EWR", "LGA", "JFK", "EWR", "LGA", "LGA", "EWR", "LGA", "LGA", "EWR", "JFK", "EWR", "EWR", "JFK", "JFK",… $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO", "ORD", "PBI", "TPA", "LAX", "SFO", "DFW", "BOS", "LAS", "FLL", "ATL", "PBI", "MSP", "DTW", "MIA", "ATL", "MIA", "ORD", "SFO", "RSW",… $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 158, 345, 361, 257, 44, 337, 152, 134, 147, 170, 105, 152, 128, 157, 139, 366, 175, 182, 120, 342, 153, 229, 190, 166, 330, 192, 366, 140, 40… $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028, 1005, 2475, 2565, 1389, 187, 2227, 1076, 762, 1023, 1020, 502, 1085, 760, 1085, 719, 2586, 1074, 1598, 746, 2133, 1096, 1416, 1008, 10… $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 6, 7, 7, 7, 6, 7, 7, 7, 7, 7, 7, … $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0, 10, 5, 10, 10, 7, 0, 0, 10, 15, 15, 30, 10, 27, 30, 30, 30, 30, 30, 30, 30, 30, 8, 35, 45, 40, 46, 45, 36, 47, 45, 45, 55, 55, 0,… $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 06:00:00, 2013-01-01 05:00:00, 2013-01-01 06:00:00, 2013-01-01 06:00:00, 2013-01-01 06:00:00, 2013-… ``` --- ## Introduction ``` r head(flights, 4) ``` ``` # A tibble: 4 × 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm> 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00 ``` -- .pull-left[ - `int`: integers - `dbl`: doubles, or real numbers - `chr`: character vectors, or strings - `dttm`: date-times (a date + a time) ] -- .pull-right[ - `lgl`: logical, vectors with only TRUE or FALSE - `fctr`: factors, categorical variables - `date`: dates ] --- ## Introduction ### **dplyr** basics - Pick observations by their values (`filter()`). - Reorder the rows (`arrange()`). - Pick variables by their names (`select()`). - Create new variables with functions of existing variables (`mutate()`). - Collapse many values down to a single summary (`summarise()`). -- - **+** `group_by()` which gets the above functions to operate group-by-group rather than on the entire dataset ### How do they work? 1. First argument is data frame 2. Other arguments describe what to do with the data frame 3. Result is a new data frame! --- ## Filter rows with `filter()` Lets you subset observations based on their values. -- <img src="images/filter.png" width="750px" style="display: block; margin: auto;" /> -- ``` r filter(flights, month == 1, day == 1) ``` ``` # A tibble: 842 × 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm> 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL 158 1065 6 0 2013-01-01 06:00:00 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53 229 6 0 2013-01-01 06:00:00 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO 140 944 6 0 2013-01-01 06:00:00 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138 733 6 0 2013-01-01 06:00:00 # ℹ 832 more rows ``` --- ### Comparisons R provides the standard suite of comparison operators: - `>` - `>=` - `<` - `<=` - `!=` (not equal) - `==` (equal) --- ### Logical operators - `&` is "and" - `|` is "or" - `!` is "not" .center[ <img src="https://d33wubrfki0l68.cloudfront.net/01f4b6d39d2be8269740a3ad7946faa79f7243cf/8369a/diagrams/transform-logical.png" height="400" style="display: block; margin: auto;" /> ] --- ### Missing values - Missing values a.k.a. `NA`s a.k.a. "not availables" represent an unknown value -- - They are contagious! 🦠 ...meaning almost any operation involving an unknown value will also be unknown -- ### Also important because... `filter()` only includes rows where the condition is `TRUE` (i.e. it does not include both `FALSE` and `NA` values) -- If you want to preserve missing values you must ask for them explicitly: ``` r filter(df, is.na(x) | x > 1) ``` --- ### Ex: Find all flights that were operated by United, American, or Delta ``` # A tibble: 2 × 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm> 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00 ``` -- ``` [1] "UA" "AA" "B6" "DL" "EV" "MQ" "US" "WN" "VX" "FL" "AS" "9E" "F9" "HA" "YV" "OO" ``` -- ``` r # option 1 flights %>% filter(carrier = "UA" | "AA" | "DL") # option 2 flights %>% filter(carrier == "UA" | "AA" | "DL") # option 3 flights %>% filter(carrier == "UA" | carrier == "AA" | carrier == "DL") # option 4 flights %>% filter(carrier %in% c("UA", "AA", "DL")) ``` --- #### Ex: Find all flights that were operated by United, American, or Delta ``` r flights %>% filter(carrier = "UA" | "AA" | "DL") ``` > Error: Problem with `filter()` input `..1`. x Input `..1` is named. ℹ This usually means that you've used `=` instead of `==`. ℹ Did you mean `carrier == "UA" | "AA" | "DL"`? --- #### Ex: Find all flights that were operated by United, American, or Delta ``` r flights %>% filter(carrier == "UA" | "AA" | "DL") ``` > Error: Problem with `filter()` input `..1`. x operations are possible only for numeric, logical or complex types ℹ Input `..1` is `carrier == "UA" | "AA" | "DL"`. --- #### Ex: Find all flights that were operated by United, American, or Delta ``` r flights %>% filter(carrier == "UA" | carrier == "AA" | carrier == "DL") ``` ``` # A tibble: 139,504 × 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm> 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00 4 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00 5 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00 6 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138 733 6 0 2013-01-01 06:00:00 7 2013 1 1 558 600 -2 924 917 7 UA 194 N29129 JFK LAX 345 2475 6 0 2013-01-01 06:00:00 8 2013 1 1 558 600 -2 923 937 -14 UA 1124 N53441 EWR SFO 361 2565 6 0 2013-01-01 06:00:00 9 2013 1 1 559 600 -1 941 910 31 AA 707 N3DUAA LGA DFW 257 1389 6 0 2013-01-01 06:00:00 10 2013 1 1 559 600 -1 854 902 -8 UA 1187 N76515 EWR LAS 337 2227 6 0 2013-01-01 06:00:00 # ℹ 139,494 more rows ``` --- #### Ex: Find all flights that were operated by United, American, or Delta ``` r flights %>% filter(carrier %in% c("UA", "AA", "DL")) ``` ``` # A tibble: 139,504 × 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm> 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00 4 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00 5 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00 6 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138 733 6 0 2013-01-01 06:00:00 7 2013 1 1 558 600 -2 924 917 7 UA 194 N29129 JFK LAX 345 2475 6 0 2013-01-01 06:00:00 8 2013 1 1 558 600 -2 923 937 -14 UA 1124 N53441 EWR SFO 361 2565 6 0 2013-01-01 06:00:00 9 2013 1 1 559 600 -1 941 910 31 AA 707 N3DUAA LGA DFW 257 1389 6 0 2013-01-01 06:00:00 10 2013 1 1 559 600 -1 854 902 -8 UA 1187 N76515 EWR LAS 337 2227 6 0 2013-01-01 06:00:00 # ℹ 139,494 more rows ``` --- ## `filter()` exercises Find all flights that a. Had an arrival delay of two or more hours. b. Arrived more than two hours late, but didn't leave late. c. Flew to Houston (`IAH` or `HOU`). d. Were operated by United, American, or Delta. e. Departed in summer (July, August, and September). --- ## `filter()` exercises solutions Arrival delay of two or more hours ``` r flights %>% filter(arr_delay > 120) ``` Arrived more than two hours late, but didn't leave late ``` r flights %>% filter(!is.na(dep_delay), dep_delay <= 0, arr_delay > 120) ``` Flew to Houston (`IAH` or `HOU`) ``` r flights %>% filter(dest %in% c("IAH", "HOU")) ``` --- ## `filter()` exercises solutions Were operated by United, American, or Delta ``` r flights %>% filter(carrier %in% c("AA", "DL", "UA")) ``` Departed in summer (July, August, and September) ``` r flights %>% filter(between(month, 7, 9)) ``` --- ## Arrange rows with `arrange()` ### Changes the order of the rows Takes a data frame and orders it by a set of column names (or more complicated expressions) -- - orders a column in ascending order by default - `desc()` can be used to reorder a column in `desc`ending order -- - **Note:** missing values (`NA`s) are always sorted at the end --- ### Ex: Sort flights to find the most delayed flights. Find the flights that left earliest. ``` r flights ``` ``` # A tibble: 336,776 × 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm> 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL 158 1065 6 0 2013-01-01 06:00:00 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53 229 6 0 2013-01-01 06:00:00 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO 140 944 6 0 2013-01-01 06:00:00 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138 733 6 0 2013-01-01 06:00:00 # ℹ 336,766 more rows ``` --- #### Sort flights to find the most delayed flights ``` r arrange(flights, desc(dep_delay)) ``` ``` # A tibble: 3 × 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm> 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384HA JFK HNL 640 4983 9 0 2013-01-09 09:00:00 2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ 3535 N504MQ JFK CMH 74 483 19 35 2013-06-15 19:00:00 3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ 3695 N517MQ EWR ORD 111 719 16 35 2013-01-10 16:00:00 ``` -- #### Find the flights that left earliest ``` r arrange(flights, dep_delay) ``` ``` # A tibble: 3 × 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm> 1 2013 12 7 2040 2123 -43 40 2352 48 B6 97 N592JB JFK DEN 265 1626 21 23 2013-12-07 21:00:00 2 2013 2 3 2022 2055 -33 2240 2338 -58 DL 1715 N612DL LGA MSY 162 1183 20 55 2013-02-03 20:00:00 3 2013 11 10 1408 1440 -32 1549 1559 -10 EV 5713 N825AS LGA IAD 52 229 14 40 2013-11-10 14:00:00 ``` --- ## `arrange()` exercises Sort `flights` to a. find the flight that departed the earliest (earlier than scheduled). b. find the most delayed flight. c. find the flight that travelled the longest and that travelled the shortest distance. --- ## `arrange()` exercises solutions Flight B6 97 left the earliest (43 minutes early). ``` r flights %>% arrange(dep_delay) ``` Flight HA 51 is the most delayed flight (1301 minutes delay). ``` r flights %>% arrange(desc(dep_delay)) ``` Flight HA 51 is the longest (4,983 miles), flight US 1632 is the shortest (17 miles). ``` r flights %>% arrange(desc(distance)) ``` --- ## Select columns with `select()` <img src="images/select.png" width="750px" style="display: block; margin: auto;" /> Sometimes we get data sets that have hundreds or thousands of variables 😬 -- `select()` gives us a way to narrow down to only those variables we're interested in! 🎉 --- ## Select columns with `select()` ### Helper functions include - `starts_with("abc")`: matches names that begin with “abc”. - `ends_with("xyz")`: matches names that end with “xyz”. - `contains("ijk")`: matches names that contain “ijk”. - `matches("(.)\\1")`: selects variables that match a regular expression. This one matches any variables that contain repeated characters (see **Chapter 14: Strings** in `R4DS`) - `num_range("x", 1:3)`: matches x1, x2 and x3. -- - ⭐ `everything()`: useful when you want to move a few variables to the start of the data frame --- ### Ex: Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights. ``` r flights ``` ``` # A tibble: 336,776 × 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm> 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL 158 1065 6 0 2013-01-01 06:00:00 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53 229 6 0 2013-01-01 06:00:00 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO 140 944 6 0 2013-01-01 06:00:00 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138 733 6 0 2013-01-01 06:00:00 # ℹ 336,766 more rows ``` --- #### Ex: Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights. ``` r select(flights, dep_time, dep_delay, arr_time, arr_delay) ``` ``` # A tibble: 336,776 × 4 dep_time dep_delay arr_time arr_delay <int> <dbl> <int> <dbl> 1 517 2 830 11 2 533 4 850 20 3 542 2 923 33 4 544 -1 1004 -18 5 554 -6 812 -25 6 554 -4 740 12 7 555 -5 913 19 8 557 -3 709 -14 9 557 -3 838 -8 10 558 -2 753 8 # ℹ 336,766 more rows ``` --- #### Ex: Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights. ``` r select(flights, starts_with("dep") | starts_with("arr")) ``` ``` # A tibble: 336,776 × 4 dep_time dep_delay arr_time arr_delay <int> <dbl> <int> <dbl> 1 517 2 830 11 2 533 4 850 20 3 542 2 923 33 4 544 -1 1004 -18 5 554 -6 812 -25 6 554 -4 740 12 7 555 -5 913 19 8 557 -3 709 -14 9 557 -3 838 -8 10 558 -2 753 8 # ℹ 336,766 more rows ``` --- #### Ex: Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from flights. ``` r select(flights, ends_with("time") | ends_with("delay")) ``` ``` # A tibble: 336,776 × 7 dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay arr_delay <int> <int> <int> <int> <dbl> <dbl> <dbl> 1 517 515 830 819 227 2 11 2 533 529 850 830 227 4 20 3 542 540 923 850 160 2 33 4 544 545 1004 1022 183 -1 -18 5 554 600 812 837 116 -6 -25 6 554 558 740 728 150 -4 12 7 555 600 913 854 158 -5 19 8 557 600 709 723 53 -3 -14 9 557 600 838 846 140 -3 -8 10 558 600 753 745 138 -2 8 # ℹ 336,766 more rows ``` ## Not quite! Why? --- ## `select()` exercises 1. What are three distinct ways to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`. 2. What does the `one_of()` function do? Why might it be helpful in conjunction with this vector? ``` r vars <- c("year", "month", "day", "dep_delay", "arr_delay") ``` --- ## `select()` exercise solutions Three distinct ways to select variables. ``` r select(flights, dep_time, dep_delay, arr_time, arr_delay) ``` ``` r select(flights, starts_with("dep_"), starts_with("arr_")) ``` ``` r select(flights, matches("^(dep|arr)_(time|delay)$")) ``` What does the `one_of()` function do? ``` r vars <- c("year", "month", "day", "dep_delay", "arr_delay") select(flights, one_of(vars)) ``` --- ## Add new variables with `mutate()` Helps you add new columns that are functions of existing columns <img src="images/mutate.png" width="750px" style="display: block; margin: auto;" /> -- - always adds columns at the end of your dataset -- - if you want to only keep the new variables, you can use `transmute()` --- ### Add new variables with `mutate()` .pull-left[ Can be used with _arithmetic operators_ - `+` - `-` - `*` - `/` - `^` - `sum()` ] -- .pull-right[ _modular arithmetic_ - `%/%` (integer division) - `%%` (remainder) - `log()` - `log2()` - `log10()` ] -- .pull-left[ - _logical comparisons_: `<`, `<=`, `>`, `>=`, `!=` - _offsets_: `lead()` and `lag()` - _ranking_: `min_rank()`, `row_number()`, `dense_rank()`, and more! ] --- ## `mutate()` ``` r flights %>% select(ends_with("delay"), distance, air_time) %>% mutate(gain = arr_delay - dep_delay, speed = distance / air_time * 60 ) ``` | dep_delay| arr_delay| distance| air_time| gain| speed| |---------:|---------:|--------:|--------:|----:|--------:| | 2| 11| 1400| 227| 9| 370.0441| | 4| 20| 1416| 227| 16| 374.2731| | 2| 33| 1089| 160| 31| 408.3750| | -1| -18| 1576| 183| -17| 516.7213| | -6| -25| 762| 116| -19| 394.1379| | -4| 12| 719| 150| 16| 287.6000| --- ## `transmute()` Use `transmute()` to only keep the new variables: ``` r transmute(flights, gain = arr_delay - dep_delay, hours = air_time / 60, gain_per_hour = gain / hours ) ``` --- ## `mutate()` exercises Use `mutate()` to 1. Create new variables for `dep_time` and `sched_dep_time` that measure time in the number of minutes after midnight. 2. Care `air_time` with `arr_time - dep_time`. What do you see? What do you need to do to fix it? 3. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? --- ## `mutate()` exercise solutions New variables for `dep_time` and `sched_dep_time` ``` r flights %>% mutate(dep_time2 = dep_time %/% 100 * 60 + dep_time %% 100, sched_dep_time2 = sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %>% select(dep_time, dep_time2, sched_dep_time, sched_dep_time2) ``` --- ## `mutate()` exercise solutions `arr_time` and `dep_time` may be in different time zones. ``` r flights %>% mutate(air_time2 = arr_time - dep_time, air_time_diff = air_time2 - air_time) %>% filter(air_time_diff != 0) %>% select(air_time, air_time2, dep_time, arr_time, dest) ``` 10 most delayed flights ``` r flights %>% mutate(dep_delay_rank = min_rank(-dep_delay)) %>% arrange(dep_delay_rank) %>% filter(dep_delay_rank <= 10) ``` --- ## `summarize()` `summarize()` collapses a data frame to a single row. <img src="images/summarise.png" width="750px" style="display: block; margin: auto;" /> ``` r summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) ``` ``` # A tibble: 1 × 1 delay <dbl> 1 12.6 ``` --- ## `summarize()` with `group_by()` `summarize()` is most effectively used with `group_by()`, which changes the unit of analysis from the complete dataset to individual groups. <img src="images/group.png" width="250px" style="display: block; margin: auto;" /> Grouping is most useful in conjunction with `summarise()`, but you can also do convenient operations with `mutate()` and `filter()`. --- ## `summarize()` with `group_by()` For example, to get the average delay per date ``` r flights %>% group_by(year, month, day) %>% summarise(delay = mean(dep_delay, na.rm = TRUE)) ``` --- ## `summarize()` count For aggregations it is generally a good idea to include a count `n()`. For example, let’s look at the (not cancelled) planes that have the highest average delays: ``` r flights %>% filter(!is.na(dep_delay), !is.na(arr_delay)) group_by(tailnum) %>% summarise(delay = mean(arr_delay)) %>% arrange(delay) ``` --- ## `summarize()` useful functions There are a number of useful summary functions: * Measures of location: `mean(x)`, `sum(x)`, `median(x)`. * Measures of spread: `sd(x)`, `IQR(x)`, `mad(x)`. * Measures of rank: `min(x)`, `quantile(x, 0.25)`, `max(x)`. * Measures of position: `first(x)`, `nth(x, 2)`, `last(x)`. * Counts: `n()`, `sum(!is.na(x))`, `n_distinct(x)`. * Counts and proportions of logical values: `sum(x > 10)`, `mean(y == 0)`. --- ## `summarize()` exercises Use `summarize()` to. 1. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? 2. Find the carrier with the worst delays. --- ## `summarize()` exercises solutions Number of cancelled flights per day ``` r cancelled_delayed <- flights %>% mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>% group_by(year, month, day) %>% summarise(prop_cancelled = mean(cancelled), avg_dep_delay = mean(dep_delay, na.rm = TRUE)) ``` <img src="5-Data-Transformation_files/figure-html/unnamed-chunk-50-1.png" width="432" style="display: block; margin: auto;" /> --- ## `summarize()` exercises solutions Frontier Airlines (FL) has the worst delays. ``` r flights %>% group_by(carrier) %>% summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>% arrange(desc(arr_delay)) ``` --- ## Grouped summaries with `summarize()` Collapses a data frame into a single row and more useful when combined with `group_by()` -- Arguably **most** useful when used with the pipe `%>%` operator -- Go from this: ``` r by_dest <- group_by(flights, dest) delay <- summarize(by_dest, count = n(), dis = mean(distance, na.rm = TRUE), delay = mean(arr_del, na.rm = TRUE)) ``` -- To this! 🎉 ``` r flights %>% group_by(dest) %>% summarize(by_dest, count = n(), dis = mean(distance, na.rm = TRUE), delay = mean(arr_del, na.rm = TRUE)) ``` #### Spot the differences! --- ## Grouped mutate (and filters) Most useful in combination with `summarize()` -- For example: ``` r popular_dests <- flights %>% group_by(dest) %>% filter(n() > 365) ``` ### What does it do?