class: center, middle, inverse, title-slide .title[ # BANL 6100: Business Analytics ] .subtitle[ ## Data Processing in R ] .author[ ### Mehmet Balcilar
mbalcilar@newhaven.edu
] .institute[ ### Univeristy of New Haven ] .date[ ### 2023-08-28 (updated: 2023-09-18) ] --- class: center, middle background-image: url(images/ba1.jpg) background-position: 20% 10% # .red[ .fat[ Tidyverse & Tidy Data ] ] --- class: center, middle, sydney-blue # Data Workflow — Tidyverse --- ## Key points - Data wrangling - Tidy format - How to make you data tidy - `tidyverse` package ### It is not a programming unit! We concentrate on **DATA**!!! --- ## Data wrangling or **data munging** is the process of transforming and mapping data from one *"raw"* data form into another format to make it more appropriate and valuable for further processing.  --- ## Data preparation Library `tidyverse` [https://www.tidyverse.org/](https://www.tidyverse.org/)  --- ## Updated workflow  --- ## Install `tidyverse` ```r #Step 1 install.packages("tidyverse") #run only ONCE to install! #Step 2 library(tidyverse) #run in every file that uses tidyverse functions ``` `tidyverse` is a collection of packages combined to help with data wrangling More help is at https://www.tidyverse.org/ --- ## Data import ### Sources of data: - Flat files (e.g. `csv` and `txt`) read_csv() write_csv() - "old" read.csv() write.csv() - readxl() - writexl() --- ## Read Avocado data ### avocado.csv The Avocado dataset includes consumption of fruit in different regions of USA from 2015 till 2018 years of data. Variables: - Date: The date of the observation - AveragePrice: the average price of a single avocado - Total Volume: Total number of avocados sold - 4046: Total number of avocados with PLU 4046 sold - 4225: Total number of avocados with PLU 4225 sold - 4770: Total number of avocados with PLU 4770 sold - Small Bags - Large Bags - XLarge Bags: extra large bags - Total Bags - type: organic, conventional - year - region --- ```r avocado <- read_csv("data/avocado.csv") head(avocado) ``` ``` # A tibble: 6 × 14 ...1 Date AveragePrice `Total Volume` `4046` `4225` `4770` `Total Bags` `Small Bags` `Large Bags` `XLarge Bags` type year region <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> 1 0 2015-12-27 1.33 64237. 1037. 54455. 48.2 8697. 8604. 93.2 0 conventional 2015 Albany 2 1 2015-12-20 1.35 54877. 674. 44639. 58.3 9506. 9408. 97.5 0 conventional 2015 Albany 3 2 2015-12-13 0.93 118220. 795. 109150. 130. 8145. 8042. 103. 0 conventional 2015 Albany 4 3 2015-12-06 1.08 78992. 1132 71976. 72.6 5811. 5677. 134. 0 conventional 2015 Albany 5 4 2015-11-29 1.28 51040. 941. 43838. 75.8 6184. 5986. 198. 0 conventional 2015 Albany 6 5 2015-11-22 1.26 55980. 1184. 48068. 43.6 6684. 6556. 127. 0 conventional 2015 Albany ``` --- # Data import: dates/times `ymd()` `ymd_hms()` `dmy()` `dmy_hms()` `mdy()` ```r #install.packages("lubridate") library(lubridate) #Example 1 ymd(20200830) ``` ``` [1] "2020-08-30" ``` ```r date<-"20190130" date %>% ymd() ``` ``` [1] "2019-01-30" ``` ```r #Example 2 mdy("1/24/18") ``` ``` [1] "2018-01-24" ``` --- ## Data basics ### graduate-programs.csv Data from a study on US grad programs. Originally came in an excel file containing rankings of many different programs. Contains information on four programs: - Astronomy - Economics - Entomology, and - Psychology --- ```r library(tidyverse) grad <- read_csv("data/graduate-programs.csv") grad %>% top_n(10) ``` ``` # A tibble: 10 × 16 subject Inst AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDegree PctMinorityFac PctFemaleFac PctFemaleStud PctIntlStud AvNumPhDs AvGREs TotFac PctAsstProf NumStud <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 economics BOSTON UNIVERSITY 0.49 2.66 36.9 34.2 5.5 0 66.7 37.2 87.2 11.6 788 34 26 148 2 economics UNIVERSITY OF CALIFORNIA-BERKELEY 0.79 2.68 71.4 42.6 5.7 4.5 10.9 34.7 34.1 23 790 60 20 170 3 economics UNIVERSITY OF CHICAGO 0.61 3.44 54.8 62.5 5.5 6.9 3.2 19.8 72.5 26.4 800 70 9 182 4 economics UNIVERSITY OF MARYLAND COLLEGE PARK 0.61 1.81 50.4 37.9 5.3 16.2 15.4 38.2 70.7 22 791 52 17 152 5 psychology NORTHERN ILLINOIS UNIVERSITY 0.92 1.32 45.8 32.1 7 3.4 44.8 75.2 4.5 12.2 631 35 20 157 6 psychology UNIVERSITY OF CALIFORNIA-LOS ANGELES 1.42 4.45 72 48.9 6 11.7 34.4 63.6 8 23.8 737 82 18 162 7 psychology UNIVERSITY OF CONNECTICUT 1.15 3.47 69.5 46.8 6 3.6 41.8 63.9 15.2 17.6 680 86 17 158 8 psychology UNIVERSITY OF FLORIDA 1.05 1.72 65.6 34.2 6 5.7 30 66.1 13.7 9.2 672 58 3 168 9 psychology UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 1.39 3.3 57.2 28.1 6.17 7.1 37.9 65.1 26.4 22 742 110 25 175 10 psychology UNIVERSITY OF MINNESOTA-TWIN CITIES 2.02 2.89 69.9 39 6 1.8 25 55.5 12.9 17 737 66 18 155 ``` --- ## Piping [`%>%`] Piping is a special symbol `%>%` that allows you to "shuffle" your data from one step of processing to another set. Compare the use of `head()` and `top_n()` function with the same dataset .pull-left[ ```r avocado %>% top_n(15) ``` ``` # A tibble: 335 × 14 ...1 Date AveragePrice `Total Volume` `4046` `4225` `4770` `Total Bags` `Small Bags` `Large Bags` `XLarge Bags` type year region <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> 1 0 2015-12-27 0.71 776404. 451905. 141599. 15487. 167414. 123158. 33065. 11190 conventional 2015 WestTexNewMexico 2 1 2015-12-20 0.83 649886. 389111. 108176. 12954. 139645. 90393. 23536. 25717. conventional 2015 WestTexNewMexico 3 2 2015-12-13 0.78 646042. 437781. 100110. 13576. 94574. 83053. 10948. 573. conventional 2015 WestTexNewMexico 4 3 2015-12-06 0.74 623232. 398871. 133434. 21088. 69838. 68234. 1605. 0 conventional 2015 WestTexNewMexico 5 4 2015-11-29 0.81 519028. 335447. 103636. 11463. 68483. 67265. 1218. 0 conventional 2015 WestTexNewMexico 6 5 2015-11-22 0.78 615077. 403133. 122078. 12009. 77857. 74953. 2904. 0 conventional 2015 WestTexNewMexico 7 6 2015-11-15 0.71 732695. 483181. 155779. 12781. 80955. 77112. 3843. 0 conventional 2015 WestTexNewMexico 8 7 2015-11-08 0.76 661406. 445281. 123746. 11968 80411. 79009. 1402. 0 conventional 2015 WestTexNewMexico 9 8 2015-11-01 0.8 592116. 350882. 131577. 32803. 76855. 75700. 1155. 0 conventional 2015 WestTexNewMexico 10 9 2015-10-25 0.82 635874. 363487. 166608. 31960. 73819. 72718. 1101. 0 conventional 2015 WestTexNewMexico # ℹ 325 more rows ``` ] .pull-right[ ```r head(avocado) ``` ``` # A tibble: 6 × 14 ...1 Date AveragePrice `Total Volume` `4046` `4225` `4770` `Total Bags` `Small Bags` `Large Bags` `XLarge Bags` type year region <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> 1 0 2015-12-27 1.33 64237. 1037. 54455. 48.2 8697. 8604. 93.2 0 conventional 2015 Albany 2 1 2015-12-20 1.35 54877. 674. 44639. 58.3 9506. 9408. 97.5 0 conventional 2015 Albany 3 2 2015-12-13 0.93 118220. 795. 109150. 130. 8145. 8042. 103. 0 conventional 2015 Albany 4 3 2015-12-06 1.08 78992. 1132 71976. 72.6 5811. 5677. 134. 0 conventional 2015 Albany 5 4 2015-11-29 1.28 51040. 941. 43838. 75.8 6184. 5986. 198. 0 conventional 2015 Albany 6 5 2015-11-22 1.26 55980. 1184. 48068. 43.6 6684. 6556. 127. 0 conventional 2015 Albany ``` ] --- ### Glimpse 🤭 ```r avocado %>% glimpse() ``` ``` Rows: 18,249 Columns: 14 $ ...1 <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 5… $ Date <date> 2015-12-27, 2015-12-20, 2015-12-13, 2015-12-06, 2015-11-29, 2015-11-22, 2015-11-15, 2015-11-08, 2015-11-01, 2015-10-25, 2015-10-18, 2015-10-11, 2015-10-04, 2015-09-27, 2015-09-20, 2015-09-13, 20… $ AveragePrice <dbl> 1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99, 0.98, 1.02, 1.07, 1.12, 1.28, 1.31, 0.99, 1.33, 1.28, 1.11, 1.07, 1.34, 1.33, 1.12, 1.45, 1.11, 1.26, 1.05, 1.35, 1.37, 1.27, 1.32, 1.07, 1.23, 1.19, 1.4… $ `Total Volume` <dbl> 64236.62, 54876.98, 118220.22, 78992.15, 51039.60, 55979.78, 83453.76, 109428.33, 99811.42, 74338.76, 84843.44, 64489.17, 61007.10, 106803.39, 69759.01, 76111.27, 99172.96, 105693.84, 79992.09, 8… $ `4046` <dbl> 1036.74, 674.28, 794.70, 1132.00, 941.48, 1184.27, 1368.92, 703.75, 1022.15, 842.40, 924.86, 1582.03, 2268.32, 1204.88, 1028.03, 985.73, 879.45, 689.01, 733.16, 539.65, 584.63, 509.94, 648.75, 10… $ `4225` <dbl> 54454.85, 44638.81, 109149.67, 71976.41, 43838.39, 48067.99, 73672.72, 101815.36, 87315.57, 64757.44, 75595.85, 52677.92, 49880.67, 99409.21, 59313.12, 65696.86, 90062.62, 94362.67, 67933.79, 686… $ `4770` <dbl> 48.16, 58.33, 130.50, 72.58, 75.78, 43.61, 93.26, 80.00, 85.34, 113.00, 117.07, 105.32, 101.36, 154.84, 150.50, 142.00, 240.79, 335.43, 444.78, 394.90, 368.95, 741.08, 966.61, 2238.02, 4257.64, 5… $ `Total Bags` <dbl> 8696.87, 9505.56, 8145.35, 5811.16, 6183.95, 6683.91, 8318.86, 6829.22, 11388.36, 8625.92, 8205.66, 10123.90, 8756.75, 6034.46, 9267.36, 9286.68, 7990.10, 10306.73, 10880.36, 10443.22, 9225.89, 1… $ `Small Bags` <dbl> 8603.62, 9408.07, 8042.21, 5677.40, 5986.26, 6556.47, 8196.81, 6266.85, 11104.53, 8061.47, 7877.86, 9866.27, 8379.98, 5888.87, 8489.10, 8665.19, 7762.87, 10218.93, 10745.79, 10297.68, 9116.34, 11… $ `Large Bags` <dbl> 93.25, 97.49, 103.14, 133.76, 197.69, 127.44, 122.05, 562.37, 283.83, 564.45, 327.80, 257.63, 376.77, 145.59, 778.26, 621.49, 227.23, 87.80, 134.57, 145.54, 109.55, 78.50, 131.16, 183.99, 108.49,… $ `XLarge Bags` <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 33.33, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.… $ type <chr> "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "co… $ year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201… $ region <chr> "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Alba… ``` --- ## Data basics Let’s have a look at our example: graduate-programs.csv. Now, it’s your turn: "What is the average number of graduate students per economics program?:" "What is the best description of the relationship between number of students and median time to degree?" --- - "The average number of graduate students per economics program is:" - "about 61" (correct) - about 39 "What is the best description of the relationship between number of students and median time to degree?" - "as the number of students increases the median time to degree increases, weakly" (correct) - as the number of students increases the variability in median time to degree decreases --- ## Data vs tidy data “Real data is **messy**. ...It's all **very noisy** out there. Very hard to spot the tune. Like a piano in the next room, it's playing your song, but unfortunately it's out of whack, some of the strings are missing, and the pianist is tone deaf and drunk- I mean, the noise! Impossible!” ― *Tom Stoppard* “The first change we had to make was just to make our data of higher quality. We have **a lot of data**, and sometimes we just weren’t using that data, and we weren’t paying as much attention to its **quality** as we now need to…” - *Ash Gupta* “I think you can have a ridiculously **enormous** and **complex** data set, but if you have the **right tools** and **methodology**, then it’s not a problem.” - *Aaron Koblin* --- ## Tidy data vs messy data .pull-left[ ```r table1 ``` ``` # A tibble: 6 × 4 country year cases population <chr> <dbl> <dbl> <dbl> 1 Afghanistan 1999 745 19987071 2 Afghanistan 2000 2666 20595360 3 Brazil 1999 37737 172006362 4 Brazil 2000 80488 174504898 5 China 1999 212258 1272915272 6 China 2000 213766 1280428583 ``` ```r table2 ``` ``` # A tibble: 12 × 4 country year type count <chr> <dbl> <chr> <dbl> 1 Afghanistan 1999 cases 745 2 Afghanistan 1999 population 19987071 3 Afghanistan 2000 cases 2666 4 Afghanistan 2000 population 20595360 5 Brazil 1999 cases 37737 6 Brazil 1999 population 172006362 7 Brazil 2000 cases 80488 8 Brazil 2000 population 174504898 9 China 1999 cases 212258 10 China 1999 population 1272915272 11 China 2000 cases 213766 12 China 2000 population 1280428583 ``` ] .pull-right[ ```r table3 ``` ``` # A tibble: 6 × 3 country year rate <chr> <dbl> <chr> 1 Afghanistan 1999 745/19987071 2 Afghanistan 2000 2666/20595360 3 Brazil 1999 37737/172006362 4 Brazil 2000 80488/174504898 5 China 1999 212258/1272915272 6 China 2000 213766/1280428583 ``` ```r table4a ``` ``` # A tibble: 3 × 3 country `1999` `2000` <chr> <dbl> <dbl> 1 Afghanistan 745 2666 2 Brazil 37737 80488 3 China 212258 213766 ``` ] --- ## Rules of tidy data: - Each variable must have its own column. - Each observation must have its own row. - Each value must have its own cell. What if my data is not tidy? *Use `tidyverse` to tidy it up!*  --- ## Take away points - loading `tidyverse` - tidy data rules - working with functions through piping `%>%`