BANL 6100: Business Analytics

.title[
# BANL 6100: Business Analytics
]
.subtitle[
## Data Processing in R
]
.author[
### Mehmet Balcilar </br> <a href="mailto:mbalcilar@newhaven.edu" class="email">mbalcilar@newhaven.edu</a>
]
.institute[
### Univeristy of New Haven
]
.date[
### 2023-08-28 (updated: 2023-09-18)
]

---

# .red[ .fat[ Tidyverse & Tidy Data ] ]

---
class: center, middle, sydney-blue

# Data Workflow &mdash; Tidyverse

---

## Key points

- Data wrangling

- Tidy format

- How to make you data tidy

- `tidyverse` package

### It is not a programming unit! We concentrate on **DATA**!!!

---

## Data wrangling

or **data munging**

is the process of transforming and mapping data from one *"raw"* data form into another format to make it more appropriate and valuable for further processing.

![ ](images/workflow.png)
---
## Data preparation

Library `tidyverse` [https://www.tidyverse.org/](https://www.tidyverse.org/)

![:scale 40% ](images/tidyverse.png)

---
## Updated workflow

![ ](images/workflow2.png)
---
## Install `tidyverse`

```r
#Step 1
install.packages("tidyverse") #run only ONCE to install!

#Step 2
library(tidyverse) #run in every file that uses tidyverse functions
```

`tidyverse` is a collection of packages combined to help with data wrangling
More help is at https://www.tidyverse.org/ 
---

## Data import

### Sources of data:

- Flat files (e.g. `csv` and `txt`)
read_csv() 
write_csv()

- "old"
read.csv()
write.csv()

- readxl()
- writexl()
---

## Read Avocado data

### avocado.csv

The Avocado dataset includes consumption of fruit in different regions of USA from 2015 till 2018 years of data.

Variables:
    - Date: The date of the observation
    - AveragePrice: the average price of a single avocado
    - Total Volume: Total number of avocados sold
    - 4046: Total number of avocados with PLU 4046 sold
    - 4225: Total number of avocados with PLU 4225 sold
    - 4770: Total number of avocados with PLU 4770 sold
    - Small Bags
    - Large Bags
    - XLarge Bags: extra large bags
    - Total Bags
    - type: organic, conventional
    - year	
    - region

---

```r
avocado <- read_csv("data/avocado.csv")
head(avocado)
```

```
# A tibble: 6 × 14
   ...1 Date       AveragePrice `Total Volume` `4046`  `4225` `4770` `Total Bags` `Small Bags` `Large Bags` `XLarge Bags` type          year region
  <dbl> <date>            <dbl>          <dbl>  <dbl>   <dbl>  <dbl>        <dbl>        <dbl>        <dbl>         <dbl> <chr>        <dbl> <chr> 
1     0 2015-12-27         1.33         64237.  1037.  54455.   48.2        8697.        8604.         93.2             0 conventional  2015 Albany
2     1 2015-12-20         1.35         54877.   674.  44639.   58.3        9506.        9408.         97.5             0 conventional  2015 Albany
3     2 2015-12-13         0.93        118220.   795. 109150.  130.         8145.        8042.        103.              0 conventional  2015 Albany
4     3 2015-12-06         1.08         78992.  1132   71976.   72.6        5811.        5677.        134.              0 conventional  2015 Albany
5     4 2015-11-29         1.28         51040.   941.  43838.   75.8        6184.        5986.        198.              0 conventional  2015 Albany
6     5 2015-11-22         1.26         55980.  1184.  48068.   43.6        6684.        6556.        127.              0 conventional  2015 Albany
```

---
# Data import: dates/times

`ymd()`
`ymd_hms()`
`dmy()`
`dmy_hms()`
`mdy()`

```r
#install.packages("lubridate")
library(lubridate)

#Example 1
ymd(20200830)
```

```
[1] "2020-08-30"
```

```r
date<-"20190130"
date %>% ymd()
```

```
[1] "2019-01-30"
```

```r
#Example 2
mdy("1/24/18")
```

```
[1] "2018-01-24"
```
---
## Data basics

### graduate-programs.csv
Data from a study on US grad programs.

Originally came in an excel file containing rankings of many different programs.

Contains information on four programs:

- Astronomy
- Economics
- Entomology, and
- Psychology

---

```r
library(tidyverse)
grad <- read_csv("data/graduate-programs.csv")
grad %>% 
  top_n(10)
```

```
# A tibble: 10 × 16
   subject    Inst                                       AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDegree PctMinorityFac PctFemaleFac PctFemaleStud PctIntlStud AvNumPhDs AvGREs TotFac PctAsstProf NumStud
   <chr>      <chr>                                          <dbl>     <dbl>        <dbl>         <dbl>              <dbl>          <dbl>        <dbl>         <dbl>       <dbl>     <dbl>  <dbl>  <dbl>       <dbl>   <dbl>
 1 economics  BOSTON UNIVERSITY                               0.49      2.66         36.9          34.2               5.5             0           66.7          37.2        87.2      11.6    788     34          26     148
 2 economics  UNIVERSITY OF CALIFORNIA-BERKELEY               0.79      2.68         71.4          42.6               5.7             4.5         10.9          34.7        34.1      23      790     60          20     170
 3 economics  UNIVERSITY OF CHICAGO                           0.61      3.44         54.8          62.5               5.5             6.9          3.2          19.8        72.5      26.4    800     70           9     182
 4 economics  UNIVERSITY OF MARYLAND COLLEGE PARK             0.61      1.81         50.4          37.9               5.3            16.2         15.4          38.2        70.7      22      791     52          17     152
 5 psychology NORTHERN ILLINOIS UNIVERSITY                    0.92      1.32         45.8          32.1               7               3.4         44.8          75.2         4.5      12.2    631     35          20     157
 6 psychology UNIVERSITY OF CALIFORNIA-LOS ANGELES            1.42      4.45         72            48.9               6              11.7         34.4          63.6         8        23.8    737     82          18     162
 7 psychology UNIVERSITY OF CONNECTICUT                       1.15      3.47         69.5          46.8               6               3.6         41.8          63.9        15.2      17.6    680     86          17     158
 8 psychology UNIVERSITY OF FLORIDA                           1.05      1.72         65.6          34.2               6               5.7         30            66.1        13.7       9.2    672     58           3     168
 9 psychology UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN      1.39      3.3          57.2          28.1               6.17            7.1         37.9          65.1        26.4      22      742    110          25     175
10 psychology UNIVERSITY OF MINNESOTA-TWIN CITIES             2.02      2.89         69.9          39                 6               1.8         25            55.5        12.9      17      737     66          18     155
```
---
## Piping [`%>%`]

Piping is a special symbol `%>%` that allows you to "shuffle" your data from one step of processing to another set.

Compare the use of `head()` and `top_n()` function with the same dataset
.pull-left[

```r
avocado %>% 
  top_n(15)
```

```
# A tibble: 335 × 14
    ...1 Date       AveragePrice `Total Volume`  `4046`  `4225` `4770` `Total Bags` `Small Bags` `Large Bags` `XLarge Bags` type          year region          
   <dbl> <date>            <dbl>          <dbl>   <dbl>   <dbl>  <dbl>        <dbl>        <dbl>        <dbl>         <dbl> <chr>        <dbl> <chr>           
 1     0 2015-12-27         0.71        776404. 451905. 141599. 15487.      167414.      123158.       33065.        11190  conventional  2015 WestTexNewMexico
 2     1 2015-12-20         0.83        649886. 389111. 108176. 12954.      139645.       90393.       23536.        25717. conventional  2015 WestTexNewMexico
 3     2 2015-12-13         0.78        646042. 437781. 100110. 13576.       94574.       83053.       10948.          573. conventional  2015 WestTexNewMexico
 4     3 2015-12-06         0.74        623232. 398871. 133434. 21088.       69838.       68234.        1605.            0  conventional  2015 WestTexNewMexico
 5     4 2015-11-29         0.81        519028. 335447. 103636. 11463.       68483.       67265.        1218.            0  conventional  2015 WestTexNewMexico
 6     5 2015-11-22         0.78        615077. 403133. 122078. 12009.       77857.       74953.        2904.            0  conventional  2015 WestTexNewMexico
 7     6 2015-11-15         0.71        732695. 483181. 155779. 12781.       80955.       77112.        3843.            0  conventional  2015 WestTexNewMexico
 8     7 2015-11-08         0.76        661406. 445281. 123746. 11968        80411.       79009.        1402.            0  conventional  2015 WestTexNewMexico
 9     8 2015-11-01         0.8         592116. 350882. 131577. 32803.       76855.       75700.        1155.            0  conventional  2015 WestTexNewMexico
10     9 2015-10-25         0.82        635874. 363487. 166608. 31960.       73819.       72718.        1101.            0  conventional  2015 WestTexNewMexico
# ℹ 325 more rows
```
]

```r
head(avocado)
```

---
### Glimpse 🤭

```r
avocado %>% 
  glimpse()
```

```
Rows: 18,249
Columns: 14
$ ...1           <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 5…
$ Date           <date> 2015-12-27, 2015-12-20, 2015-12-13, 2015-12-06, 2015-11-29, 2015-11-22, 2015-11-15, 2015-11-08, 2015-11-01, 2015-10-25, 2015-10-18, 2015-10-11, 2015-10-04, 2015-09-27, 2015-09-20, 2015-09-13, 20…
$ AveragePrice   <dbl> 1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99, 0.98, 1.02, 1.07, 1.12, 1.28, 1.31, 0.99, 1.33, 1.28, 1.11, 1.07, 1.34, 1.33, 1.12, 1.45, 1.11, 1.26, 1.05, 1.35, 1.37, 1.27, 1.32, 1.07, 1.23, 1.19, 1.4…
$ `Total Volume` <dbl> 64236.62, 54876.98, 118220.22, 78992.15, 51039.60, 55979.78, 83453.76, 109428.33, 99811.42, 74338.76, 84843.44, 64489.17, 61007.10, 106803.39, 69759.01, 76111.27, 99172.96, 105693.84, 79992.09, 8…
$ `4046`         <dbl> 1036.74, 674.28, 794.70, 1132.00, 941.48, 1184.27, 1368.92, 703.75, 1022.15, 842.40, 924.86, 1582.03, 2268.32, 1204.88, 1028.03, 985.73, 879.45, 689.01, 733.16, 539.65, 584.63, 509.94, 648.75, 10…
$ `4225`         <dbl> 54454.85, 44638.81, 109149.67, 71976.41, 43838.39, 48067.99, 73672.72, 101815.36, 87315.57, 64757.44, 75595.85, 52677.92, 49880.67, 99409.21, 59313.12, 65696.86, 90062.62, 94362.67, 67933.79, 686…
$ `4770`         <dbl> 48.16, 58.33, 130.50, 72.58, 75.78, 43.61, 93.26, 80.00, 85.34, 113.00, 117.07, 105.32, 101.36, 154.84, 150.50, 142.00, 240.79, 335.43, 444.78, 394.90, 368.95, 741.08, 966.61, 2238.02, 4257.64, 5…
$ `Total Bags`   <dbl> 8696.87, 9505.56, 8145.35, 5811.16, 6183.95, 6683.91, 8318.86, 6829.22, 11388.36, 8625.92, 8205.66, 10123.90, 8756.75, 6034.46, 9267.36, 9286.68, 7990.10, 10306.73, 10880.36, 10443.22, 9225.89, 1…
$ `Small Bags`   <dbl> 8603.62, 9408.07, 8042.21, 5677.40, 5986.26, 6556.47, 8196.81, 6266.85, 11104.53, 8061.47, 7877.86, 9866.27, 8379.98, 5888.87, 8489.10, 8665.19, 7762.87, 10218.93, 10745.79, 10297.68, 9116.34, 11…
$ `Large Bags`   <dbl> 93.25, 97.49, 103.14, 133.76, 197.69, 127.44, 122.05, 562.37, 283.83, 564.45, 327.80, 257.63, 376.77, 145.59, 778.26, 621.49, 227.23, 87.80, 134.57, 145.54, 109.55, 78.50, 131.16, 183.99, 108.49,…
$ `XLarge Bags`  <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 33.33, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.…
$ type           <chr> "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "conventional", "co…
$ year           <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ region         <chr> "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Albany", "Alba…
```

---
## Data basics

Let’s have a look at our example: graduate-programs.csv.

Now, it’s your turn:

"What is the average number of graduate students per
economics program?:"

"What is the best description of the relationship between
number of students and median time to degree?"

---
- "The average number of graduate students per economics program is:"
- "about 61" (correct)
- about 39
"What is the best description of the relationship between number of students and median time to degree?"
- "as the number of students increases the median time to degree increases, weakly" (correct)
- as the number of students increases the variability in median time to degree decreases
---

## Data vs tidy data

“Real data is **messy**. ...It's all **very noisy** out there. Very hard to spot the tune. Like a piano in the next room, it's playing your song, but unfortunately it's out of whack, some of the strings are missing, and the pianist is tone deaf and drunk- I mean, the noise! Impossible!” ― *Tom Stoppard*

“The first change we had to make was just to make our data of higher quality. We have **a lot of data**, and sometimes we just weren’t using that data, and we weren’t paying as much attention to its **quality** as we now need to…” - *Ash Gupta*

“I think you can have a ridiculously **enormous** and **complex** data set, but if you have the **right tools** and **methodology**, then it’s not a problem.” - *Aaron Koblin*
---

## Tidy data vs messy data

```r
table1
```

```
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
```

```r
table2
```

```
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
```
]

```r
table3
```

```
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
```

```r
table4a
```

```
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
```
]
---

## Rules of tidy data:

- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.

What if my data is not tidy?

*Use `tidyverse` to tidy it up!*

![ ](images/tidydata.png)
---

## Take away points

- loading `tidyverse`

- tidy data rules

- working with functions through piping `%>%`