class: center, middle, inverse, title-slide .title[ # BANL 6100: Business Analytics ] .subtitle[ ## Data Processing in R ] .author[ ### Mehmet Balcilar
mbalcilar@newhaven.edu
] .institute[ ### Univeristy of New Haven ] .date[ ### 2023-08-28 (updated: 2023-09-18) ] --- class: center, middle, sydney-blue # Importing Data --- ## Import Data In this presentation, we will explore the process of importing data that is stored in a flat file format. Flat file formats are widely used for data storage because they can be read by a wide range of software tools dedicated to handling data. Examples of flat file formats encompass Comma Separated Values (CSV), tab-delimited files, fixed-width files, and others. --- ## `readr` — I The **readr** R package provides straightforward and uniform functions for loading data stored in flat file formats. These readr functions serve as an alternative to the built-in R functions designed for reading flat files. In contrast to base R functions like `read.table()` and its variations, readr functions offer several benefits. These readr functions: * are ~10 times faster * return user-friendly tibbles * have more intuitive defaults. No rownames, no `stringsAsFactors = TRUE`. --- ## `readr` — II `readr` supplies several related functions, each designed to read in a specific flat file format. Function | Reads -------------- | -------------------------- `read_csv()` | Comma separated values `read_csv2()` | Semi-colon separate values `read_delim()` | General delimited files `read_fwf()` | Fixed width files `read_log()` | Apache log files `read_table()` | Space separated files `read_tsv()` | Tab delimited values --- ## `read_csv()` In this context, we'll focus on the `read_csv()` function, although the remaining functions operate in a comparable manner. In the majority of instances, you can employ the same syntax and parameters as `read_csv()` when working with the other functions mentioned earlier. It's worth noting that readr is an integral component of the tidyverse, and it becomes accessible automatically whenever you invoke `library(tidyverse)`. --- ## Sample data The worksheet `bank_credit.csv` file Bank Credit Risk Data provides information about 425 bank customers who had applied for loans. Each of the column is defined as follows: .pull-left[ - `Loan Purpose` : Type of purpose for the loan applied - `Checking` : Checking account balance - `Savings` : Savings account balance - `Months Customer`: Number of months has been a customer of the bank - `Months Employed`: Number of months in employment - `Gender`: Gender ] .pull-right[ - `Marital Status`: Marital status - `Age`: Age in years - `Housing`: Housing type - `Years`: Number of years at current residence - `Job`: Job type - `Credit Risk`: Credit-risk classification by the bank ] --- ## Importing Data — I To import the `bank_credit.csv` file, use the readr functions specifically designed for reading `.csv` files, which is `read_csv()`. Specify the initial argument of `read_csv()` as a character string representing the file path from your current working directory to the `bank_credit.csv` file. Therefore, if you have the `bank_credit.csv` file stored in your working directory, you can read it using the following command. ```r credit <- read_csv("bank_credit.csv") ``` --- ## Importing Data — II **Note:** you can determine the location of your working directory by running `getwd()`. You can change the location of your working directory by going to **Session > Set Working Directory** in the RStudio IDE menu bar. Notice that the code above saves the output to an object named nimbus. You must save the output of `read_csv()` to an object if you wish to use it later. If you do not save the output, `read_csv()` will merely print the contents of the data set at the command line. --- ## Your Turn 1 Find `bank_credit.csv` on your server or computer. Then read it into an object. Then view the results. --- ## Tibbles `read_csv()` reads the data into a **tibble**, which is a special class of data frame. Since a tibble is a sub-class of data frame, R will in most cases treat tibbles in exactly the same way that R treats data frames. R will display tibbles in a much more sensible way. If the **tibble** package is loaded, R will display the first ten rows of the tibble and as many columns as will fit in your console window. This display ensures that the name of each column is visible in the display. tibble is a core member of the tidyverse. It is loaded everytime you call `library(tidyverse)`. The tibble package includes the functions `tibble()` and `tribble()` for makign tibbles from scratch, as well as `as_tibble()` and `as.data.frame()` for converting back and forth between tibbles and data frames. In almost every case, you can ignore whether or not you are working with tibbles or data frames. --- ## Parsing NA's — I Sometimes data files comes with missing values which might be signified with a `.` (dot), `NA`, or some other signifies. Let us read another version of the credit data from the `bank_credit_dirty.csv` file. ```r credit_dirty <- read_csv("bank_credit_dirty.csv") ``` If you examine `credit_dirty` closely, you will notice that two initial values in the `Checking` column are `.`. Can you guess what `.` stands for? It is a missing value like `NA` value. --- ## Parsing NA's — II If you'd like R to treat these `.` values as missing values (and you should) you will need to convert them to `NA`s. One way to do this is to ask `read_csv()` to parse `.` values as `NA` values when it reads in the data. To do this add the argument `na = "."` to `read_csv()`: ```r credit_fixed <- read_csv("bank_credit_dirty.csv", na = ".") ``` You can set `na` to a single character string or a vector of character strings. `read_csv()` will transform every value listed in the `na` argument to an `NA` when it reads in the data. --- ## Parsing data types — I Sometimes `read_csv()` may fail to read column in the desired format. Usually a numerical column may be read as character. In the case of `bank_credit_dirty.csv`, when you use `read_csv()`, `read_csv()` tries to match each column of input to one of the basic data types in R. `read_csv` generally does a good job, but here the initial presence of the character strings `.` caused `read_csv()` to misidentify the contents of the `Checking` column. You can now correct this with R's `as.numeric()` function, or you can read the data in again, this time instructing `read_csv()` to parse the column as numbers. --- ## Parsing data types — II To do this, add the argument `col_types` to `read.csv()` and set it equal to a list. Add a named element to the list for each column you would like to manually parse. The name of the element should match the name of the column you wish to parse. So for example, if we wish to parse the `Checking` column into a specifc data type, we would begin by inserting the argument: ```r credit_fixed <- read_csv("bank_credit_dirty.csv", na = ".", col_types = list(Checking = col_double())) ``` --- ## Parsing data types — III To complete the code, set `Checking` equal to one of the functions below, each function instructs `read_csv()` to parse `Checking` as a specific type of data. .small[ Type function | Data Type ----------------- | ----------------------------------------- `col_character()` | character `col_date()` | Date `col_datetime()` | POSIXct (date-time) `col_double()` | double (numeric) `col_factor()` | factor `col_guess()` | let readr geuss (default) `col_integer()` | integer `col_logical()` | logical `col_number()` | numbers mixed with non-number characters `col_numeric()` | double or integer `col_skip()` | do not read this column `col_time()` | time ] In our case, we would use the `col_double()` function to ensure that `Checking` is read a sa double (that is numeric) column. --- ## Writing data readr also contains functiomns for saving data. These functions parallel the read functions and each save a data frame or tibble in a specific file format. Function | Writes ------------------- | ---------------------------------------- `write_csv()` | Comma separated values `write_excel_csv()` | CSV that you plan to open in Excel `write_delim()` | General delimited files `write_file()` | A single string, written as is `write_lines()` | A vector of strings, one string per line `write_tsv()` | Tab delimited values To use a write function, first give it the name of the data frame to save, then give it a filepath from your wqorking directory to the location where you would like to save the file. This filepath should end in the name of the new file. So we can save the clean `nimbus` data set as a csv in our working directory with ```r write_csv(credit_fixed, path = "bank_credit_clean.csv") ``` --- ## Take Aways The readr package provides efficient functions for reading and saving common flat file data formats. Consider these packages for other types of data: Package | Reads -------- | ----- haven | SPSS, Stata, and SAS files readxl | excel files (.xls, .xlsx) jsonlite | json xml2 | xml httr | web API's rvest | web pages (web scraping) DBI | databases sparklyr | data loaded into spark