class: center, middle, inverse, title-slide .title[ # BANL 6100: Business Analytics ] .subtitle[ ## Bivariate Descriptive Techniques ] .author[ ### Mehmet Balcilar
mbalcilar@newhaven.edu
] .institute[ ### University of New Haven ] .date[ ### 2023-09-28 (updated: 2024-01-28) ] --- exclude: true --- class: center, middle, sydney-blue <!-- Custom css --> <!-- From xaringancolor --> <div style = "position:fixed; visibility: hidden"> $$ \require{color} \definecolor{purple}{rgb}{0.337254901960784, 0.00392156862745098, 0.643137254901961} \definecolor{navy}{rgb}{0.0509803921568627, 0.23921568627451, 0.337254901960784} \definecolor{ruby}{rgb}{0.603921568627451, 0.145098039215686, 0.0823529411764706} \definecolor{alice}{rgb}{0.0627450980392157, 0.470588235294118, 0.584313725490196} \definecolor{daisy}{rgb}{0.92156862745098, 0.788235294117647, 0.266666666666667} \definecolor{coral}{rgb}{0.949019607843137, 0.427450980392157, 0.129411764705882} \definecolor{kelly}{rgb}{0.509803921568627, 0.576470588235294, 0.337254901960784} \definecolor{jet}{rgb}{0.0745098039215686, 0.0823529411764706, 0.0862745098039216} \definecolor{asher}{rgb}{0.333333333333333, 0.372549019607843, 0.380392156862745} \definecolor{slate}{rgb}{0.192156862745098, 0.309803921568627, 0.309803921568627} \definecolor{cranberry}{rgb}{0.901960784313726, 0.254901960784314, 0.450980392156863} \definecolor{hi}{rgb}{0.984313725490196, 0.12549019607843137, 0.12549019607843137} $$ </div> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { Macros: { purple: ["{\\color{purple}{#1}}", 1], navy: ["{\\color{navy}{#1}}", 1], ruby: ["{\\color{ruby}{#1}}", 1], alice: ["{\\color{alice}{#1}}", 1], daisy: ["{\\color{daisy}{#1}}", 1], coral: ["{\\color{coral}{#1}}", 1], kelly: ["{\\color{kelly}{#1}}", 1], jet: ["{\\color{jet}{#1}}", 1], asher: ["{\\color{asher}{#1}}", 1], slate: ["{\\color{slate}{#1}}", 1], cranberry: ["{\\color{cranberry}{#1}}", 1], hi: ["{\\color{hi}{#1}}", 1] }, loader: {load: ['[tex]/color']}, tex: {packages: {'[+]': ['color']}} } }); </script> <style> .purple {color: #5601A4;} .navy {color: #0D3D56;} .ruby {color: #9A2515;} .alice {color: #107895;} .daisy {color: #EBC944;} .coral 
{color: #F26D21;} .kelly {color: #829356;} .jet {color: #131516;} .asher {color: #555F61;} .slate {color: #314F4F;} .cranberry {color: #E64173;} .hi {color: #FB2020;} </style> # Bivariate Descriptive Techniques --- ## Motivation ### The road so far So far, our descriptive measures (e.g., *mean, median, variance, standard deviation*) have served our purposes well when describing a .hi[single] variable. These measures are also known as .hi-blue[univariate] descriptive techniques. Whenever our goal is to describe a possible .red[*relationship/association*] between two variables, we need additional descriptive techniques. These are known as .hi-slate[bivariate] descriptive measures. We will study the three main techniques: - *Covariance*; - *Correlation*; - The *coefficient of determination*. --- ## Bivariate Descriptive Techniques <img src="15-Correlation_files/figure-html/unnamed-chunk-1-1.svg" style="display: block; margin: auto;" /> --- ## Bivariate Descriptive Techniques <img src="15-Correlation_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" /> --- ## Interpreting a Scatterplot Look for patterns, and for deviations from those patterns - Direction, form, strength of relationship - Any outliers? Describing the association - .hi.kelly[Positive Association]: above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together - .hi.ruby[Negative Association]: above-average values of one tend to accompany below-average values of the other, and vice versa In general, if one variable is explanatory (influences change) and one is a response variable (outcome), the explanatory variable is plotted on the x-axis --- ## Bivariate Descriptive Techniques <img src="15-Correlation_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" /> --- ## Covariance Let us start with the .hi[covariance]. 
The covariance gives two pieces of information about the .red[*association*] between two variables (say, *x* and *y*): the .hi-blue[direction] and the .hi-blue[strength] of this relationship. <br> .pull-left[ - .hi-blue[Population covariance] (σ<sub>xy</sub>): $$ `\begin{aligned} \sigma_{xy} = \dfrac{\displaystyle\sum_{i=1}^{N}(x_{i}-\mu_{x})(y_{i}-\mu_{y})}{N} \end{aligned}` $$ ] .pull-right[ - .hi-blue[Sample covariance] (*s*<sub>xy</sub>): $$ `\begin{aligned} s_{xy} = \dfrac{\displaystyle\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1} \end{aligned}` $$ ] --- ## Covariance <br><br><br> An .hi[alternative] formula for the *sample covariance*: $$ `\begin{aligned} s_{xy} = \dfrac{1}{n-1}\Bigg[\displaystyle\sum_{i=1}^{n}x_{i}y_{i} - \dfrac{\displaystyle\sum_{i=1}^{n}x_{i}\displaystyle\sum_{i=1}^{n}y_{i}}{n} \Bigg] \end{aligned}` $$ --- ## Covariance ```r library(dplyr) # provides %>%, as_tibble(), filter(), and select(). smoke <- wooldridge::smoke # data from the "wooldridge" package. smoke <- smoke %>% as_tibble() # transforming it into a tibble. smoke_filtered <- smoke %>% filter(cigs > 0) # what is this piece of code doing? 
smoke_filtered %>% select(cigs, cigpric, educ, age) %>% head() ``` ``` ## # A tibble: 6 × 4 ## cigs cigpric educ age ## <int> <dbl> <dbl> <int> ## 1 3 57.7 12 58 ## 2 10 57.9 13.5 27 ## 3 20 60.3 12 24 ## 4 30 57.9 10 71 ## 5 20 60.1 12 29 ## 6 30 60.7 12 34 ``` --- ## Covariance Data from [`Mullahy (1997)`](https://direct.mit.edu/rest/article-abstract/79/4/586/57029/Instrumental-Variable-Estimation-of-Count-Data): ```r smoke_filtered %>% summarize(covariance_cigpric_cigs = cov(cigpric, cigs)) ``` ``` ## # A tibble: 1 × 1 ## covariance_cigpric_cigs ## <dbl> *## 1 1.75 ``` ```r smoke_filtered %>% summarize(covariance_educ_cigs = cov(educ, cigs)) ``` ``` ## # A tibble: 1 × 1 ## covariance_educ_cigs ## <dbl> *## 1 5.43 ``` --- ## Covariance <img src="15-Correlation_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" /> --- ## Covariance <img src="15-Correlation_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" /> --- ## Correlation Now, to the .hi[correlation coefficient]. The coefficient of correlation is .red[*more specific*] than the covariance: it measures the .hi-slate[linear relationship] between *x* and *y*. Therefore, if the shape of a *scatter diagram* does not suggest a .hi[linear] relationship between the two variables, the correlation may not be the best measure. <br> .pull-left[ - .hi-blue[Population correlation] (ρ): $$ `\begin{aligned} \rho = \dfrac{\sigma_{xy}}{\sigma_{x}\sigma_{y}} \end{aligned}` $$ ] .pull-right[ - .hi-blue[Sample correlation] (*r*): $$ `\begin{aligned} r = \dfrac{s_{xy}}{s_{x}s_{y}} \end{aligned}` $$ ] .hi[*r*] is also called .hi-blue[Pearson's correlation coefficient]. --- ## Correlation The correlation coefficient is the covariance between *x* and *y* divided by the product of their standard deviations. -- One .hi-blue[advantage] of this coefficient relative to the covariance is that it always lies between .b[-1] and .b[+1]. 
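Both points can be checked by hand. A minimal base-R sketch (the data below are made up for illustration; they are not the `smoke` data) verifies that *r* equals *s*<sub>xy</sub>/(*s*<sub>x</sub>*s*<sub>y</sub>) and shows that rescaling a variable changes the covariance but not the correlation:

```r
# Made-up illustration data (not the smoke data set)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Sample correlation from its definition: r = s_xy / (s_x * s_y)
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual                     # 0.8, identical to cor(x, y)

# Rescaling x (say, dollars to cents) scales the covariance by 100...
cov(100 * x, y) / cov(x, y)  # 100
# ...but leaves the correlation, and hence its [-1, +1] bounds, unchanged
cor(100 * x, y)              # 0.8
```

Because the standard deviations in the denominator absorb the units of *x* and *y*, the correlation is unit-free, which is exactly why it is easier to compare across variable pairs than the covariance.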
-- <br> - *r = -1* ⇒ *negative*, perfect linear relationship between *x* and *y* - *r = +1* ⇒ *positive*, perfect linear relationship between *x* and *y* - *r = 0* ⇒ *no* linear relationship between *x* and *y* -- <br> - .hi-blue[correlation] **does not imply** .hi-blue[causation] -- .hi[Drawbacks of correlation] - Only measures .hi-blue[linear relationships] (we will see what this means) - Zero correlation does not necessarily mean the variables are independent - Not resistant to outliers --- ## Correlation Data from [`Mullahy (1997)`](https://direct.mit.edu/rest/article-abstract/79/4/586/57029/Instrumental-Variable-Estimation-of-Count-Data): ```r smoke_filtered %>% summarize(correlation_cigpric_cigs = cor(cigpric, cigs)) ``` ``` ## # A tibble: 1 × 1 ## correlation_cigpric_cigs ## <dbl> *## 1 0.0271 ``` ```r smoke_filtered %>% summarize(correlation_educ_cigs = cor(educ, cigs)) ``` ``` ## # A tibble: 1 × 1 ## correlation_educ_cigs ## <dbl> *## 1 0.156 ``` --- ## Correlations - Visualized <img src="images/lecture13/correlations.jpg" width="75%" style="display: block; margin: auto;" /> --- ## Why Correlation isn't Perfect <img src="images/lecture13/corr.png" width="90%" style="display: block; margin: auto;" /> The bottom row shows examples of non-linear relationships --- ## Why Correlation isn't Perfect <img src="images/lecture13/dino.gif" width="90%" style="display: block; margin: auto;" /> --- ## Coefficient of Determination Lastly, the .hi[coefficient of determination]. -- It is more widely known as the *R*<sup>2</sup> *coefficient* or *R-squared*. -- Unlike the correlation coefficient, whose values other than 0, -1, and +1 lack a precise interpretation, the coefficient of determination, *R*<sup>2</sup>, can be .hi[precisely] interpreted. *R-squared* gives the proportion of the variation in *y* explained by the *x*-variables. The range is 0 to 1 (i.e., 0% to 100% of the variation in *y* can be explained by the *x*-variables). 
-- <br> It is obtained by simply .hi-blue[squaring] the correlation coefficient (for either population or sample measures). -- <br> `$$R^2 = r^2 = \left[ \dfrac{s_{xy}}{s_{x}s_{y}} \right]^2$$` --- ## Coefficient of Determination ```r smoke_filtered %>% summarize(R2_cigpric_cigs = cor(cigpric, cigs)^2 * 100) ``` ``` ## # A tibble: 1 × 1 ## R2_cigpric_cigs ## <dbl> *## 1 0.0733 ``` ```r smoke_filtered %>% summarize(R2_educ_cigs = cor(educ, cigs)^2 * 100) ``` ``` ## # A tibble: 1 × 1 ## R2_educ_cigs ## <dbl> *## 1 2.45 ``` --- ## Extra: Statistical Inference for Correlation One may test hypotheses about the population correlation, e.g., $$ H_0: \rho = 0 \qquad \text{vs.} \qquad H_1: \rho \ne 0$$ This can be performed using the `\(t\)`-test. Standard error: If `\(x\)` and `\(y\)` are random variables, the standard error associated with the correlation in the null case is $$ \sigma_{r}={\sqrt {\frac {1-r^{2}}{n-2}}}$$ where `\(r\)` is the correlation and `\(n\)` the sample size. --- ## Testing using Student's t-distribution The sampling distribution of the studentized Pearson correlation coefficient follows Student's t-distribution with `\(n − 2\)` degrees of freedom. Specifically, if the underlying variables have a bivariate normal distribution, the statistic `$$t={\frac {r}{\sigma _{r}}}=r{\sqrt {\frac {n-2}{1-r^{2}}}}$$` has a Student's `\(t\)`-distribution in the .hi[null case (zero correlation)]. --- exclude: true
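---

## Extra: Inference for Correlation in R

As a numerical check on the formulas above, a minimal base-R sketch (the data are simulated; the sample size and slope are made-up illustration values) computes the studentized statistic by hand and compares it with the built-in `cor.test()`:

```r
# Simulated illustration data
set.seed(1)
n <- 30
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

r <- cor(x, y)
# Studentized statistic: t = r * sqrt((n - 2) / (1 - r^2)),
# referred to a t distribution with n - 2 degrees of freedom
t_stat <- r * sqrt((n - 2) / (1 - r^2))
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

# cor.test() carries out the same Pearson t-test
ct <- cor.test(x, y)
all.equal(unname(ct$statistic), t_stat)  # TRUE
all.equal(ct$p.value, p_val)             # TRUE
```

In practice one would simply report `cor.test(x, y)`; the hand computation is shown only to connect the output to the `\(t\)` formula on the previous slide.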