class: center, middle, inverse, title-slide .title[ # BANL 6100: Business Analytics ] .subtitle[ ## Bivariate Descriptive Techniques ] .author[ ### Mehmet Balcilar
mbalcilar@newhaven.edu
] .institute[ ### University of New Haven ] .date[ ### 2023-09-28 (updated: 2024-01-28) ] --- exclude: true --- class: center, middle, sydney-blue <!-- Custom css --> <!-- From xaringancolor --> <div style = "position:fixed; visibility: hidden"> $$ \require{color} \definecolor{purple}{rgb}{0.337254901960784, 0.00392156862745098, 0.643137254901961} \definecolor{navy}{rgb}{0.0509803921568627, 0.23921568627451, 0.337254901960784} \definecolor{ruby}{rgb}{0.603921568627451, 0.145098039215686, 0.0823529411764706} \definecolor{alice}{rgb}{0.0627450980392157, 0.470588235294118, 0.584313725490196} \definecolor{daisy}{rgb}{0.92156862745098, 0.788235294117647, 0.266666666666667} \definecolor{coral}{rgb}{0.949019607843137, 0.427450980392157, 0.129411764705882} \definecolor{kelly}{rgb}{0.509803921568627, 0.576470588235294, 0.337254901960784} \definecolor{jet}{rgb}{0.0745098039215686, 0.0823529411764706, 0.0862745098039216} \definecolor{asher}{rgb}{0.333333333333333, 0.372549019607843, 0.380392156862745} \definecolor{slate}{rgb}{0.192156862745098, 0.309803921568627, 0.309803921568627} \definecolor{cranberry}{rgb}{0.901960784313726, 0.254901960784314, 0.450980392156863} \definecolor{hi}{rgb}{0.984313725490196, 0.12549019607843137, 0.12549019607843137} $$ </div> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { Macros: { purple: ["{\\color{purple}{#1}}", 1], navy: ["{\\color{navy}{#1}}", 1], ruby: ["{\\color{ruby}{#1}}", 1], alice: ["{\\color{alice}{#1}}", 1], daisy: ["{\\color{daisy}{#1}}", 1], coral: ["{\\color{coral}{#1}}", 1], kelly: ["{\\color{kelly}{#1}}", 1], jet: ["{\\color{jet}{#1}}", 1], asher: ["{\\color{asher}{#1}}", 1], slate: ["{\\color{slate}{#1}}", 1], cranberry: ["{\\color{cranberry}{#1}}", 1], hi: ["{\\color{hi}{#1}}", 1] }, loader: {load: ['[tex]/color']}, tex: {packages: {'[+]': ['color']}} } }); </script> <style> .purple {color: #5601A4;} .navy {color: #0D3D56;} .ruby {color: #9A2515;} .alice {color: #107895;} .daisy {color: #EBC944;} .coral 
{color: #F26D21;} .kelly {color: #829356;} .jet {color: #131516;} .asher {color: #555F61;} .slate {color: #314F4F;} .cranberry {color: #E64173;} .hi {color: #FB2020;} </style> # Bivariate Descriptive Techniques --- ## Motivation ### The road so far So far, our descriptive measures (e.g., *mean, median, variance, standard deviation*) have served our purposes well when describing a .hi[single] variable. These measures are also known as .hi-blue[univariate] descriptive techniques. Whenever our goal is to describe a possible .red[*relationship/association*] between two variables, we need additional descriptive techniques. These are known as .hi-slate[bivariate] descriptive measures. We will study the three main techniques: - *Covariance*; - *Correlation*; - The *coefficient of determination*. --- ## Bivariate Descriptive Techniques <img src="15-Correlation_files/figure-html/unnamed-chunk-1-1.svg" style="display: block; margin: auto;" /> --- ## Bivariate Descriptive Techniques <img src="15-Correlation_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" /> --- ## Interpreting a Scatterplot Look for patterns, and for deviations from those patterns - Direction, form, strength of relationship - Any outliers? Describing the association - .hi.kelly[Positive Association]: above-average values of one tend to accompany above-average values of the other, and below-average values also tend to occur together - .hi.ruby[Negative Association]: above-average values of one tend to accompany below-average values of the other, and vice versa In general, if one variable is explanatory (influences change) and one is a response variable (outcome), the explanatory variable is plotted on the x-axis --- ## Bivariate Descriptive Techniques <img src="15-Correlation_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" /> --- ## Covariance Let us start with the .hi[covariance]. 
The covariance gives two pieces of information about the .red[*association*] between two variables (say, *x* and *y*): the .hi-blue[direction] and the .hi-blue[strength] of this relationship. <br> .pull-left[ - .hi-blue[Population covariance] (σ<sub>xy</sub>): $$ `\begin{aligned} \sigma_{xy} = \dfrac{\displaystyle\sum_{i=1}^{N}(x_{i}-\mu_{x})(y_{i}-\mu_{y})}{N} \end{aligned}` $$ ] .pull-right[ - .hi-blue[Sample covariance] (*s*<sub>xy</sub>): $$ `\begin{aligned} s_{xy} = \dfrac{\displaystyle\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{n-1} \end{aligned}` $$ ] --- ## Covariance <br><br><br> An .hi[alternative] formula for the *sample covariance*: $$ `\begin{aligned} s_{xy} = \dfrac{1}{n-1}\Bigg[\displaystyle\sum_{i=1}^{n}x_{i}y_{i} - \dfrac{\displaystyle\sum_{i=1}^{n}x_{i}\displaystyle\sum_{i=1}^{n}y_{i}}{n} \Bigg] \end{aligned}` $$ --- ## Covariance ```r library(dplyr) # provides %>%, as_tibble(), filter(), and select(). smoke <- wooldridge::smoke # data from the "wooldridge" package. smoke <- smoke %>% as_tibble() # transforming it into a tibble. smoke_filtered <- smoke %>% filter(cigs > 0) # what is this piece of code doing? 
smoke_filtered %>% select(cigs, cigpric, educ, age) %>% head() ``` ``` ## # A tibble: 6 × 4 ## cigs cigpric educ age ## <int> <dbl> <dbl> <int> ## 1 3 57.7 12 58 ## 2 10 57.9 13.5 27 ## 3 20 60.3 12 24 ## 4 30 57.9 10 71 ## 5 20 60.1 12 29 ## 6 30 60.7 12 34 ``` --- ## Covariance Data from [`Mullahy (1997)`](https://direct.mit.edu/rest/article-abstract/79/4/586/57029/Instrumental-Variable-Estimation-of-Count-Data): ```r smoke_filtered %>% summarize(covariance_cigpric_cigs = cov(cigpric, cigs)) ``` ``` ## # A tibble: 1 × 1 ## covariance_cigpric_cigs ## <dbl> *## 1 1.75 ``` ```r smoke_filtered %>% summarize(covariance_educ_cigs = cov(educ, cigs)) ``` ``` ## # A tibble: 1 × 1 ## covariance_educ_cigs ## <dbl> *## 1 5.43 ``` --- ## Covariance <img src="15-Correlation_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" /> --- ## Covariance <img src="15-Correlation_files/figure-html/unnamed-chunk-8-1.svg" style="display: block; margin: auto;" /> --- ## Correlation Now, to the .hi[correlation coefficient]. The coefficient of correlation is .red[*more specific*] than the covariance: it measures the .hi-slate[linear relationship] between *x* and *y*. Therefore, if the shape of a *scatter diagram* does not suggest a .hi[linear] relationship between the two variables, the correlation may not be the best measure. <br> .pull-left[ - .hi-blue[Population correlation] (ρ): $$ `\begin{aligned} \rho = \dfrac{\sigma_{xy}}{\sigma_{x}\sigma_{y}} \end{aligned}` $$ ] .pull-right[ - .hi-blue[Sample correlation] (*r*): $$ `\begin{aligned} r = \dfrac{s_{xy}}{s_{x}s_{y}} \end{aligned}` $$ ] .hi[*r*] is also called .hi-blue[Pearson's correlation coefficient]. --- ## Correlation The correlation coefficient is the covariance between *x* and *y* divided by the product of their standard deviations. -- One .hi-blue[advantage] of this coefficient relative to the covariance is that it always lies between .b[-1] and .b[+1]. 
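Both points can be checked by hand. A minimal base-R sketch (the data below are made up for illustration; they are not the `smoke` data) verifies that *r* equals *s*<sub>xy</sub>/(*s*<sub>x</sub>*s*<sub>y</sub>) and shows that rescaling a variable changes the covariance but not the correlation:

```r
# Made-up illustration data (not the smoke data set)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Sample correlation from its definition: r = s_xy / (s_x * s_y)
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual                     # 0.8, identical to cor(x, y)

# Rescaling x (say, dollars to cents) scales the covariance by 100...
cov(100 * x, y) / cov(x, y)  # 100
# ...but leaves the correlation, and hence its [-1, +1] bounds, unchanged
cor(100 * x, y)              # 0.8
```

Because the standard deviations in the denominator absorb the units of *x* and *y*, the correlation is unit-free, which is exactly why it is easier to compare across variable pairs than the covariance.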
-- <br> - *r = -1* ⇒ *negative*, perfect linear relationship between *x* and *y* - *r = +1* ⇒ *positive*, perfect linear relationship between *x* and *y* - *r = 0* ⇒ *no* linear relationship between *x* and *y* -- <br> - .hi-blue[correlation] **does not imply** .hi-blue[causation] -- .hi[Drawbacks of correlation] - Only measures .hi-blue[linear relationships] (we will see what this means) - Zero correlation does not necessarily mean the variables are independent - Not resistant to outliers --- ## Correlation Data from [`Mullahy (1997)`](https://direct.mit.edu/rest/article-abstract/79/4/586/57029/Instrumental-Variable-Estimation-of-Count-Data): ```r smoke_filtered %>% summarize(correlation_cigpric_cigs = cor(cigpric, cigs)) ``` ``` ## # A tibble: 1 × 1 ## correlation_cigpric_cigs ## <dbl> *## 1 0.0271 ``` ```r smoke_filtered %>% summarize(correlation_educ_cigs = cor(educ, cigs)) ``` ``` ## # A tibble: 1 × 1 ## correlation_educ_cigs ## <dbl> *## 1 0.156 ``` --- ## Correlations - Visualized <img src="images/lecture13/correlations.jpg" width="75%" style="display: block; margin: auto;" /> --- ## Why Correlation isn't Perfect <img src="images/lecture13/corr.png" width="90%" style="display: block; margin: auto;" /> The bottom row shows examples of non-linear relationships --- ## Why Correlation isn't Perfect <img src="images/lecture13/dino.gif" width="90%" style="display: block; margin: auto;" /> --- ## Coefficient of Determination Lastly, the .hi[coefficient of determination]. -- It is more widely known as the *R*<sup>2</sup> *coefficient* or *R-squared*. -- Unlike the correlation coefficient, whose values other than 0, -1, and +1 lack a precise interpretation, the coefficient of determination, *R*<sup>2</sup>, can be .hi[precisely] interpreted. *R-squared* gives the proportion of the variation in *y* explained by the *x*-variables. The range is 0 to 1 (i.e., 0% to 100% of the variation in *y* can be explained by the *x*-variables). 
-- <br> It is obtained by simply .hi-blue[squaring] the correlation coefficient (for either population or sample measures). -- <br> `$$R^2 = r^2 = \left[ \dfrac{s_{xy}}{s_{x}s_{y}} \right]^2$$` --- ## Coefficient of Determination ```r smoke_filtered %>% summarize(R2_cigpric_cigs = cor(cigpric, cigs)^2 * 100) ``` ``` ## # A tibble: 1 × 1 ## R2_cigpric_cigs ## <dbl> *## 1 0.0733 ``` ```r smoke_filtered %>% summarize(R2_educ_cigs = cor(educ, cigs)^2 * 100) ``` ``` ## # A tibble: 1 × 1 ## R2_educ_cigs ## <dbl> *## 1 2.45 ``` --- ## Extra: Statistical Inference for Correlation One may test hypotheses about the population correlation, e.g., $$ H_0: \rho = 0 \qquad \text{vs.} \qquad H_1: \rho \ne 0$$ This can be performed using the `\(t\)`-test. Standard error: If `\(x\)` and `\(y\)` are random variables, the standard error associated with the correlation in the null case is $$ \sigma_{r}={\sqrt {\frac {1-r^{2}}{n-2}}}$$ where `\(r\)` is the correlation and `\(n\)` the sample size. --- ## Testing using Student's t-distribution The sampling distribution of the studentized Pearson correlation coefficient follows Student's t-distribution with `\(n − 2\)` degrees of freedom. Specifically, if the underlying variables have a bivariate normal distribution, the statistic `$$t={\frac {r}{\sigma _{r}}}=r{\sqrt {\frac {n-2}{1-r^{2}}}}$$` has a Student's `\(t\)`-distribution in the .hi[null case (zero correlation)]. --- exclude: true
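---

## Extra: Inference for Correlation in R

As a numerical check on the formulas above, a minimal base-R sketch (the data are simulated; the sample size and slope are made-up illustration values) computes the studentized statistic by hand and compares it with the built-in `cor.test()`:

```r
# Simulated illustration data
set.seed(1)
n <- 30
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

r <- cor(x, y)
# Studentized statistic: t = r * sqrt((n - 2) / (1 - r^2)),
# referred to a t distribution with n - 2 degrees of freedom
t_stat <- r * sqrt((n - 2) / (1 - r^2))
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

# cor.test() carries out the same Pearson t-test
ct <- cor.test(x, y)
all.equal(unname(ct$statistic), t_stat)  # TRUE
all.equal(ct$p.value, p_val)             # TRUE
```

In practice one would simply report `cor.test(x, y)`; the hand computation is shown only to connect the output to the `\(t\)` formula on the previous slide.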