BANL 6100: Business Analytics

.title[
# BANL 6100: Business Analytics
]
.subtitle[
## Statistical Inference
]
.author[
### Mehmet Balcilar </br> <a href="mailto:mbalcilar@newhaven.edu" class="email">mbalcilar@newhaven.edu</a>
]
.institute[
### Univeristy of New Haven
]
.date[
### 2023-09-28 (updated: 2023-10-31)
]

---

---
class: center, middle, sydney-blue

<div style = "position:fixed; visibility: hidden">
$$
\require{color}
\definecolor{purple}{rgb}{0.337254901960784, 0.00392156862745098, 0.643137254901961}
\definecolor{navy}{rgb}{0.0509803921568627, 0.23921568627451, 0.337254901960784}
\definecolor{ruby}{rgb}{0.603921568627451, 0.145098039215686, 0.0823529411764706}
\definecolor{alice}{rgb}{0.0627450980392157, 0.470588235294118, 0.584313725490196}
\definecolor{daisy}{rgb}{0.92156862745098, 0.788235294117647, 0.266666666666667}
\definecolor{coral}{rgb}{0.949019607843137, 0.427450980392157, 0.129411764705882}
\definecolor{kelly}{rgb}{0.509803921568627, 0.576470588235294, 0.337254901960784}
\definecolor{jet}{rgb}{0.0745098039215686, 0.0823529411764706, 0.0862745098039216}
\definecolor{asher}{rgb}{0.333333333333333, 0.372549019607843, 0.380392156862745}
\definecolor{slate}{rgb}{0.192156862745098, 0.309803921568627, 0.309803921568627}
\definecolor{cranberry}{rgb}{0.901960784313726, 0.254901960784314, 0.450980392156863}
$$
</div>
	
<script type="text/x-mathjax-config">
	MathJax.Hub.Config({
		TeX: {
			Macros: {
				purple: ["{\\color{purple}{#1}}", 1],
				navy: ["{\\color{navy}{#1}}", 1],
				ruby: ["{\\color{ruby}{#1}}", 1],
				alice: ["{\\color{alice}{#1}}", 1],
				daisy: ["{\\color{daisy}{#1}}", 1],
				coral: ["{\\color{coral}{#1}}", 1],
				kelly: ["{\\color{kelly}{#1}}", 1],
				jet: ["{\\color{jet}{#1}}", 1],
				asher: ["{\\color{asher}{#1}}", 1],
				slate: ["{\\color{slate}{#1}}", 1],
				cranberry: ["{\\color{cranberry}{#1}}", 1]
			},
			loader: {load: ['[tex]/color']},
			tex: {packages: {'[+]': ['color']}}
		}
	});
</script>

# Sampling & Inference

---

## Motivation &mdash; I

### Research Question

- Suppose you are moving to Ames, Iowa and are considering buying a home.

- How would you know whether or not the house you are considering is averagely priced, significantly more expensive, or significantly less expensive?

### Investigation

- It is possible to gather data selling prices of similar homes in the area.

- From this data, which we call a **reference distribution**, it can be determined whether or not the price of a particular home is on par with similar homes in the neighborhood.

- Say, for example, the price of a new home for sale in Ames, Iowa is $610,000.

---

## Motivation &mdash; II

### Investigation

- Using historical data, we can compare this price against a reference distribution. This is illustrated in the code chunk below using the `ames` data frame.

```r
# install.packages(AmesHousing) # uncomment if AmesHousing is not installed 
library(AmesHousing)
Sale_Price <- AmesHousing::make_ames()$Sale_Price
(610000 - mean(Sale_Price)) / sd(Sale_Price)  # compute a z-score
```

```
## [1] 5.372659
```

- So a house costing $610,000 is more than five standard deviations beyond the mean of all the houses sold between the years 2006 and 2010

- Of course, a more fair comparison would only involve houses with similar features (e.g., a fireplace, finished basement, same neighborhood, etc.).

---

## The Frequentist Approach &mdash; I

- The most common methods in statistical inference are based on the frequentist approach to probability.

- Many of the common statistical tests, like the one-sample `$t$`-test, follow the same paradigm: 
  -- compute a test statistic associated with the population attribute of interest (e.g., the mean)
  -- determine it's **sampling distribution**, and 
  -- use the sampling distribution to compute a `$p$`-value, construct a **confidence interval**, etc.

---

## The Frequentist Approach &mdash; II

>The sampling distribution of a statistic (e.g., a test statistic), based on a sample of size `$n$`, is the distribution obtained after taking every possible sample of size `$n$` from the population of interest and computing the sample statistic for each.

<div class="figure" style="text-align: center">
<img src="images/lecture9/sampling-distribution.png" alt="Frequentist approach to sampling and sampling distributions." width="500px" />
<p class="caption">Frequentist approach to sampling and sampling distributions.</p>
</div>

---

## Sampling Distributions

<br>

There are .hi[2] ways to approach sampling distributions.

The .b[first] is to repeatedly draw .hi[samples of the same size] (*n*) from a .hi-blue[population] of interest (*N*), and calculate the statistic of interest.

- However, it is almost impossible to access data for an entire population.
  
--

The .b[second] is to use the laws of .hi[Expected Value and Variance].

- More feasible!

---
# Greek Letters and Statistics

- Latin letters like `$\bar{x}$` and `$s^2$` are calculations that represent guesses (estimates) at the population values.
]
.pull-right[
.hi.purple[Greek Letters]

- Greek letters like `$\mu$` and `$\sigma^2$` represent the truth about the population.
]

The goal of the researcher is for the latin letters to be good guesses for the greek letters:

$$
	\kelly{\text{Data}} \longrightarrow \kelly{\text{Calculation}} \longrightarrow \kelly{\text{Estimates}} \longrightarrow^{hopefully!} \purple{\text{Truth}}
$$

For example, 
$$
	\kelly{X} \longrightarrow \kelly{\frac{1}{n} \sum_{i=1}^n x_i} \longrightarrow \kelly{\bar{x}} \longrightarrow^{hopefullly!} \purple{\mu}
$$

---

## Sampling Distributions

Let us demonstrate the first approach, using the [`AmesHousing`](http://jse.amstat.org/v19n3/decock.pdf) data set.

- It includes data on .hi[all] residential home sales in Ames, Iowa, between 2006 and 2010.
  
  - Thus, these data may serve as a .hi-slate[populational] reference.
  
--

<br>

```r
library(AmesHousing)   ## where the data come from
library(janitor)       ## package for data cleaning.

ames <- ames_raw       ## picking one of the package's data sets.

ames <- ames %>% 
  clean_names()        ## using 'janitor' to clean the column names.
```

---

## Sampling Distributions

```r
ames %>% 
  select(gr_liv_area) %>% 
  head(6)      ## above ground living area (in square feet).
```

```
## # A tibble: 6 × 1
##   gr_liv_area
##         <int>
## 1        1656
## 2         896
## 3        1329
## 4        2110
## 5        1629
## 6        1604
```

---

## Sampling Distributions

---

## Sampling Distributions

Since we have the whole .hi-slate[population] data, we can compute population parameters, such as *&mu;*, *&sigma;<sup>2</sup>*, and *&sigma;*:

```r
ames %>% 
  summarize(pop_mean = mean(gr_liv_area),
            pop_variance = var(gr_liv_area),
            pop_sd = sd(gr_liv_area))
```

```
## # A tibble: 1 × 3
##   pop_mean pop_variance pop_sd
##      <dbl>        <dbl>  <dbl>
## 1    1500.      255539.   506.
```

<br>

Now, let us repeatedly draw .hi[samples of the same size] from this population, and see how the value of *&mu;* and *&sigma;<sup>2</sup>* behave.

---

## Sampling Distributions

```r
area <- ames %>% 
  pull(gr_liv_area)  ## pulling the values for the variable of interest.
```

<br>

```r
# A "for" loop:

sample_means50 <- rep(NA, 5000)  ## creating an empty vector of 5000 values.

for(i in 1:5000){                ## starting the loop (5,000 iterations).
  
  s50 <- sample(area, 50)        ## drawing samples of size n = 50
  
  sample_means50[i] <- mean(s50) ## filling the empty values with the sample means.
  
}
```

---

## Sampling Distributions

---

## Sampling Distributions

Now, instead of samples of size *n = 50*, what about *n = 500*?

<br>

```r
sample_means500 <- rep(NA, 5000)

for(i in 1:5000){   
  
  s500 <- sample(area, 500)
  
  sample_means500[i] <- mean(s500)
  
}
```

---

## Sampling Distributions

---

## Sampling Distributions

Now, the two together...

---

## Sampling Distributions

<br>

Having access to the whole population, we may draw samples of the same size and .hi[repeatedly] compute .red[*sample statistics*] from these samples.

And as the sample size .hi-blue[increases], the .red[*variance*] (and standard deviation) is reduced.

- More precision!
  
--

<br>

But when we do not have the luxury of accessing the whole population, we may appeal to the laws of .red[*Expected Value and Variance*].

---

## Sampling Distributions

The .hi[variance] of the sampling distribution of the sample mean will be the variance of *X*, divided by the sample size, *n*.

$$
`\begin{aligned}
\sigma^2_{\bar{x}} = \dfrac{\sigma^2}{n}
\end{aligned}`
$$

Not surprisingly, the .hi-slate[standard deviation] will be

$$
`\begin{aligned}
\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}
\end{aligned}`
$$

Moreover, as the .hi-blue[sample size increases], that is, the sampling distribution of `$\bar{x}$` becomes .red[*increasingly bell-shaped*].

In other words, its bell curve becomes .hi-green[narrower] as the sample size is increased.

---

## Sampling Distributions & Sample Size

The variance of the .it.coral[sampling distribution] depends on the sample size. As $n$ gets larger, each individual .hi[trial] gives a better guess at the mean. Hence, the .coral[sampling distribution] gets more narrow

---
## Sampling Distributions & Sample Size

---
## Sampling Distributions

We will only observe 1 sample in the world though.

How does the concept of .coral[sampling distribution] help us?

- Since we don't know the true population parameter, Our .ruby[sample statistic] will be our best guess at the possible true value.

- If we know the .coral[sampling distribution], then we can consider uncertainty about our .ruby[sample statistic].

---
## Law of Large Numbers

Is `$\bar{x}$` actually a good guess for `$\mu$`? Under certain conditions, we can use the .hi.purple[Law of Large Numbers (LLN)] to guarantee that `$\bar{x}$` approaches `$\mu$` as the sample size grows large.

Let `$x_1,x_2,...,x_n$` be an i.i.d. set of observations with `$E(x_i) = \mu$`.

Define the sample mean of size `$n$` as `$\bar{x}_n = \frac{1}{n}\sum_{i = 1}^{n}x_i$`. Then

$$ \bar{x}_n \to \mu \quad \text{as} \quad n \to \infty. $$

Intuitively, as we observe a larger and larger sample, we average over randomness and our sample mean approaches the true population mean.

if the same experiment or study is repeated independently a large number of times, the average of the results of the trials must be close to the expected value. The result becomes closer to the expected value as the number of trials is increased.

---
## Law of Large Numbers

---
## Law of Large Numbers

---
## Central Limit Theorem

If the number of observation, `$n$`, per sample is large (we will discuss this more later), then the distribution of `$X_i$` doesn't matter. We will always have

$$
	\bar{x}_n \sim N(\mu, \frac{\sigma^2}{n}).
$$

> The sampling distribution of the mean of a random sample drawn from any population is .hi-slate[approximately Normal] for a sufficiently large sample size. The larger the sample size, the more closely the sampling distribution of `$\bar{X}$` will resemble a Normal distribution.

<br>

In many practical situations, a sample size of .hi[30] may be sufficiently large to allow us to use the Normal distribution as an approximation for the sampling distribution of `$\bar{X}$`.

---