class: center, middle, inverse, title-slide .title[ # BANL 6100: Business Analytics ] .subtitle[ ## Confidence Intervals ] .author[ ### Mehmet Balcilar
mbalcilar@newhaven.edu
] .institute[ ### University of New Haven ] .date[ ### 2023-09-28 (updated: 2023-11-03) ] --- exclude: true --- class: center, middle, sydney-blue <!-- Custom css --> <!-- From xaringancolor --> <div style = "position:fixed; visibility: hidden"> $$ \require{color} \definecolor{purple}{rgb}{0.337254901960784, 0.00392156862745098, 0.643137254901961} \definecolor{navy}{rgb}{0.0509803921568627, 0.23921568627451, 0.337254901960784} \definecolor{ruby}{rgb}{0.603921568627451, 0.145098039215686, 0.0823529411764706} \definecolor{alice}{rgb}{0.0627450980392157, 0.470588235294118, 0.584313725490196} \definecolor{daisy}{rgb}{0.92156862745098, 0.788235294117647, 0.266666666666667} \definecolor{coral}{rgb}{0.949019607843137, 0.427450980392157, 0.129411764705882} \definecolor{kelly}{rgb}{0.509803921568627, 0.576470588235294, 0.337254901960784} \definecolor{jet}{rgb}{0.0745098039215686, 0.0823529411764706, 0.0862745098039216} \definecolor{asher}{rgb}{0.333333333333333, 0.372549019607843, 0.380392156862745} \definecolor{slate}{rgb}{0.192156862745098, 0.309803921568627, 0.309803921568627} \definecolor{cranberry}{rgb}{0.901960784313726, 0.254901960784314, 0.450980392156863} $$ </div> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { Macros: { purple: ["{\\color{purple}{#1}}", 1], navy: ["{\\color{navy}{#1}}", 1], ruby: ["{\\color{ruby}{#1}}", 1], alice: ["{\\color{alice}{#1}}", 1], daisy: ["{\\color{daisy}{#1}}", 1], coral: ["{\\color{coral}{#1}}", 1], kelly: ["{\\color{kelly}{#1}}", 1], jet: ["{\\color{jet}{#1}}", 1], asher: ["{\\color{asher}{#1}}", 1], slate: ["{\\color{slate}{#1}}", 1], cranberry: ["{\\color{cranberry}{#1}}", 1] }, loader: {load: ['[tex]/color']}, tex: {packages: {'[+]': ['color']}} } }); </script> <style> .purple {color: #5601A4;} .navy {color: #0D3D56;} .ruby {color: #9A2515;} .alice {color: #107895;} .daisy {color: #EBC944;} .coral {color: #F26D21;} .kelly {color: #829356;} .jet {color: #131516;} .asher {color: #555F61;} .slate {color: #314F4F;} 
.cranberry {color: #E64173;} </style> # Sampling & Inference --- --- ## Statistical Inference: Recap Recall that we're interested in estimating some unknown .cranberry[population] parameter `\(\theta\)` using the .kelly[sample] `\(x_1, \dots, x_n\)` We can use some estimator `\(\hat{\theta}\)` - We can find its bias and its variance - Can say what it converges to using the Law of Large Numbers and the Central Limit Theorem However, we don't have `\(n \to \infty\)`. We have a .hi[finite] sample. - Therefore, we want to construct some belief about how good our estimator is. - For example, if we have a sample mean based on 5 individuals, our sampling distribution has a large variance. We want to report that. --- ## Point and Interval Estimators A .hi[point] estimator draws inferences about a population by estimating the value of an unknown parameter using a .hi-slate[single] value or point. -- In the first part of the course, we spent some time using such estimators. - E.g., *sample mean, sample variance, sample standard deviation, sample median*... -- These estimators, however, are .hi-green[fragile]. -- Recall that *P(X = x) = 0* for continuous random variables! -- Furthermore, how does .red[*varying the sample size*] reflect how good or bad a point estimator is? --- ## Point and Interval Estimators .hi[Interval] estimators draw inferences about a population by estimating the value of an unknown parameter using an *interval*, commonly called a .hi[confidence interval.] -- Representing a parameter of interest through an interval is .red[*better suited*] to settings where we don't observe the whole population. -- And the sample size (*n*) .hi[does matter] here! --- --- ## Building Confidence Intervals As we've already studied, the .hi[fundamental] parameters of the Normal distribution are the population .hi-blue[mean] and .hi-blue[standard deviation]. 
-- $$ `\begin{aligned} X \sim \mathcal{N}(\mu, \sigma^2) \end{aligned}` $$ -- <br> <img src="10-Confidence-Intervals_files/figure-html/unnamed-chunk-1-1.svg" style="display: block; margin: auto;" /> --- ## Building Confidence Intervals Given that the Normal distribution is .hi-slate[symmetric] about its .hi[mean], one very useful transformation we can apply to a normally distributed random variable is .b[standardization]. -- This simply means transforming a variable such that it follows a .hi[Standard Normal] distribution. -- - The Standard Normal distribution has a mean of 0 and a standard deviation (and variance) of 1. - `\(Z \sim \mathcal{N}(0, 1)\)` -- <br> To .hi-blue[standardize] a random variable, we apply the following formula: $$ `\begin{aligned} z = \dfrac{x - \mu}{\sigma} \end{aligned}` $$ --- ## Building Confidence Intervals A quick example: <br> Suppose the daily demand for gasoline at a station is a normally distributed random variable with a mean of 1,000 gallons and a standard deviation of 100 gallons. The owner opened this station today and noted that there are exactly 1,100 gallons in storage. The next delivery will only happen tomorrow. She would like to know the probability that she will have enough regular gasoline to satisfy today’s demand. --- ## Building Confidence Intervals Visually: <img src="10-Confidence-Intervals_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" /> --- ## Building Confidence Intervals And the area we are interested in is: <img src="10-Confidence-Intervals_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" /> --- ## Building Confidence Intervals Let us standardize our random variable *X*: $$ `\begin{aligned} z = \dfrac{x - \mu}{\sigma} = \dfrac{1{,}100 - 1{,}000}{100} = 1.00 \end{aligned}` $$ `\(z\)` is called the .hi[z-score]. <br> -- Now, the random variable *z* follows a Standard Normal distribution! -- And we are able to ask *P(z < 1.00)*, instead of *P(X < 1,100)*. 
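As a quick check, the standardization above can be reproduced in R (a small sketch; the variable names are ours):

```r
# z-score for the gasoline example
x     <- 1100   # gallons in storage
mu    <- 1000   # mean daily demand (gallons)
sigma <- 100    # standard deviation of daily demand (gallons)

(x - mu) / sigma
```

```
## [1] 1
```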
--- ## Building Confidence Intervals Visually: <img src="10-Confidence-Intervals_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" /> --- ## Building Confidence Intervals The values of *z* specify the .hi[location] of the corresponding values of *X*, our random variable. -- A value of *z = 1.00* corresponds to a value of *X* that is .hi-blue[1.00 standard deviation above the mean]. -- Another *advantage* of standardizing a random variable to a *z* value is that it automatically centers the population mean (*μ*) at .hi-blue[zero]. -- So what is this probability?

```r
pnorm(q = 1, mean = 0, sd = 1)
```

```
## [1] 0.8413447
```

-- Which is the same as

```r
pnorm(q = 1100, mean = 1000, sd = 100)
```

```
## [1] 0.8413447
```

--- ## Building Confidence Intervals Now, we are ready to build a .hi[confidence interval] for a .hi-blue[population mean] of interest. -- Applying the standardization process to a sample mean `\(\bar{x}\)`, we have $$ `\begin{aligned} z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}} \end{aligned}` $$ -- <br> And we want to get here: $$ `\begin{aligned} P(\bar{x} - z_{\alpha/2}\sigma/\sqrt{n} < \mu < \bar{x} + z_{\alpha/2}\sigma/\sqrt{n}) = 1 - \alpha \end{aligned}` $$ --- ## Building Confidence Intervals Again: $$ `\begin{aligned} P(\bar{x} - z_{\alpha/2}\sigma/\sqrt{n} < \mu < \bar{x} + z_{\alpha/2}\sigma/\sqrt{n}) = 1 - \alpha \end{aligned}` $$ The **left-hand side** is a probability statement: it gives the probability that the **population mean** `\((\mu)\)` lies between two values built from the **sample mean**: a lower value, **shifted** downwards by the **critical value** `\(z_{\alpha/2}\)` multiplied by the **sampling distribution**'s standard deviation, i.e., the **standard error** `\((\sigma/\sqrt{n})\)`; and an upper value, **shifted** upwards by the same amount. 
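This repeated-sampling property can be illustrated with a short simulation in R (a sketch; the population parameters, sample size, and number of replications are arbitrary choices of ours):

```r
set.seed(42)                       # for reproducibility
mu    <- 1000; sigma <- 100        # known population parameters
n     <- 25                        # sample size
alpha <- 0.05
z     <- qnorm(1 - alpha / 2)      # critical value, about 1.96

covered <- replicate(10000, {
  x_bar <- mean(rnorm(n, mean = mu, sd = sigma))  # one sample mean
  lcl   <- x_bar - z * sigma / sqrt(n)            # lower confidence limit
  ucl   <- x_bar + z * sigma / sqrt(n)            # upper confidence limit
  lcl < mu & mu < ucl                             # did the interval capture mu?
})
mean(covered)                      # proportion of intervals containing mu
```

The proportion printed at the end should come out close to `\(1 - \alpha = 0.95\)`.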
In other words, the above says that, with **repeated sampling** from this population, the proportion of values of `\(\bar{X}\)` for which the interval `\([\bar{x} - z_{\alpha/2}\sigma/\sqrt{n}; \bar{x} + z_{\alpha/2}\sigma/\sqrt{n}]\)` **includes** the population mean `\(\mu\)` is **equal to** `\(1-\alpha\)`. --- ## Building Confidence Intervals This form of probability statement is called the confidence interval estimator of `\(\mu\)`. -- The lower endpoint of the interval is known as the .hi[lower confidence limit] (LCL), while the upper endpoint is the .hi[upper confidence limit] (UCL). -- <br> The right-hand side, `\(1-\alpha\)`, is the .hi-blue[confidence level] assumed for the confidence interval. -- - The latter is usually pre-specified, and represents the probability that the interval includes the actual value of `\(\mu\)`. --- ## Building Confidence Intervals An example: A computer store manager would like to estimate, with 95% confidence, the average inventory level. She also knows that the population standard deviation is 75 computers. She has used a random sample of 25 periods, calculating a sample mean of 370.16 computers. -- What is the .hi[area] we are interested in? -- <img src="10-Confidence-Intervals_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" /> --- To visualize with `\(\bar{x}=370.16\)` and `\(\sigma=75\)`: <img src="10-Confidence-Intervals_files/figure-html/68-95-99-1.png" width="75%" style="display: block; margin: auto;" /> --- ## Building Confidence Intervals An example: Note: `\(1-\alpha=0.95\)` implies `\(\alpha=0.05\)` Thus, `\(\dfrac{\alpha}{2}=0.025\)`. 
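In R, the critical value can be obtained directly with `qnorm()`, with no need for a table (a small sketch):

```r
# z such that P(Z < z) = 0.975, i.e. the upper 2.5% point of the standard Normal
qnorm(0.975)
```

```
## [1] 1.959964
```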
So, `\(z_{\alpha/2} \approx 2\)` (Indeed: `\(z_{\alpha/2} = 1.96\)`) Confidence interval: $$ [370.16 - 2\times75/\sqrt{25};\ \ 370.16 + 2\times75/\sqrt{25}] \implies [370.16 - 30;\ \ 370.16 + 30] \implies [340.16;\ \ 400.16] $$ --- ## Margins of Error In that example, the confidence interval was `\(370.16 \pm 30\)`. In general, a confidence interval takes the form `$$\kelly{\text{estimate}} \pm \daisy{\text{margin of error}}$$` where the .hi.daisy[margin of error] shows how much variability there is in our estimate. --- ## Margins of Error For a given level of confidence, `\(c\)` (say 95%), the .daisy[margin of error] for our sample mean is: `$$\daisy{k} = \kelly{z_{\frac{1-c}{2}}}\frac{\sigma}{\sqrt{n}}$$` <br> - Let's say `\(c = 95\%\)`. We want to capture the middle 95%, so we leave `\(\frac{1-.95}{2} = 2.5\%\)` in each tail. - `\(z_{0.025} = 1.96\)`, which is where the `\(\sim 2\)` standard deviations comes from. - 90% Confidence Interval: `\(\implies z_{\frac{1-c}{2}} = z_{.05} = 1.645\)` standard deviations. --- ## Example Let's determine the .it[exact] margin of error for the previous example: `$$k=z_{\frac{1-c}{2}} \times \frac{\sigma}{\sqrt{n}}$$` If we are calculating a 95% confidence interval, where `\(\bar{x}=370.16\)` and `\(\sigma = 75\)`, then `$$k=z_{0.025} \times \frac{75}{\sqrt{25}}$$` We find `\(z_{0.025}\)` using the table. `\(z_{0.025}\)` is the z-score such that `\(P(Z>z_{0.025})=0.025\)` - Look up 0.025 (or 0.975!) and find the corresponding z-score --- ## Example Using the z-table, we find that `\(z_{0.025} = 1.96\)`. This means: `$$k = 1.96\times \frac{75}{\sqrt{25}} = 29.4$$` So our .it[exact] 95% confidence interval is: `$$[340.76;\ \ 399.56]$$` --- exclude: true