class: center, middle, inverse, title-slide .title[ # BANL 6100: Business Analytics ] .subtitle[ ## Inference about a Population Proportion ] .author[ ### Mehmet Balcilar
mbalcilar@newhaven.edu
] .institute[ ### University of New Haven ] .date[ ### 2023-09-28 (updated: 2024-01-28) ] --- exclude: true --- class: center, middle, sydney-blue <!-- Custom css --> <!-- From xaringancolor --> <div style = "position:fixed; visibility: hidden"> $$ \require{color} \definecolor{purple}{rgb}{0.337254901960784, 0.00392156862745098, 0.643137254901961} \definecolor{navy}{rgb}{0.0509803921568627, 0.23921568627451, 0.337254901960784} \definecolor{ruby}{rgb}{0.603921568627451, 0.145098039215686, 0.0823529411764706} \definecolor{alice}{rgb}{0.0627450980392157, 0.470588235294118, 0.584313725490196} \definecolor{daisy}{rgb}{0.92156862745098, 0.788235294117647, 0.266666666666667} \definecolor{coral}{rgb}{0.949019607843137, 0.427450980392157, 0.129411764705882} \definecolor{kelly}{rgb}{0.509803921568627, 0.576470588235294, 0.337254901960784} \definecolor{jet}{rgb}{0.0745098039215686, 0.0823529411764706, 0.0862745098039216} \definecolor{asher}{rgb}{0.333333333333333, 0.372549019607843, 0.380392156862745} \definecolor{slate}{rgb}{0.192156862745098, 0.309803921568627, 0.309803921568627} \definecolor{cranberry}{rgb}{0.901960784313726, 0.254901960784314, 0.450980392156863} \definecolor{hi}{rgb}{0.984313725490196, 0.12549019607843137, 0.12549019607843137} $$ </div> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { Macros: { purple: ["{\\color{purple}{#1}}", 1], navy: ["{\\color{navy}{#1}}", 1], ruby: ["{\\color{ruby}{#1}}", 1], alice: ["{\\color{alice}{#1}}", 1], daisy: ["{\\color{daisy}{#1}}", 1], coral: ["{\\color{coral}{#1}}", 1], kelly: ["{\\color{kelly}{#1}}", 1], jet: ["{\\color{jet}{#1}}", 1], asher: ["{\\color{asher}{#1}}", 1], slate: ["{\\color{slate}{#1}}", 1], cranberry: ["{\\color{cranberry}{#1}}", 1], hi: ["{\\color{hi}{#1}}", 1] }, loader: {load: ['[tex]/color']}, tex: {packages: {'[+]': ['color']}} } }); </script> <style> .purple {color: #5601A4;} .navy {color: #0D3D56;} .ruby {color: #9A2515;} .alice {color: #107895;} .daisy {color: #EBC944;} .coral
{color: #F26D21;} .kelly {color: #829356;} .jet {color: #131516;} .asher {color: #555F61;} .slate {color: #314F4F;} .cranberry {color: #E64173;} .hi {color: #FB2020;} </style> # Inference about a Population Proportion --- ## Inference about a Population Proportion Previously we discussed making inferences about population *means*. This chapter addresses questions where we are interested in the proportion of an outcome. .bf.alice[Single population proportion] - .ex[Examples:] Proportion of people voting for a candidate; percent of people who are vaccinated; percent of people who support an issue; etc. .bf.alice[Comparing two population proportions] - .ex[Examples:] Is there a difference between the proportion of male students and the proportion of female students who smoke cigarettes? Do Republicans and Democrats differ in their support for policy X? etc. --- ## The Sample Proportion, `\(\hat{p}\)` The statistic that estimates the population proportion, `\(p\)`, is the .hi.coral[sample proportion]: $$ \coral{\hat{p}} = \frac{\text{number of "successes" in the sample}}{n} $$ For example: Say we want to estimate the proportion of adults who smoke.
To estimate this proportion, a researcher contacted 2673 people in a survey, and 170 said they smoke: $$ \coral{\hat{p}} = \frac{170}{2673} = 0.0636 $$ --- ## Sampling Distribution of a Sample Proportion .subheader.alice[Binomial Distribution Review] We can think of binary random variables (which take only two values) as following a Bernoulli distribution: - Assign one outcome 0 and the other outcome 1 - `\(X \sim B(1,p)\)` - This means `\(p\)` is the **unobserved** probability of outcome 1 occurring We use the sample statistic: `$$\coral{\hat{p}} = \frac{\text{number of successes}}{\text{total observations}}$$` - The .hi.kelly[mean] of the sampling distribution is `\(p\)` - The .hi.kelly[standard deviation] of the sampling distribution is `$$\mathit{SE}_p=\sqrt{\frac{p(1-p)}{n}}$$` --- ## Sampling Distribution of a Sample Proportion .subheader.alice[Binomial Distribution Review] Say we draw a simple random sample of size `\(n\)` from a large population that contains a proportion `\(p\)` of successes. Let `\(\coral{\hat{p}}\)` be the .hi.purple[sample proportion] of successes, $$ \coral{\hat{p}} = \frac{\text{number of successes in the sample}}{n} $$ The .hi.purple[Central Limit Theorem] tells us that with a large enough sample size, the standardized value of `\(\coral{\hat{p}}\)` will be approximately normal: $$ \frac{\coral{\hat{p}} - p}{\sqrt{p(1-p)/n}} \sim N(0, 1) $$ - `\(p\)` is the true population proportion - `\(\sqrt{p(1-p)/n}\)` is the true standard deviation of `\(\coral{\hat{p}}\)` --- ## Example: Election A poll by YouGov asked `\(1360\)` voters in Pennsylvania if they were going to vote for Biden or Trump the day before the election. We will code a vote for Biden `\(=1\)`, so the proportion `\(\coral{\hat{p}}\)` is the proportion of people who will vote for Biden. Biden will win Pennsylvania if the population proportion is `\(p > .5\)`. They find that `\(\coral{\hat{p}} = .53\)`. **What is the sampling distribution of** `\(\coral{\mathbf{\hat{p}}}\)`?
### Answer `$$\coral{\hat{p}} \sim N\left(p, \frac{p(1-p)}{n}\right)$$` **What's the probability Biden wins PA?** Using the sampling distribution, what's the probability that `\(p > .50\)`? ```r 1-pnorm(q = .50, mean = 0.53, sd = sqrt(0.50*(1-0.50)/1360)) ``` ``` ## [1] 0.9865405 ``` .footnote.small[Source: https://projects.fivethirtyeight.com/polls/] --- ## Confidence Intervals for a Population Proportion We follow the same path from sampling distribution to confidence interval as we did for `\(\bar{X}\)`. Note that the standard deviation of `\(\coral{\hat{p}}\)` depends on the parameter `\(p\)` -- a value that we don't know. We therefore estimate the standard deviation with the standard error of `\(\coral{\hat{p}}\)`: $$ SE_{\coral{\hat{p}}}=\sqrt{\frac{\coral{\hat{p}}(1-\coral{\hat{p}})}{n}} $$ - Note: unlike inference for a mean, estimating the `\(SE\)` here does **not** lead to a `\(t\)` distribution; for proportions we rely on the normal approximation, which requires a large sample (see the next slide) --- ## Confidence Intervals for a Population Proportion Say we draw a simple random sample of size `\(n\)` from a large population that contains an unknown proportion `\(p\)` of successes. An approximate C% .hi.purple[confidence interval] for p is: $$ \coral{\hat{p}} \pm z_{\frac{1-c}{2}} \sqrt{\frac{\coral{\hat{p}}(1-\coral{\hat{p}})}{n}} $$ **What do we mean by large?** We can only use this confidence interval when the numbers of successes and failures in the sample are both at least 15 (a memory aid: half of 30 each). --- ## Example A poll by YouGov asked 1360 voters in Pennsylvania if they were going to vote for Biden or Trump. We will code a vote for Biden `\(=1\)`, so the proportion `\(\coral{\hat{p}}\)` is the proportion of people who will vote for Biden. Biden will win Pennsylvania if the population proportion is `\(p > .5\)`. They find that `\(\coral{\hat{p}} = .53\)`. What is the sampling distribution of `\(\coral{\hat{p}}\)`?
Check the conditions: - SRS `\(\checkmark\)` - numbers of successes ($1360 \cdot 0.53$) and failures ($1360 \cdot 0.47$) are both larger than `\(15\)` `\(\checkmark\)` So we can go ahead and calculate a 95% confidence interval for the population parameter `\(p\)`... --- ## Clicker Question We are given `\(n = 670\)` and `\(\coral{\hat{p}} = 0.85\)`, and we will use the standard error of the sample proportion, $$ SE_{\coral{\hat{p}}}=\sqrt{\coral{\hat{p}}(1-\coral{\hat{p}})/n} $$ Which of the following is the correct calculation for a 95% confidence interval? <ol type = "a"> <li> \( 0.85 \pm 1.96 \cdot \sqrt{\frac{0.85\cdot 0.15}{670}} \) </li> <li> \( 0.85 \pm 1.645 \cdot \sqrt{\frac{0.85\cdot 0.15}{670}} \) </li> <li> \( 0.85 \pm 1.96 \cdot \frac{0.85\cdot 0.15}{\sqrt{670}} \) </li> <li> \( 571 \pm 1.96 \cdot \sqrt{\frac{571\cdot 99}{670}} \) </li> </ol> --- ## Hypothesis Testing We design a hypothesis test such as: $$ H_0: p = p_0 \ \text{ vs. } \ H_1: p \neq p_0 $$ Or one-sided alternatives, such as: `\(p < p_0\)` or `\(p > p_0\)`. We reject `\(H_0\)` if our p-value is lower than our *level of significance* - p-value: probability of obtaining the sample proportion we have, or a more extreme value, *given* the null hypothesis is true --- ## Test Statistic Draw an SRS of size `\(n\)` from a large population that contains an unknown proportion `\(p\)` of successes. To test the hypothesis `\(H_0: p = p_0\)`, compute the following z-statistic: $$ z=\frac{\coral{\hat{p}}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} $$ Look up this `\(z\)` value in the `\(z\)`-table when the sample size `\(n\)` is so large that both `\(n \cdot p_0\)` and `\(n \cdot (1-p_0)\)` are 15 or more. --- ## Example A survey found that 571 out of 670 (85%) of Americans answered a question on experimental design correctly. Do these data provide convincing evidence that more than 80% of Americans have a good intuition about experimental design?
$$ H_0: p=0.8 \ \text{ vs. } \ H_1: p>0.8 $$ --- ## Example A survey found that 571 out of 670 (85%) of Americans answered a question on experimental design correctly. Do these data provide convincing evidence that more than 80% of Americans have a good intuition about experimental design? $$ H_0: p = 0.8 \ \text{ vs. } \ H_1: p > 0.8 $$ Calculate the p-value: $$ P(\coral{\hat{p}} > 0.85 \ \vert \ p = 0.8) $$ $$ P\left(z > \frac{0.85-0.8}{\sqrt{\frac{0.8 \cdot 0.2}{670}}}\right) = P(z > 3.24) = 0.0006 $$ Since `\(p\)`-value `\(= 0.0006 < \alpha = 0.05\)`, reject `\(H_0\)`. --- ## Practice On Nov. 1st, the New York Times and Siena College released a poll for Wisconsin with `\(n = 1253\)` and the sample proportion of people supporting Biden was `\(\coral{\hat{p}} = 0.52\)`. On election day, we learned the population proportion supporting Biden was `\(p = 0.495\)`. Would we have rejected the following hypothesis at the `\(\alpha = 0.05\)` significance level? $$ H_0: p = 0.495 $$ $$ H_1: p > 0.495 $$ --- ## Election Polling and Simple Random Sample Why might we reject a true null hypothesis? Often because the sample is not a true simple random sample: - .hi.purple[Undercoverage]: when some groups in the population are left out of the process of choosing the sample - .hi.purple[Oversampling]: when some groups are sampled more often than others in a way that is not representative of the population - .hi.purple[Nonresponse]: when an individual chosen for the sample can't be contacted or refuses to participate - .hi.purple[Response Bias]: a systematic pattern of incorrect responses in a sample survey - .hi.purple[Wording Effect]: a systematic pattern of responses due to poor (or manipulated) wording of survey questions --- class: center, middle, sydney-blue # Comparing Two Proportions --- ## Notation We will use notation similar to that used in our study of two-sample t-statistics.
| Population | Pop. Proportion | Sample Size | Sample Proportion |
|:----------:|:---------------:|:-----------:|:-----------------:|
| 1          | \( p_1 \)       | \( n_1 \)   | \( \hat{p}_1 \)   |
| 2          | \( p_2 \)       | \( n_2 \)   | \( \hat{p}_2 \)   |
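As a quick numerical illustration of this notation (the counts here are the ones used in the worked confidence-interval example later in this deck), the two sample proportions can be computed in R:

```r
# Counts from the worked example later in the deck
x1 <- 75;  n1 <- 100   # successes and sample size, population 1
x2 <- 56;  n2 <- 100   # successes and sample size, population 2

p_hat1 <- x1 / n1      # sample proportion for population 1
p_hat2 <- x2 / n2      # sample proportion for population 2
c(p_hat1, p_hat2, p_hat1 - p_hat2)
```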
--- ## Sampling Distribution of Sample Proportion Review `\(X \sim B(1, p)\)` is the underlying variable. $$ \hat{p} = \frac{\text{number of successes}}{n} $$ The sampling distribution of `\(\hat{p}\)` when the population proportion is `\(p\)`: $$ \hat{p} \sim N\left( p, \frac{p(1-p)}{n}\right) $$ --- ## Sampling Distribution of a Difference between Proportions To use `\(\hat{p}_1 - \hat{p}_2\)` for inference we use the following information: - When the samples are large, the distribution of `\(\hat{p}_1 - \hat{p}_2\)` is .hi.purple[approximately normal] - The .hi.purple[mean] of the sampling distribution is: `\(p_1 - p_2\)` - Assuming the two populations are independent, the .hi.purple[standard deviation] of the distribution is: $$ \sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}} $$ --- ## Large-Sample Confidence Intervals for Comparing Proportions Using the equation for standard error: $$ \mathit{SE} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} $$ The confidence interval is constructed as: $$ \hat{p}_1 - \hat{p}_2 \pm z^* \mathit{SE}, $$ where `\(z^*\)` is the associated critical value. --- ## Example Construct a 95% confidence interval for the following difference in proportions:
| Population | No. Successes | Sample Size | Sample Proportion       |
|:----------:|:-------------:|:-----------:|:-----------------------:|
| 1          | 75            | 100         | \( \hat{p}_1 \) = 0.75  |
| 2          | 56            | 100         | \( \hat{p}_2 \) = 0.56  |
$$ \mathit{SE}=\sqrt{\frac{(0.75)(0.25)}{100}+\frac{(0.56)(0.44)}{100}}=0.0659 $$ Confidence interval = `\((0.75-0.56) \pm (1.96)(0.0659) \implies [0.06, 0.32]\)` --- ## Significance Tests for Comparing Proportions $$ H_0: p_1 - p_2 = 0 $$ $$ H_1: p_1 - p_2 \neq 0 $$ In order to test the hypothesis, we must first calculate the .hi.purple[pooled sample proportion] $$ \hat{p}=\frac{\text{number of successes in both samples combined}}{\text{number of individuals in both samples combined}} $$ Then we use the following z-statistic: $$ z=\frac{\hat{p}_1-\hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} $$ --- ## Example
| Population | No. Successes | Sample Size | Sample Proportion       |
|:----------:|:-------------:|:-----------:|:-----------------------:|
| 1          | 212           | 616         | \( \hat{p}_1 \) = 0.344 |
| 2          | 7             | 49          | \( \hat{p}_2 \) = 0.143 |
- Calculate `\(\hat{p}\)` $$ \hat{p} = \frac{212+7}{616+49} = 0.329 $$ - Calculate the `\(z\)`-statistic $$ z= \frac{0.344-0.143}{\sqrt{(0.329)(0.671) \left(\frac{1}{616}+\frac{1}{49}\right)} } = 2.88 $$ --- ## Example .subheader.alice[Continued] The z-statistic was 2.88, and we have a two-tailed alternative hypothesis. Therefore: $$ \text{p-value } = 2\cdot P(z>2.88) = 2\cdot 0.002 = 0.004 $$ Therefore we reject the null at `\(\alpha = 0.05\)`, since the p-value `\(< \alpha\)`.
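The hand calculation above can be reproduced in R. This is a minimal sketch of the pooled-proportion z-statistic as given on the slides (with no continuity correction):

```r
# Pooled two-proportion z-test, mirroring the worked example above
x1 <- 212; n1 <- 616   # successes and sample size, population 1
x2 <- 7;   n2 <- 49    # successes and sample size, population 2

p_hat1 <- x1 / n1                     # sample proportion 1, about 0.344
p_hat2 <- x2 / n2                     # sample proportion 2, about 0.143
p_pool <- (x1 + x2) / (n1 + n2)       # pooled proportion, about 0.329

se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z  <- (p_hat1 - p_hat2) / se          # z-statistic, about 2.89
p_value <- 2 * (1 - pnorm(z))         # two-sided p-value, about 0.004
round(c(z = z, p_value = p_value), 4)
```

Base R's `prop.test(c(212, 7), c(616, 49))` runs an equivalent test, though it reports a chi-square statistic and applies a continuity correction by default.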