class: center, middle, inverse, title-slide .title[ # BANL 6100: Business Analytics ] .subtitle[ ## Describing Data – II ] .author[ ### Mehmet Balcilar
mbalcilar@newhaven.edu
] .institute[ ### University of New Haven ] .date[ ### 2023-09-28 (updated: 2023-10-25) ] --- class: center, middle, sydney-blue # Skewness & Kurtosis --- ## Introduction - A normal distribution is symmetrical in shape - Real-world data often have asymmetric distributions - Asymmetry in a distribution is measured by skewness - Kurtosis (peakedness) defines whether a distribution is truly **normal** - or whether it may have so-called **fatter** or **thinner** tails --- ## Symmetric Distributions – I - A distribution is symmetric if values at equal distances from the point of symmetry have equal relative frequency or probability. - The point of symmetry of a normal distribution is the mean (which is also the median and the mode!) - The most common symmetric distribution is the normal distribution. - However, there are a number of other distributions that are symmetric. .red[A symmetric distribution has the following property:] `$$\color{blue}{Q_3-Q_2=Q_2-Q_1}$$` where `\(Q_1\)`, `\(Q_2\)`, and `\(Q_3\)` are the 1st, 2nd, and 3rd quartiles. Thus the ratio `\(\color{blue}{(Q_3-Q_2)/(Q_2-Q_1)}\)` can be used as a measure of asymmetry. --- ## Symmetric Distributions – II Have a look at the following histogram: <img src="images/lecture8/retirement.png" width="650"> This distribution meets all of the conditions of being symmetrical. --- ## Skewness - Skewness is the degree of distortion or deviation from the symmetrical normal distribution. - Skewness can be seen as a measure of the lack of symmetry in the data distribution. - Skewness helps you identify extreme values in one of the tails. Symmetrical distributions have a skewness of 0. .pull-left[ #### Positive Skewness - A distribution is **positively (right) skewed** when the tail on the right side of the distribution is longer (also often called "fatter"). - When there is positive skewness, the mean and median are bigger than the mode. 
] .pull-right[ #### Negative Skewness - Distributions are **negatively (left) skewed** when the tail on the left side of the distribution is longer or fatter than the tail on the right side. - When there is negative skewness, the mean and median are smaller than the mode. ] --- ## Types of Skewness <img src = "images/lecture8/skewness.png" width = "1000"> --- ## Fisher-Pearson coefficient of skewness For univariate data `\(x_1, x_2, ..., x_n\)` the formula for skewness is: `$$g_1=\dfrac{\dfrac{1}{n}{\displaystyle\sum^n_{i=1}(x_i-\bar{x})^3}}{s^3}$$` where `\(\bar{x}\)` is the mean, `\(s\)` is the standard deviation, and `\(n\)` is the number of data points. The **Fisher-Pearson coefficient of skewness** is the .red[most commonly] used measure of skewness. --- ## Interpreting the Fisher-Pearson coefficient of skewness The rule of thumb: * A skewness between -0.5 and 0.5 means that the data are pretty symmetrical. * A skewness between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed) means that the data are moderately skewed. * A skewness smaller than -1 (negatively skewed) or bigger than 1 (positively skewed) means that the data are highly skewed. --- ## Pearson Mode Skewness The Pearson mode skewness is used when the sample data exhibit a strong mode. For univariate data `\(x_1, x_2, ..., x_n\)` the formula for Pearson mode skewness is: `$$\mathit{Sk}_1=\dfrac{\bar{x}-m_o}{s}$$` where `\(\bar{x}\)` is the mean, `\(s\)` is the standard deviation, and `\(m_o\)` is the mode of the data points. #### Interpretation: - The direction of skewness is given by the sign. - The coefficient compares the sample distribution with a normal distribution. The larger the absolute value, the more the distribution differs from a normal distribution. - A value of zero means no skewness at all. - A large negative value means the distribution is negatively skewed. - A large positive value means the distribution is positively skewed. 
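---

## Example: Computing Skewness

The Fisher-Pearson coefficient and the Pearson mode skewness can be computed directly from their formulas. A minimal sketch in Python with NumPy; the sample data are invented for illustration:

```python
import numpy as np
from collections import Counter

def fisher_pearson_skewness(x):
    """g1 = (1/n) * sum((x_i - xbar)^3) / s^3, with s the population SD."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

def pearson_mode_skewness(x):
    """Sk1 = (xbar - mode) / s."""
    x = np.asarray(x, dtype=float)
    mode = Counter(x.tolist()).most_common(1)[0][0]  # most frequent value
    return (x.mean() - mode) / x.std()

symmetric = [1, 2, 3, 4, 5]
right_skewed = [2, 2, 2, 3, 3, 4, 5, 9]  # strong mode at 2, long right tail

print(fisher_pearson_skewness(symmetric))     # 0.0: symmetric
print(fisher_pearson_skewness(right_skewed))  # > 1: highly right-skewed
print(pearson_mode_skewness(right_skewed))    # positive: same direction
```

Both statistics agree on the sign of the skew; since `\(g_1\)` for the second data set exceeds 1, the rule of thumb classifies it as highly skewed.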
--- ## Pearson's Second Coefficient (Pearson Median Skewness) Pearson's second coefficient is used when the data include multiple modes or a weak mode. For univariate data `\(x_1, x_2, ..., x_n\)` the formula for Pearson median skewness is: `$$\mathit{Sk}_2=\dfrac{3(\bar{x}-m_d)}{s}$$` where `\(\bar{x}\)` is the mean, `\(s\)` is the standard deviation, and `\(m_d\)` is the median of the data points. It has the same interpretation as the Pearson mode skewness. --- ## Remedies for Skewness <!-- One reason you might check if a distribution is skewed is to verify whether your data is appropriate for a certain statistical procedure. Many statistical procedures assume that variables or residuals are normally distributed. Skew is a common way that a distribution can differ from a normal distribution. --> You generally have three choices if your statistical procedure requires a normal distribution and your data are skewed: .bold[Do nothing.] Many statistical tests, including t tests, ANOVAs, and linear regressions, aren’t very sensitive to skewed data. Especially if the skew is mild or moderate, it may be best to ignore it. .bold[Use a different model.] You may want to choose a model that doesn’t assume a normal distribution. Non-parametric tests or generalized linear models could be more appropriate for your data. .bold[Transform the variable.] Another option is to transform a skewed variable so that it’s less skewed. “Transform” means to apply the same function to all the observations of a variable. 
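---

## Example: A Log Transform Reducing Skew

To make the "transform the variable" remedy concrete, here is a minimal sketch in Python with NumPy. The lognormal "income" sample is simulated purely for illustration; taking the natural log turns a strongly right-skewed variable into a roughly symmetric one:

```python
import numpy as np

def skewness(x):
    """Fisher-Pearson coefficient: mean cubed deviation over s^3."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=0.8, size=5_000)  # strongly right-skewed

print(round(skewness(income), 2))          # well above 1: highly skewed
print(round(skewness(np.log(income)), 2))  # close to 0: roughly symmetric
```

Square root and log-base-10 transforms work the same way, differing only in how aggressively they pull in the long tail.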
--- ## Transformations Based on the Type of Skewness <table> <tbody> <tr> <td><b>Type of skew</b></td> <td><b>Intensity of skew</b></td> <td><b>Transformation</b></td> </tr> <tr> <td rowspan="4"><span style="font-weight: 400;">Right</span></td> <td><span style="font-weight: 400;">Mild</span></td> <td><span style="font-weight: 400;">Do not transform</span></td> </tr> <tr> <td><span style="font-weight: 400;">Moderate</span></td> <td><span style="font-weight: 400;">Square root</span></td> </tr> <tr> <td><span style="font-weight: 400;">Strong</span></td> <td><span style="font-weight: 400;">Natural log</span></td> </tr> <tr> <td><span style="font-weight: 400;">Very strong</span></td> <td><span style="font-weight: 400;">Log base 10</span></td> </tr> <tr> <td rowspan="4"><span style="font-weight: 400;">Left</span></td> <td><span style="font-weight: 400;">Mild</span></td> <td><span style="font-weight: 400;">Do not transform</span></td> </tr> <tr> <td><span style="font-weight: 400;">Moderate</span></td> <td><span style="font-weight: 400;">Reflect* then square root</span></td> </tr> <tr> <td><span style="font-weight: 400;">Strong</span></td> <td><span style="font-weight: 400;">Reflect* then natural log</span></td> </tr> <tr> <td><span style="font-weight: 400;">Very strong</span></td> <td><span style="font-weight: 400;">Reflect* then log base 10</span></td> </tr> </tbody> </table> *In this context, “reflect” means to take the largest observation, `\(x_l\)`, then subtract each observation from `\(x_l + 1\)`. Keep in mind that the reflection reverses the direction of the variable and its relationships with other variables (i.e., positive relationships become negative). --- ## Kurtosis .pull-left[ - Kurtosis deals with the lengths of tails in the distribution. 
- It is a measure of the peakedness (or tailedness) of a distribution relative to the normal distribution. > **Where skewness talks about extreme values in one tail versus the other, kurtosis aims at identifying extreme values in both tails at the same time!** - You can think of kurtosis as a **measure of outliers** present in the distribution. ] .pull-right[ <img src = "images/lecture8/kurtosis.png" width ="550"> The distribution in the image above has relatively more observations around the mean, then a steep decline, and longer tails compared to the normal distribution. ] --- ## Measuring Kurtosis For univariate data `\(x_1, x_2, \dots, x_n\)` the formula for kurtosis is: `$$k=\dfrac{\dfrac{1}{n}{\displaystyle\sum^n_{i=1}(x_i-\bar{x})^4}}{s^4}$$` If there is high kurtosis, you may want to investigate why there are so many outliers. <!-- The presence of outliers could be indications of errors on the one hand, but they could also be some interesting observations that may need to be explored further. For banking transactions, for example, an outlier may signify fraudulent activity. How we deal with outliers mainly depends on the domain. --> Low kurtosis in a data set indicates that the data have light tails and lack outliers. Unusually low kurtosis is also worth investigating, since it may mean the data have been trimmed or truncated. #### Excess kurtosis In practice, **excess kurtosis**, defined as Pearson's kurtosis minus 3, is used to provide a simple comparison to the normal distribution: `$$k_e=k-3=\dfrac{\dfrac{1}{n}{\displaystyle\sum^n_{i=1}(x_i-\bar{x})^4}}{s^4}-3$$` --- ## Types of Kurtosis .pull-left[ <img src = "images/lecture8/mesokurtosis.png" width ="550"> .red[Mesokurtic] `\(\color{red}{(k \approx 3)}\)` A mesokurtic distribution has kurtosis close to that of a normal distribution, i.e., around 3. By this definition, the standard normal distribution has a kurtosis of 3. 
] .pull-right[ .red[Platykurtic] `\(\color{red}{(k < 3)}\)` When a distribution is platykurtic, the distribution is shorter and its tails are thinner than those of the normal distribution. The peak is lower and broader than that of a mesokurtic distribution, which means that the tails are light and that there are fewer outliers than in a normal distribution. .red[Leptokurtic] `\(\color{red}{(k > 3)}\)` A leptokurtic distribution has longer and fatter tails. The peak is higher and sharper than the peak of a normal distribution, which means that the data have heavy tails and that there are more outliers. <!-- Outliers stretch your horizontal axis of the distribution, which means that the majority of the data appear in a narrower vertical range. This is why the leptokurtic distribution looks "skinny". --> ] --- ## Types of Kurtosis <table> <thead> <tr> <th rowspan="2"></th> <th colspan="3" style="text-align: center;">Category</th> </tr> <tr> <th>Mesokurtic </th> <th> Platykurtic </th> <th>Leptokurtic </th> </tr> </thead> <tbody> <tr> <th style="text-align: left;">Tailedness</th> <td>Medium-tailed</td> <td>Thin-tailed</td> <td>Fat-tailed</td> </tr> <tr> <th style="text-align: left;">Outlier frequency</th> <td>Medium</td> <td>Low</td> <td>High</td> </tr> <tr> <th style="text-align: left;">Kurtosis</th> <td>Moderate (3)</td> <td>Low (< 3)</td> <td>High (> 3)</td> </tr> <tr> <th style="text-align: left;">Excess kurtosis</th> <td>0</td> <td>Negative</td> <td>Positive</td> </tr> <tr> <th style="text-align: left;">Example distribution</th> <td>Normal</td> <td>Uniform</td> <td>Laplace</td> </tr> </tbody> </table>
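---

## Example: Computing Kurtosis

The kurtosis formula and the classification in the table can be checked by simulation. A minimal sketch in Python with NumPy (the sample sizes and seed are arbitrary choices):

```python
import numpy as np

def kurtosis(x):
    """k = (1/n) * sum((x_i - xbar)^4) / s^4; excess kurtosis is k - 3."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 4) / x.std() ** 4

rng = np.random.default_rng(0)
samples = {
    "normal (mesokurtic)":   rng.normal(size=100_000),
    "uniform (platykurtic)": rng.uniform(size=100_000),
    "laplace (leptokurtic)": rng.laplace(size=100_000),
}
for name, x in samples.items():
    print(f"{name}: k = {kurtosis(x):.2f}, excess = {kurtosis(x) - 3:.2f}")
```

The normal sample comes out near 3 (excess near 0), the uniform well below 3, and the Laplace well above 3, matching the mesokurtic, platykurtic, and leptokurtic rows of the table.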