Mind on Statistics (6th. Ed) Chapter 2 - Turning Data into Information
by Arpon Sarker
Introduction
The objectives are:
- Distinguish between categorical variables and quantitative variables
- Create visual and numerical summaries for one or two categorical variables
- Create visual displays for one quantitative variable
- Describe a dataset with one quantitative variable using numerical summaries
- Explain how to identify and manage outliers
- Describe how the mean, standard deviation, z-scores, and the Empirical Rule are used to display the possible values in a bell-shaped distribution.
Definitions
Raw Data: term used for numbers and category labels that have been collected but have not yet been processed.
Variable: a characteristic that can differ from one individual to the next
Observational Unit/Observation: a single individual entity in a study
Statistic: A summary measure computed from sample data
Parameter: A summary measure computed from the entire population
Distribution: describes how often the possible responses of a variable occur. This is either a frequency distribution (counts) or relative frequency distribution (percentages)
Percentile: the kth percentile is a number that has k% of the data values at or below it and (100-k)% of the data values at or above it.
Visual Summaries
Categorical Variables
- Pie Charts (single categorical variable)
- Bar Graphs (one or two categorical variables)
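The frequency and relative frequency distributions defined earlier are the numerical summaries behind a pie chart or bar graph. A minimal sketch of that tabulation follows; the function name `frequency_table` and the eye-colour data are made up for illustration and do not come from the textbook.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percent)} for one categorical variable."""
    counts = Counter(values)
    n = len(values)
    return {cat: (count, 100 * count / n) for cat, count in counts.items()}

# Hypothetical raw data (illustrative only)
eye_colour = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "brown"]

table = frequency_table(eye_colour)
for category, (count, percent) in table.items():
    # count = frequency, percent = relative frequency
    print(f"{category}: count={count}, percent={percent:.1f}%")
```

The counts are what a bar graph would plot; the percentages are what a pie chart would show as slices.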
Quantitative Variables
The distribution of a quantitative variable is described by the location, spread and shape of the data.
- Histograms (excellent for judging shape, flexibility in choosing interval widths; with a small sample size the bars may not fill in well enough to show the true shape of the data)
- Stem-and-leaf plots (excellent for sorting data; may be too cluttered with large sample size, restricted in interval widths)
- Dot plots (show all individual data values and are easy to create; too cluttered with large sample size)
- Boxplots (summarise 5-number summary for location and spread, outliers identified; cannot judge shape such as whether data is bimodal/bell-shaped)
Numerical Summaries of Quantitative Variables
Location
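Mean: \(\bar{x}=\frac{\sum{x_i}}{n}\)
Median: the middle value in the ordered dataset (the average of the two middle values when n is even)
If the dataset is skewed, the mean and median generally differ: the mean is pulled toward the long tail and toward outliers, while the median is resistant to them.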
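As a small illustration of how skew separates the two measures of location, the sketch below compares the mean and median of a right-skewed dataset. The income figures are hypothetical and chosen only to include one large value.

```python
import statistics

# Hypothetical incomes in thousands; one large value creates right skew
incomes = [30, 32, 35, 38, 40, 42, 200]

mean_value = statistics.mean(incomes)      # sum of values divided by n
median_value = statistics.median(incomes)  # middle value of the ordered data

print(f"mean = {mean_value:.2f}, median = {median_value}")
```

Here the single large value pulls the mean well above the median, which stays at the middle of the ordered data.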
Spread
Range = high value - low value
Interquartile Range (IQR) = upper quartile $Q_3$ - lower quartile $Q_1$
- $Q_3$: median of values above median (75th percentile)
- $Q_1$: median of values below median (25th percentile)
- Outliers are values that fall below $Q_1-1.5\cdot \textrm{IQR}$ or above $Q_3+1.5\cdot \textrm{IQR}$
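The quartile and outlier rules above can be sketched as follows, using the "median of each half" convention from these notes. The function name `quartile_fences` and the data are illustrative assumptions; statistical software often uses slightly different quartile conventions, so results may differ a little from this sketch.

```python
import statistics

def quartile_fences(values):
    """Quartiles as medians of the lower and upper halves, plus 1.5*IQR fences."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + (n % 2):]  # values above it (middle value skipped if n is odd)
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return q1, q3, iqr, low_fence, high_fence, outliers

# Hypothetical data with one unusually large value
sample = [1, 3, 4, 5, 6, 7, 8, 9, 30]
q1, q3, iqr, low_fence, high_fence, outliers = quartile_fences(sample)
print(q1, q3, iqr, low_fence, high_fence, outliers)
```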
Standard Deviation
How to Handle Outliers
If the outlier is a legitimate data value and represents natural variability for the group and variables measured: do not discard it unless the goal is to study only a partial range of the possible values.
If a mistake was made while taking measurements or entering them into the computer: correct the value if possible and retain it; if it cannot be corrected, discard it.
If the individual in question belongs to a different group than the bulk of individuals measured: consider the reason for studying the data when deciding whether to discard the value or not.
Bell-Shaped Distributions and Standard Deviations
Sample standard deviation: \(s = \sqrt{\frac{\sum{(x_i-\bar{x})^2}}{n-1}}\)
Sample variance: \(s^2=\frac{\sum{(x_i-\bar{x})^2}}{n-1}\)
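A minimal sketch of these two formulas, written out term by term and checked against Python's built-in `statistics.stdev`. The function names and the data are illustrative, not from the textbook.

```python
import math
import statistics

def sample_variance(values):
    """s^2 = sum((x_i - mean)^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """s = square root of the sample variance."""
    return math.sqrt(sample_variance(values))

# Hypothetical data: mean is 5, squared deviations sum to 32
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data), sample_sd(data), statistics.stdev(data))
```

Dividing by n - 1 rather than n is what distinguishes the sample formulas from their population counterparts.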
Empirical Rule:
- About 68% of values fall within 1 standard deviation of the mean in either direction (z-scores between -1 and 1)
- About 95% of values fall within 2 standard deviations of the mean in either direction
- About 99.7% of values fall within 3 standard deviations of the mean in either direction
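The three percentages can be checked numerically under the assumption that "bell-shaped" means exactly normal, which real data only approximate. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `proportion_within` is made up for illustration.

```python
from statistics import NormalDist

# Assumption: model the bell-shaped distribution as a standard normal curve
standard_normal = NormalDist(mu=0, sigma=1)

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * proportion_within(k):.1f}%")
```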
The Empirical Rule implies that, for bell-shaped data, the range from the minimum to the maximum data value spans roughly 4 to 6 standard deviations, so for relatively large samples you can get a rough idea of the value of the standard deviation: \(s \approx \frac{\text{range}}{6}\)
A standardised score or z-score measures how far a value is from the mean in terms of standard deviations: \(z = \frac{x-\bar{x}}{s}\)
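As a short sketch of the z-score calculation, the function below subtracts the mean and divides by the standard deviation. The mean of 100 and standard deviation of 15 are hypothetical values chosen for easy arithmetic.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical bell-shaped variable with mean 100 and standard deviation 15
print(z_score(130, 100, 15))  # two standard deviations above the mean
print(z_score(85, 100, 15))   # one standard deviation below the mean
```

Read against the Empirical Rule, a z-score beyond 2 in either direction marks a value outside the interval that holds about 95% of a bell-shaped distribution.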
tags: mathematics - statistics