“Never trust summary statistics alone; always visualize your data”

Written on 8 November 2020, 12:46pm


Recently, Alberto Cairo created the Datasaurus dataset, which urges people to “never trust summary statistics alone; always visualize your data”: while the data has normal-seeming statistics, plotting it reveals a picture of a dinosaur. The Datasaurus Dozen, a set of 13 datasets (the Datasaurus plus 12 others), takes this further: each dataset has the same summary statistics (x/y mean, x/y standard deviation, and Pearson’s correlation) to two decimal places, while being drastically different in appearance.

https://www.autodesk.com/research/publications/same-stats-different-graphs
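To make the idea concrete, here is a minimal sketch that computes the five summary statistics for two tiny datasets. The points are toy values I made up for illustration (they are not the Datasaurus data), and using the sample standard deviation (dividing by n - 1) is my assumption. Both datasets pair the same x and y values differently, yet the statistics agree.

```python
import math

def summary_stats(xs, ys):
    """Return (mean x, mean y, SD x, SD y, Pearson r), rounded to 2 decimals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sxx / (n - 1))  # sample SD (assumption)
    sd_y = math.sqrt(syy / (n - 1))
    r = sxy / math.sqrt(sxx * syy)
    return tuple(round(v, 2) for v in (mean_x, mean_y, sd_x, sd_y, r))

# Toy data (hypothetical): same x values, y values paired differently.
xs = [1, 2, 3, 4]
ys_a = [1, 3, 2, 4]
ys_b = [1, 2, 4, 3]

print(summary_stats(xs, ys_a))  # (2.5, 2.5, 1.29, 1.29, 0.8)
print(summary_stats(xs, ys_b))  # (2.5, 2.5, 1.29, 1.29, 0.8)
```

The two sets of points are different, but nothing in the printed numbers says so. Only a plot would show it.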

Below I try to understand the highlighted concepts.

Mean and deviation

The concepts of population, sample, mean, variance and standard deviation (SD):

https://web.mit.edu/~csvoss/Public/usabo/stats_handout.pdf

Same mean but different variance and SD:

https://en.wikipedia.org/wiki/Variance

Examples of means and SDs from mathsisfun.com:

A population of dogs with shoulder heights of 600mm, 470mm, 170mm, 430mm and 300mm. The mean (average height) is 394mm.

[Figure: dog heights on a graph, with the standard deviation marked]

The variance (the average of the squared differences from the mean) is 21704mm², and the standard deviation (its square root) is about 147mm.
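The numbers above can be checked with Python's standard `statistics` module. This is just a quick sketch; the variable names are mine.

```python
import statistics

# Shoulder heights of the five dogs, in mm (treated as the whole population).
heights_mm = [600, 470, 170, 430, 300]

mean = statistics.mean(heights_mm)                      # 394
population_variance = statistics.pvariance(heights_mm)  # 21704 (mm squared)
population_sd = statistics.pstdev(heights_mm)           # about 147.3 mm

print(mean, population_variance, round(population_sd, 1))
```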

Population vs sample

However, if the data is just a sample of the population, the variance and standard deviation change, because the sum of squared differences is divided by N - 1 instead of N (see the cheat sheet for the difference between sigma and s).
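As a sketch of the difference, the same five dog heights can be treated as a sample instead of a population. The `statistics` module has separate functions for the two cases; the variable names are mine.

```python
import statistics

# The same five heights, now treated as a sample of a larger population.
heights_mm = [600, 470, 170, 430, 300]

# Sample versions divide the sum of squared differences by N - 1.
sample_variance = statistics.variance(heights_mm)  # 108520 / 4 = 27130
sample_sd = statistics.stdev(heights_mm)           # about 164.7 mm

print(sample_variance, round(sample_sd, 1))
```

Both values come out larger than the population versions (21704 and about 147), since the divisor is smaller.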

The 68–95–99.7 rule

For the normal distribution (Gauss/bell curve), approx. 68% of the values are within 1 standard deviation from the mean, 95% are within 2 SDs and 99.7% are within 3 SDs:

https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule

Examples:

https://www.mathsisfun.com/data/standard-normal-distribution.html
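The three percentages can be reproduced from the normal distribution itself. For a normal variable, the fraction of values within k standard deviations of the mean is erf(k / √2). A small sketch, with a helper name of my own:

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(fraction_within(k) * 100, 1))  # 68.3, 95.4, 99.7
```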

The correlation coefficient (or Pearson’s correlation)

https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/
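For reference, this is the usual formula for Pearson's correlation of n paired values, written in my own notation, where x̄ and ȳ are the means of x and y:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The result always lies between -1 and 1: values near 1 mean y tends to rise with x along a line, values near -1 mean it tends to fall, and values near 0 mean there is no linear relationship.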

Coefficient of determination

The square of the correlation coefficient is the coefficient of determination: R² or R-squared. (This holds for a simple linear regression with one predictor.)

R² is a statistic that gives some information about how well a model fits a set of observations. In regression, it measures how well the regression predictions approximate the real data points. An R² of 1 indicates that the regression predictions fit the data perfectly.

Simply put, R-squared indicates the proportion of the variance in the dependent variable that is explained by the regression, not the percentage of data points falling on the regression line.

https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/
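Here is a minimal sketch of how R-squared can be computed for a straight-line fit: one minus the ratio of the residual sum of squares to the total sum of squares. The function names and the five data points are hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys):
    """R-squared of the least-squares line: 1 - SS_res / SS_tot."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(round(r_squared(xs, ys), 2))  # 0.6
```

For this data the fitted line explains about 60% of the variance in y, even though none of the five points lies exactly on the line.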
