The normal distribution is arguably the most widely used distribution in statistics: many statistical methods rely on the assumption that your data is normally distributed. What is so special about it?

The infamous bell

The normal distribution is characterized by its trademark bell-shaped curve.

The shape of the bell curve is dictated by two parameters. The first is the mean, denoted μ, which determines where the peak of the distribution is. Because the mean marks the center of the distribution, a large share of the data points have values close to it.

The other parameter is the variance, denoted σ². The variance dictates how wide the distribution is: larger variances lead to wider curves, and smaller variances to narrower ones.

This is because with a larger variance, the data points are more dispersed, so a wider curve is needed to cover them; a narrower curve suffices for data that are less dispersed.
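The effect of the variance on dispersion can be seen empirically. A minimal sketch using NumPy (the specific means, scales, and thresholds below are illustrative choices, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from two normal distributions with the same mean (0)
# but different variances: sigma^2 = 1 versus sigma^2 = 9.
narrow = rng.normal(loc=0.0, scale=1.0, size=100_000)
wide = rng.normal(loc=0.0, scale=3.0, size=100_000)

# The sample with the larger variance is more dispersed:
# a much larger fraction of its points fall outside [-2, 2].
frac_narrow = np.mean(np.abs(narrow) > 2)
frac_wide = np.mean(np.abs(wide) > 2)
print(frac_narrow, frac_wide)
```

With these parameters, only about 5% of the narrow sample falls outside [−2, 2], versus roughly half of the wide sample.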

But why?

Why is this bell-shaped curve so famous? Many real-world measurements approximately follow a normal distribution:

  • Height measurements
  • Weight measurements
  • IQ measurements
  • Standardized exam scores
  • Measurement errors (usually)

Another reason normal distributions are so widely used is convenience: they are easy to work with compared to other distributions.

A bonus is that many statistical methods with underlying normality assumptions are robust when sample sizes are large. This means that for sufficiently large samples, slight departures from normality won’t gravely affect the power of your statistical tool.
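This large-sample robustness is closely tied to the central limit theorem: averages of even clearly non-normal data look approximately normal. A quick sketch (the exponential distribution, sample sizes, and skewness comparison here are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Raw data from a heavily right-skewed (exponential) distribution.
raw = rng.exponential(scale=1.0, size=500_000)

# Means of many samples of size 50 drawn from that same distribution.
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# The raw data are strongly skewed, but the sample means are far
# closer to symmetric -- i.e., far closer to normal.
skew_raw = stats.skew(raw)
skew_means = stats.skew(sample_means)
print(skew_raw, skew_means)
```

The raw data have skewness near 2, while the distribution of sample means is much closer to the 0 skewness of a normal distribution.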

Nevertheless, the assumption of normality can easily be misused. Some researchers assume normality even when their data are clearly not normal, which can lead to false findings and conclusions.
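One safeguard against blindly assuming normality is to test for it. A minimal sketch using the Shapiro–Wilk test from SciPy (the simulated datasets and seed are illustrative, not from the original text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

normal_data = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_data = rng.exponential(scale=1.0, size=500)

# Shapiro-Wilk tests the null hypothesis that the data came from
# a normal distribution; a small p-value is evidence against it.
_, p_normal = stats.shapiro(normal_data)
_, p_skewed = stats.shapiro(skewed_data)
print(p_normal, p_skewed)
```

For the skewed sample the p-value is essentially zero, so normality would be rejected; visual checks such as Q–Q plots are a common complement to such tests.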

As with the other statistical tools and assumptions you will encounter as an analyst, you need to be fully aware of the pros and cons of using the assumption of normality. This is to make sure you extract the correct judgment from the data you have collected.
