Statistics Fundamentals

Statistics is the grammar of science. Whether you're analysing experimental data, building machine-learning models, or simply reading a poll, a handful of core ideas are all you need to make sense of numbers. This article covers the essentials — with interactive simulations to build real intuition.

1. Central Tendency — Where Is the Data?

When you have a dataset, the first question is: what is a typical value? Three measures answer this differently.

Mean — the arithmetic average. Sum all values and divide by $n$ :

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

The mean is sensitive to outliers. One extreme value can pull it far from the bulk of the data.

Median — the middle value when the data is sorted. If $n$ is even, it's the average of the two middle values. The median is robust: a single outlier can't move it much. This is why economists report median household income rather than mean — a handful of billionaires would inflate the mean beyond any useful meaning.

Mode — the most frequently occurring value. Useful for categorical data (e.g., most popular shoe size) and for identifying peaks in a distribution.

Try it: Switch between Symmetric, Right-skewed, and Left-skewed distributions. Notice how the mean is pulled toward the long tail while the median stays closer to the bulk of the data.
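To make the outlier effect concrete, here is a minimal sketch using Python's standard `statistics` module. The income figures are hypothetical values chosen only for illustration; they do not come from any real dataset.

```python
import statistics

# Hypothetical incomes (in thousands); invented for illustration only.
incomes = [32, 38, 41, 41, 45, 52, 58]
# The same data with one extreme value appended.
with_outlier = incomes + [5000]

def summarise(data):
    """Return (mean, median, mode) for a list of numbers."""
    return statistics.mean(data), statistics.median(data), statistics.mode(data)

for label, data in [("without outlier", incomes), ("with outlier", with_outlier)]:
    mean, median, mode = summarise(data)
    print(f"{label}: mean={mean:.2f}, median={median}, mode={mode}")
```

In this sketch the single extreme value moves the mean from about 44 to over 600, while the median only shifts from 41 to 43 and the mode does not change.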

2. Standard Deviation — How Spread Out Is the Data?

The mean tells you where the data sits. The standard deviation ( $\sigma$ ) tells you how spread out it is around that centre.

$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

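The formula first computes the average squared distance from the mean (the variance, $\sigma^2$ ), then takes the square root to return to the original units. Dividing by $n$ gives the population standard deviation; when the spread is estimated from a sample, the usual convention divides by $n - 1$ instead.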
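The formula can be followed step by step in a few lines of Python. This is a sketch on a small made-up dataset; `statistics.pstdev` is the standard-library function that divides by $n$, and `statistics.stdev` is the variant that divides by $n - 1$.

```python
import math
import statistics

# Small made-up dataset; its mean is exactly 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(data) / len(data)
# Variance: the average squared distance from the mean (dividing by n).
variance = sum((x - mean) ** 2 for x in data) / len(data)
# Standard deviation: square root of the variance, back in the original units.
sigma = math.sqrt(variance)

print(mean, variance, sigma)
print(statistics.pstdev(data))  # library version of the same formula (divides by n)
print(statistics.stdev(data))   # sample version (divides by n - 1), slightly larger
```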

A small $\sigma$ means values cluster tightly around the mean. A large $\sigma$ means they're spread wide. For a normal distribution, increasing $\sigma$ flattens and widens the bell curve, but the total area underneath always stays exactly 1, because the curve is a probability density.

Try it: Set σ₁ = 1 and σ₂ = 4. The wider curve has the same centre and the same total area — it just distributes probability over a larger range.
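The claim that both curves enclose the same area can be checked numerically. The sketch below integrates the normal density with a simple midpoint rule; the function names, integration range (±10σ), and step count are illustrative choices, not part of the simulation.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def area_under_curve(mu, sigma, half_width=10.0, steps=20000):
    """Midpoint-rule integral of the density over mu ± half_width * sigma."""
    lo = mu - half_width * sigma
    dx = 2 * half_width * sigma / steps
    return sum(normal_pdf(lo + (i + 0.5) * dx, mu, sigma) for i in range(steps)) * dx

for sigma in (1, 4):
    print(f"sigma={sigma}: peak={normal_pdf(0, 0, sigma):.4f}, "
          f"area={area_under_curve(0, sigma):.6f}")
```

The peak for σ = 4 is a quarter of the peak for σ = 1, which is exactly what is needed to keep the area fixed while the curve becomes four times wider.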

3. The Empirical Rule — 68–95–99.7

For a normal distribution (the symmetric bell curve), the proportion of data within each σ band follows a remarkably clean pattern:

Range	Approx.	To two decimals
$\mu \pm 1\sigma$	~68%	68.27%
$\mu \pm 2\sigma$	~95%	95.45%
$\mu \pm 3\sigma$	~99.7%	99.73%
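These percentages follow from the normal cumulative distribution: for a standard normal variable, the probability of landing within $k$ standard deviations of the mean is $\operatorname{erf}(k/\sqrt{2})$. A short sketch using only the standard library (the function name is an illustrative choice):

```python
import math

def within_k_sigma(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"mu ± {k} sigma: {100 * within_k_sigma(k):.2f}%")
```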

This is sometimes called the 68-95-99.7 rule or the three-sigma rule. It shows up everywhere:

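In manufacturing, "six sigma" quality means no more than 3.4 defects per million opportunities. That figure assumes the process mean may drift by 1.5σ, leaving a one-sided 4.5σ tail; a perfectly centred ±6σ band would exclude only about 2 values per billion.
In particle physics, a "5σ discovery threshold" means that, if there were no real effect, a fluctuation at least this large would occur with probability less than 1 in 3.5 million (one-sided).
In finance, a "2σ event" in daily asset returns would be expected roughly 1 in 20 trading days if returns were normally distributed.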
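The tail probabilities behind these three figures can be reproduced with the complementary error function. This is a sketch under the assumption of an exactly normal distribution; the 4.5σ line reflects the conventional Six Sigma allowance for a 1.5σ drift of the process mean, and the function names are illustrative.

```python
import math

def upper_tail(k):
    """P(Z > k) for a standard normal Z (one-sided tail)."""
    return 0.5 * math.erfc(k / math.sqrt(2))

def two_sided_tail(k):
    """P(|Z| > k) for a standard normal Z (both tails)."""
    return math.erfc(k / math.sqrt(2))

# Six Sigma: one-sided tail beyond 4.5 sigma (6 sigma minus a 1.5 sigma drift).
print(f"defects per million: {upper_tail(4.5) * 1e6:.2f}")
# Particle physics: one-sided 5 sigma tail, expressed as '1 in N'.
print(f"5 sigma: 1 in {1 / upper_tail(5):,.0f}")
# Finance: two-sided 2 sigma tail, expressed as '1 in N' days.
print(f"2 sigma: 1 in {1 / two_sided_tail(2):.1f}")
```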

Try it: Toggle between 1σ, 2σ, and 3σ ranges. The annotation shows the exact percentage of the distribution contained in each band.

4. Z-Scores — Standardising Any Value

A Z-score converts any measurement into a universal scale: how many standard deviations from the mean is this value?

$z = \frac{x - \mu}{\sigma}$

This lets you compare values measured on completely different scales. A student scoring 72 on a test with $\mu = 65$ , $\sigma = 5$ has $z = (72 - 65)/5 = 1.4$ , which is moderately above average. A blood pressure reading can be standardised the same way, using the mean and standard deviation of its own reference population, so the two results can be compared directly.
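As a minimal sketch, the test-score example translates directly into code. The blood pressure numbers below are hypothetical placeholders, not clinical reference values; they are there only to show two different scales landing on the same z axis.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations by which x lies above (+) or below (-) mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Test score from the text: x = 72, mu = 65, sigma = 5.
exam_z = z_score(72, 65, 5)

# Hypothetical blood pressure reading with made-up mu and sigma (illustration only).
pressure_z = z_score(138, 120, 15)

print(f"exam z = {exam_z:.2f}, blood pressure z = {pressure_z:.2f}")
```

Under these assumed numbers the exam result (z = 1.4) is slightly further from its mean than the reading (z = 1.2), even though the raw values are not comparable.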

The colour indicator in the simulation follows a common convention:

$|z| < 1$ : typical, most values land here (about 68% of a normal distribution)
$1 \leq |z| < 2$ : moderate
$2 \leq |z| < 3$ : unusual, only ~5% of values lie at $|z| \geq 2$
$|z| \geq 3$ : extreme, less than 0.3% probability under a normal distribution
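The convention above amounts to checking thresholds from the largest down, so that each value receives exactly one label. A sketch of that logic (the function name and label strings are illustrative, not taken from the simulation's source):

```python
def classify(z):
    """Label a z-score using the thresholds 1, 2, and 3 on |z|."""
    magnitude = abs(z)
    if magnitude >= 3:
        return "extreme"
    if magnitude >= 2:
        return "unusual"
    if magnitude >= 1:
        return "moderate"
    return "typical"

for z in (0.3, 1.4, -2.5, 3.2):
    print(z, classify(z))
```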

Try it: Move the value $x$ while keeping μ and σ fixed. Watch the z-score and classification update in real time.

5. Confidence Intervals — Estimating the Unknown

In practice, you rarely know the true population mean $\mu$ . You draw a sample and compute its mean $\bar{x}$ . The question is: how close is $\bar{x}$ to $\mu$ ?

A confidence interval gives a plausible range for the true parameter. When the population standard deviation $\sigma$ is known, the interval for the mean is:

$\text{CI} = \bar{x} \pm z^* \cdot \frac{\sigma}{\sqrt{n}}$

where:

$\sigma / \sqrt{n}$ is the standard error (SE) — the standard deviation of the sample mean
$z^*$ is the critical value: 1.645 for 90%, 1.96 for 95%, 2.576 for 99% confidence
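Putting the pieces together, here is a minimal sketch of the interval calculation. The inputs (sample mean 100, σ = 15, n = 30) are illustrative values, and `z_confidence_interval` is a hypothetical helper name; the critical value comes from `statistics.NormalDist().inv_cdf` in the standard library.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided z interval for the mean, assuming sigma is known."""
    # Critical value z*: leaves (1 - confidence) / 2 in each tail.
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    standard_error = sigma / math.sqrt(n)
    margin = z_star * standard_error
    return sample_mean - margin, sample_mean + margin

low, high = z_confidence_interval(100, 15, 30)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```

With these inputs the standard error is about 2.74 and the margin of error about 5.37.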

A common misconception: "95% confidence" does not mean there's a 95% chance the true mean falls in this particular interval. The true mean either is or isn't in the interval — it's fixed. What it means is: if you repeated the experiment many times and built a CI each time, 95% of those intervals would contain the true mean.
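The repeated-experiment reading can be simulated directly. The sketch below draws many samples from a normal population with a known mean, builds a 95% interval from each, and counts how often the interval contains the true mean. The population parameters, trial count, and seed are arbitrary assumptions.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage(true_mu=0.0, sigma=10.0, n=30, confidence=0.95, trials=4000, seed=1):
    """Fraction of simulated z intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_star * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        sample_mean = fmean(sample)
        # The interval sample_mean ± margin contains true_mu exactly when
        # the sample mean is within one margin of true_mu.
        if abs(sample_mean - true_mu) <= margin:
            hits += 1
    return hits / trials

print(f"observed coverage: {coverage():.3f}")
```

The observed fraction should sit close to 0.95, while any individual interval either contains the true mean or does not.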

Two levers narrow the interval:

Increase $n$ (more data) — SE shrinks as $1/\sqrt{n}$
Decrease confidence level — a 90% CI is narrower than a 99% CI, but less reliable
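Both levers can be seen by tabulating the margin of error $z^* \cdot \sigma/\sqrt{n}$. In this sketch σ = 15 is an assumed value and `margin_of_error` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def margin_of_error(sigma, n, confidence):
    """Half-width of a two-sided z interval for the mean with known sigma."""
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z_star * sigma / math.sqrt(n)

sigma = 15  # assumed population standard deviation
for n in (30, 120, 480):
    row = ", ".join(
        f"{int(level * 100)}%: ±{margin_of_error(sigma, n, level):.2f}"
        for level in (0.90, 0.95, 0.99)
    )
    print(f"n={n:>3}  {row}")
```

Because the standard error scales as $1/\sqrt{n}$, each fourfold increase in $n$ halves the margin, so narrowing an interval by a further factor of two costs four times as much data.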

Try it: Drag the sample size slider from 5 to 500. The interval collapses as data accumulates.

Interactive Simulations

1. Central Tendency

Controls: Distribution, Sample Size (500), Add Outlier (200). Readouts: Mean 49.87, Median 49.32, Mode 48.03, Std Dev 10.09.

2. Standard Deviation

Controls: Mean (μ) = 0, Sigma 1 (σ₁) = 1, Sigma 2 (σ₂) = 2.

3. Empirical Rule (68-95-99.7)

Controls: Mean (μ) = 0, Std Dev (σ) = 1, Sigma Range (1σ, 2σ, 3σ).

4. Z-Scores

Controls: Value (x), Mean (μ), Std Dev (σ). Readout: z = (x − μ) / σ = 1.400, classified as Moderate (|z| ≥ 1).

5. Confidence Intervals

Controls: Sample Mean (x̄), Population σ, Sample Size (n) = 30, Confidence Level. Readouts: Std Error 2.739, Margin of Error ±5.368, CI [94.63, 105.37].

6. Playground

Controls: True Mean (μ) = 0, True Std Dev (σ) = 10, Sample Size = 500. Readouts: Sample Mean -0.13, Sample Std 10.09, Sample Median -0.68.
Quick Reference

Concept	Formula	What it tells you
Mean	$\bar{x} = \sum x_i / n$	Arithmetic average
Median	Middle value (sorted)	Robust centre (outlier-resistant)
Std Deviation	$\sigma = \sqrt{\sum(x_i-\bar{x})^2/n}$	Spread in original units
Z-score	$z = (x - \mu)/\sigma$	Standard deviations from the mean
Standard Error	$SE = \sigma/\sqrt{n}$	Uncertainty of the sample mean
95% CI	$\bar{x} \pm 1.96 \cdot SE$	Plausible range for the true mean
