Representing and Comparing Data Distributions

Choosing the Right Graph

Every time you check a weather app, you are reading graphs chosen specifically for continuous data like temperature or discrete data like the number of rainy days. Selecting the correct representation for univariate data (data involving a single variable) depends entirely on the type of data you have.

Discrete data can only take specific, separate values (e.g., shoe size, number of children). Continuous data can take any value within a range and is usually measured (e.g., time, mass, height).

When graphing these datasets, Edexcel requires specific formats:

Discrete Ungrouped Data: Represented using vertical line charts where the height corresponds to the frequency.
Discrete Grouped Data: Represented using bar charts. You must leave gaps between the bars to indicate distinct, separate values.
Continuous Grouped Data: Represented using histograms (with no gaps between bars) or frequency polygons.

Constructing Polygons and Cumulative Frequency Curves

To draw a frequency polygon, points are plotted at the midpoint of the class interval. The midpoint is found by adding the lower and upper bounds and dividing by two: $\frac{\text{Lower Bound} + \text{Upper Bound}}{2}$.

These points are joined with straight lines to create an "open" polygon. You should not join the first point to the last point, nor to the x-axis, unless the frequency is exactly 0 at that boundary.
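The midpoint rule can be sketched in a few lines of Python. This is only an illustration: the class intervals and frequencies below are made-up values, and `polygon_points` is a hypothetical helper name, not part of any specification.

```python
# Hypothetical grouped data: (lower bound, upper bound, frequency) per class.
classes = [(0, 10, 4), (10, 20, 9), (20, 30, 6), (30, 40, 1)]

def polygon_points(classes):
    """Return the (midpoint, frequency) pair to plot for each class interval."""
    return [((lower + upper) / 2, freq) for lower, upper, freq in classes]

points = polygon_points(classes)
print(points)  # [(5.0, 4), (15.0, 9), (25.0, 6), (35.0, 1)]
```

Each pair is one point on the frequency polygon; joining them in order with straight lines gives the open polygon described above.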

In contrast, cumulative frequency graphs are plotted at the upper boundary of the class interval. The points are joined with a smooth, S-shaped curve (ogive) rather than straight lines.
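As a rough sketch of where the cumulative frequency points come from, the snippet below builds the running total and pairs it with each upper boundary. The data and the helper name `cumulative_points` are assumptions made for illustration.

```python
from itertools import accumulate

# Hypothetical grouped data: (lower bound, upper bound, frequency) per class.
classes = [(0, 10, 4), (10, 20, 9), (20, 30, 6), (30, 40, 1)]

def cumulative_points(classes):
    """Return (upper boundary, cumulative frequency) pairs, one per class."""
    uppers = [upper for _, upper, _ in classes]
    running_totals = accumulate(freq for _, _, freq in classes)
    return list(zip(uppers, running_totals))

cf_points = cumulative_points(classes)
print(cf_points)  # [(10, 4), (20, 13), (30, 19), (40, 20)]
```

The final cumulative frequency (20 here) equals the total frequency, which is a quick check that the running total has been added up correctly.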

Histograms (Higher Tier)

Why do some bar-style graphs have unequal widths? For continuous data with unequal groups, we use histograms where the area of the bar represents the frequency, rather than the height.

The height of the bar in a histogram represents the frequency density, and the vertical axis must be explicitly labelled with this term.

$\text{Frequency Density} = \text{Frequency} \div \text{Class Width}$

Worked Example: Finding Frequency Density

Calculate the frequency density for the class interval $10 \le t < 30$ with a frequency of 12.

Step 1: Calculate the class width.

$\text{Class Width} = 30 - 10 = 20$

Step 2: Substitute into the formula.

$\text{Frequency Density} = 12 \div 20$

Step 3: Calculate the final answer.

$\text{Frequency Density} = 0.6$
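The same calculation can be checked with a short Python sketch. The function name `frequency_density` and the two extra classes are assumptions added for illustration; only the $10 \le t < 30$ class comes from the worked example.

```python
def frequency_density(lower, upper, frequency):
    """Frequency density = frequency divided by class width."""
    class_width = upper - lower
    return frequency / class_width

# The worked example: class 10 <= t < 30 with a frequency of 12.
print(frequency_density(10, 30, 12))  # 0.6

# Hypothetical classes of unequal width: (lower, upper, frequency).
classes = [(0, 10, 5), (10, 30, 12), (30, 35, 6)]
for lower, upper, freq in classes:
    density = frequency_density(lower, upper, freq)
    print(f"{lower} <= t < {upper}: frequency density = {density}")
```

Multiplying a frequency density by its class width gives back the frequency, which is why the area of each bar, not its height, represents the frequency.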

Sketching and Interpreting Box Plots

You can list a hundred data points, but a box plot summarizes the entire spread of the data using just five numbers. This is known as the five-number summary.

To construct a box plot, you need:

Minimum Value
Lower quartile ($Q_1$)
Median ($Q_2$)
Upper quartile ($Q_3$)
Maximum Value

Here is an approximate diagram of a box plot showing these key features:

            Lower Quartile                    Upper Quartile
Minimum          (Q1)           Median             (Q3)          Maximum
   |--------------+---------------|-----------------+---------------|
                  |                                 |
                  +---------------------------------+
                                  IQR

A rectangular box is drawn from the lower quartile to the upper quartile, showing the middle 50% of the data. A vertical line is drawn inside the box at the median, and straight "whiskers" extend from the box outwards to the minimum and maximum values.

If you are reading these values from a cumulative frequency curve with a total frequency of $n$, the median is read at $0.5n$, the lower quartile at $0.25n$, and the upper quartile at $0.75n$. For a list of discrete data values, the median is instead the value at position $\frac{n+1}{2}$ in the ordered list.
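The readings at $0.25n$, $0.5n$ and $0.75n$ can be imitated in Python. This sketch uses straight-line interpolation between the plotted points as a stand-in for reading off a hand-drawn smooth curve, so its answers will only approximate a reading from a real graph. The data and the helper name `read_value` are assumptions.

```python
# Hypothetical cumulative frequency points: (upper boundary, cumulative frequency).
# The first point (0, 0) marks the lower boundary of the first class.
cf_points = [(0, 0), (10, 4), (20, 13), (30, 19), (40, 20)]

def read_value(cf_points, target):
    """Estimate the data value at a given cumulative frequency.

    Interpolates linearly between neighbouring plotted points, which only
    approximates a reading taken from a smooth curve.
    """
    for (x0, c0), (x1, c1) in zip(cf_points, cf_points[1:]):
        if c0 <= target <= c1 and c1 > c0:
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target lies outside the cumulative frequency range")

n = cf_points[-1][1]                      # total frequency
q1 = read_value(cf_points, 0.25 * n)      # lower quartile, read at 0.25n
median = read_value(cf_points, 0.5 * n)   # median, read at 0.5n
q3 = read_value(cf_points, 0.75 * n)      # upper quartile, read at 0.75n
print(round(q1, 1), round(median, 1), round(q3, 1))
```

On paper the equivalent step is drawing a horizontal line from the cumulative frequency axis to the curve, then a vertical line down to the data axis.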

Outliers and Skewness (Higher Tier)

Sometimes the highest or lowest values in a dataset do not fit the general pattern and are classified as outliers. Edexcel commonly uses the "$1.5 \times \text{IQR}$" rule to identify them, where the interquartile range (IQR) is calculated as $Q_3 - Q_1$.

A value is an outlier if it falls outside these boundaries:

Lower Boundary: $< Q_1 - (1.5 \times \text{IQR})$
Upper Boundary: $> Q_3 + (1.5 \times \text{IQR})$

On a box plot, outliers are marked with a distinct cross (X). If an outlier is present, the whisker must end at the smallest or largest value in the dataset that is not an outlier, rather than extending to the outlier boundary itself.
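A minimal sketch of the outlier check and the whisker rule is shown below. The dataset is made up, the quartiles are taken as given rather than computed, and the function names are hypothetical.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) outlier boundaries from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def split_outliers(data, q1, q3):
    """Separate data into non-outliers and outliers."""
    lower, upper = outlier_fences(q1, q3)
    inside = [x for x in data if lower <= x <= upper]
    outliers = [x for x in data if x < lower or x > upper]
    return inside, outliers

# Hypothetical ordered dataset with its quartiles assumed to be known.
data = [2, 10, 12, 13, 14, 15, 15, 16, 17, 18, 30]
q1, q3 = 12, 17

inside, outliers = split_outliers(data, q1, q3)
whisker_ends = (min(inside), max(inside))

print(outlier_fences(q1, q3))  # (4.5, 24.5)
print(outliers)                # [2, 30]
print(whisker_ends)            # (10, 18)
```

Here the whiskers stop at 10 and 18, the most extreme values that are not outliers, while 2 and 30 would each be marked with a cross.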

The position of the median inside the box also indicates skewness:

Positive Skew: The right tail is longer, and the median is closer to the lower quartile.
Negative Skew: The left tail is longer, and the median is closer to the upper quartile.
Symmetrical: The median is exactly in the center of the box.
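The three cases above can be expressed as a simple comparison of the two halves of the box. This is a sketch only: `describe_skew` is a hypothetical helper, it looks at the box alone and ignores the whiskers, and the numbers are invented.

```python
def describe_skew(q1, median, q3):
    """Classify skew from where the median sits inside the box."""
    lower_gap = median - q1   # distance from lower quartile to median
    upper_gap = q3 - median   # distance from median to upper quartile
    if lower_gap < upper_gap:
        return "positive skew"
    if lower_gap > upper_gap:
        return "negative skew"
    return "symmetrical"

print(describe_skew(10, 12, 20))  # positive skew
print(describe_skew(10, 18, 20))  # negative skew
print(describe_skew(10, 15, 20))  # symmetrical
```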

Comparing Distributions

When asked to compare two distributions of data, Edexcel mark schemes strictly require exactly two distinct points of comparison. You must interpret these comparisons in the context of the original problem.

Use the following paired structure to ensure you secure full marks:

Comparison Point	Metric to Use	What It Means in Context
Central Tendency (Average)	Compare the Medians. Do not use the mean when comparing box plots.	States which group was higher/lower on average. Example: "The median for Class A (15) is higher than Class B (12), so on average, Class A scored more marks."
Spread (Consistency)	Compare the Interquartile Range (IQR) or the Range.	A smaller IQR/Range means the data is more consistent. Example: "The IQR for Class A (4) is lower than Class B (7), so Class A's marks were more consistent."


Exam Tips

1. Common Mistake: Students often lose marks by plotting cumulative frequency at the midpoint of the interval. It must always be plotted at the upper boundary.
2. In 'Compare' questions, simply listing the values of the median or IQR will score 0 marks. You must use comparative words like 'higher than' or 'smaller than' alongside the specific numbers.
3. When asked to compare the spread, always explicitly state that a smaller IQR or range means the data is 'more consistent' or has 'less variation'. Avoid vague terms like 'better' or 'worse'.
4. For Higher Tier histograms, you will often be awarded a specific B1 or C1 mark just for correctly labeling the y-axis as 'Frequency Density'.
5. Edexcel mark schemes generally require a smooth curve for cumulative frequency graphs. Drawing straight lines between the points with a ruler may result in losing marks.

Key Terms

Univariate data: Data that involves a single variable, such as the heights of a group of students.
Discrete data: Data that can only take specific, separate values, such as shoe sizes or the number of children in a family.
Continuous data: Data that can take any value within a range and is usually measured, such as time, mass, or height.
Histogram: A graphical representation of continuous grouped data where the area of the bar represents the frequency and there are no gaps between bars.
Frequency density: The height of a bar in a histogram, calculated by dividing the frequency by the class width.
Class interval: The range of values defining a specific group in a grouped frequency table.
Midpoint: The exact halfway value of a class interval, used for plotting frequency polygons.
Cumulative frequency: A running total of frequencies, plotted at the upper boundary of class intervals to form a smooth S-shaped curve.
Box plot: A diagram that summarizes the distribution of a dataset using a five-number summary.
Five-number summary: The five key values needed to draw a box plot: minimum, lower quartile, median, upper quartile, and maximum.
Lower quartile (Q₁): The value that is one-quarter of the way through an ordered dataset.
Median: The middle value of an ordered dataset, representing the central tendency.
Upper quartile (Q₃): The value that is three-quarters of the way through an ordered dataset.
Interquartile range (IQR): A measure of spread representing the middle 50% of the data, calculated by subtracting the lower quartile from the upper quartile.
Spread: A measure of how scattered or varied the data points are, typically described using the range or interquartile range.
Outlier: An extreme value that does not fit the general pattern of a dataset.
Skewness: A measure of the asymmetry of a distribution, visually indicated on a box plot by the position of the median relative to the quartiles.
