Data compression reduces the number of bits needed to store or transmit a file. There are two fundamentally different approaches — lossy compression permanently discards some data for a smaller file, while lossless compression shrinks the file without losing a single bit, allowing perfect reconstruction.

Why does data compression matter?

Without compression, modern digital life would grind to a halt. A single uncompressed Full HD video frame (1920 × 1080 pixels at 24-bit colour) requires roughly 6 MB of storage; a one-hour film at 24 frames per second would need more than 500 GB. Compression brings that down to a few gigabytes for streaming. The GCSE Computer Science specifications (AQA, OCR, Edexcel) all require students to explain compression methods, calculate file sizes before and after compression, and evaluate which method suits a given context.
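The figures above can be checked with a short calculation. This is a minimal sketch that assumes a 1920 × 1080 frame with 3 bytes per pixel and ignores audio; the function names are illustrative only.

```python
# Assumed frame format: 1920 x 1080 pixels, 24-bit colour (3 bytes per pixel).
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 3
FPS = 24
SECONDS = 60 * 60  # one hour


def frame_size_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Uncompressed size of one frame in bytes."""
    return width * height * bytes_per_pixel


def video_size_bytes(frame_bytes: int, fps: int, seconds: int) -> int:
    """Uncompressed size of the picture data for a whole video (no audio)."""
    return frame_bytes * fps * seconds


frame = frame_size_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL)
film = video_size_bytes(frame, FPS, SECONDS)

print(f"One frame: {frame / 1_000_000:.1f} MB")      # One frame: 6.2 MB
print(f"One hour:  {film / 1_000_000_000:.0f} GB")   # One hour:  537 GB
```

Here 1 MB is taken as 1,000,000 bytes; using 1,048,576 bytes per MB gives slightly smaller numbers but the same conclusion.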

What is lossless compression?

Lossless compression reduces file size using patterns and redundancy in the data, without permanently removing any information. When the file is decompressed, it is bit-for-bit identical to the original.

Run-length encoding (RLE)

RLE replaces consecutive repeated values with a count and the value itself.

Worked example — compress the pixel row: WWWWWWBBBBWWB

Instead of storing all 13 characters, RLE stores four count-and-value pairs: 6W 4B 2W 1B (8 characters; the spaces are only there for readability).

| Original | Compressed |
| --- | --- |
| WWWWWWBBBBWWB | 6W 4B 2W 1B |
| 13 characters | 8 characters (saves ~38%) |

RLE is highly effective for images with large blocks of identical colour — for example, simple diagrams, logos, or screenshots with flat backgrounds. It is less effective for photographs, where pixel values change constantly.
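The RLE worked example can be reproduced in a few lines of Python. This is a simple sketch rather than a real image codec: the function names are made up for illustration, and it stores each run as a (count, value) pair.

```python
from itertools import groupby


def rle_encode(text: str) -> list[tuple[int, str]]:
    """Collapse each run of identical characters into a (count, value) pair."""
    return [(len(list(run)), value) for value, run in groupby(text)]


def rle_decode(pairs: list[tuple[int, str]]) -> str:
    """Rebuild the original text by repeating each value 'count' times."""
    return "".join(value * count for count, value in pairs)


def rle_to_text(pairs: list[tuple[int, str]]) -> str:
    """Write the pairs in the '6W 4B' style used in the worked example."""
    return " ".join(f"{count}{value}" for count, value in pairs)


row = "WWWWWWBBBBWWB"
pairs = rle_encode(row)
print(rle_to_text(pairs))           # 6W 4B 2W 1B
print(rle_decode(pairs) == row)     # True (lossless round trip)

# Data with no runs gets bigger, not smaller:
print(rle_to_text(rle_encode("WBWBWB")))  # 1W 1B 1W 1B 1W 1B
```

The last line shows why RLE suits flat-colour images but not photographs: when neighbouring values keep changing, every value needs its own count.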

Huffman coding

Huffman coding assigns shorter binary codes to frequently occurring characters and longer codes to rare characters, reducing the total number of bits needed.

Worked example — encode the string AABACABA using Huffman coding:

First, count frequencies:

| Character | Frequency |
| --- | --- |
| A | 5 |
| B | 2 |
| C | 1 |

A Huffman tree assigns codes by merging the two lowest-frequency items repeatedly:

  • Merge C (1) and B (2) → node (3)
  • Merge node (3) and A (5) → root (8)

Resulting codes: A = 0, B = 10, C = 11

| Character | Fixed 2-bit code | Huffman code |
| --- | --- | --- |
| A | 00 | 0 (1 bit) |
| B | 01 | 10 (2 bits) |
| C | 10 | 11 (2 bits) |

Encoding AABACABA with fixed 2-bit codes: 8 × 2 = 16 bits

Encoding with Huffman: 5×1 + 2×2 + 1×2 = 5 + 4 + 2 = 11 bits, a saving of about 31%.
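The merging process can be sketched in Python with a priority queue. This is one possible implementation, not the only one: the function name is illustrative, and the choice to label the less frequent branch "1" and the more frequent branch "0" is an assumption made so the output matches the worked example. Other labellings give different bit patterns but the same code lengths.

```python
import heapq
from collections import Counter


def huffman_codes(text: str) -> dict[str, str]:
    """Build a prefix code by repeatedly merging the two least frequent nodes."""
    counts = Counter(text)
    # Each heap entry is (frequency, unique_id, {symbol: code_so_far}).
    # The unique id breaks ties so the dictionaries are never compared.
    heap = [(freq, i, {sym: ""})
            for i, (sym, freq) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        # Only one distinct symbol: it still needs a 1-bit code.
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        f_low, _, low = heapq.heappop(heap)     # least frequent node
        f_high, _, high = heapq.heappop(heap)   # next least frequent node
        # Prefix every code in each branch with that branch's bit.
        merged = {sym: "0" + code for sym, code in high.items()}
        merged.update({sym: "1" + code for sym, code in low.items()})
        heapq.heappush(heap, (f_low + f_high, next_id, merged))
        next_id += 1
    return heap[0][2]


message = "AABACABA"
codes = huffman_codes(message)
encoded = "".join(codes[ch] for ch in message)

print(codes)         # {'A': '0', 'B': '10', 'C': '11'}
print(encoded)       # 00100110100
print(len(encoded))  # 11
```

Because no code is the start of another code (a "prefix code"), the bit string can be decoded unambiguously without any separators between characters.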

Huffman coding is used in ZIP archives and PNG images (both through the DEFLATE algorithm), and it is also one of the final encoding stages in the MP3 and JPEG standards.

What is lossy compression?

Lossy compression permanently removes data judged to be imperceptible or less important. The decompressed file is not identical to the original — some quality is sacrificed for a much smaller file size.

How lossy compression works in JPEG images

JPEG (Joint Photographic Experts Group) compression:

  1. Converts the colour information into separate brightness and colour channels, and usually stores the colour channels at reduced detail (the human eye is more sensitive to brightness than colour).
  2. Splits each channel into 8×8 pixel blocks.
  3. Applies a mathematical transform (the Discrete Cosine Transform) to each block, describing it as a mix of low-frequency (smooth) and high-frequency (fine) detail.
  4. Discards high-frequency fine detail at a level controlled by a "quality factor."

At high JPEG quality (90%), the human eye cannot usually detect the difference from the original. At low quality (20–40%), obvious blocky artefacts ("compression artefacts") appear.
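Steps 3 and 4 can be illustrated with a much simplified, one-dimensional sketch. Real JPEG applies a two-dimensional transform to each 8×8 block and divides the results by a quantisation table rather than simply zeroing them, so the code below is an assumption-laden teaching model with made-up function names, not the JPEG algorithm itself.

```python
import math


def dct(samples: list[float]) -> list[float]:
    """Orthonormal DCT-II: turn N samples into N frequency coefficients."""
    n_total = len(samples)
    out = []
    for k in range(n_total):
        scale = math.sqrt(1 / n_total) if k == 0 else math.sqrt(2 / n_total)
        total = sum(x * math.cos(math.pi * (2 * n + 1) * k / (2 * n_total))
                    for n, x in enumerate(samples))
        out.append(scale * total)
    return out


def idct(coeffs: list[float]) -> list[float]:
    """Inverse of dct(): rebuild the samples from the coefficients."""
    n_total = len(coeffs)
    out = []
    for n in range(n_total):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1 / n_total) if k == 0 else math.sqrt(2 / n_total)
            total += scale * c * math.cos(math.pi * (2 * n + 1) * k / (2 * n_total))
        out.append(total)
    return out


def drop_high_frequencies(coeffs: list[float], keep: int) -> list[float]:
    """Crude stand-in for quantisation: keep the first 'keep' coefficients."""
    return [c if k < keep else 0.0 for k, c in enumerate(coeffs)]


# One row of 8 brightness values from a smooth gradient (illustrative data).
row = [10, 20, 30, 40, 50, 60, 70, 80]
coeffs = dct(row)
approx = idct(drop_high_frequencies(coeffs, keep=4))

print([round(c, 2) for c in coeffs])   # most of the size is in the first few
print([round(x, 1) for x in approx])   # close to, but not exactly, the original
```

Only four of the eight coefficients are kept, yet the rebuilt row stays close to the original. That small, permanent difference is what makes the method lossy.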

How lossy compression works in MP3 audio

MP3 (MPEG Audio Layer III) removes sounds that human hearing is least sensitive to:

  • Very high and very low frequencies beyond normal hearing range.
  • Quiet sounds that occur at the same time as loud sounds (the loud sound "masks" the quiet one).

A high-bitrate MP3 (320 kbps) is nearly indistinguishable from a CD for most listeners; a lower-bitrate MP3 (128 kbps) can sound noticeably thinner.
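Bitrate translates directly into file size, which is a common exam calculation. The sketch below assumes a 3-minute track and a constant bitrate; the CD figure comes from the standard CD audio format (44,100 samples per second, 16 bits per sample, 2 channels). The function name is illustrative.

```python
def audio_size_mb(bitrate_kbps: float, seconds: float) -> float:
    """File size in MB (1 MB = 1,000,000 bytes) at a constant bitrate."""
    bits = bitrate_kbps * 1000 * seconds
    return bits / 8 / 1_000_000


# Uncompressed CD audio: 44,100 samples/s x 16 bits x 2 channels.
CD_KBPS = 44_100 * 16 * 2 / 1000   # 1411.2 kbps
TRACK_SECONDS = 3 * 60             # assumed track length

for label, kbps in [("CD (uncompressed)", CD_KBPS),
                    ("MP3 320 kbps", 320),
                    ("MP3 128 kbps", 128)]:
    print(f"{label}: {audio_size_mb(kbps, TRACK_SECONDS):.2f} MB")
```

For this assumed track the three sizes come out at roughly 31.75 MB, 7.20 MB and 2.88 MB, so even the high-bitrate MP3 is less than a quarter of the uncompressed size.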

Lossy vs lossless: which to use?

| Scenario | Best method | Why |
| --- | --- | --- |
| Medical scan (X-ray, MRI) | Lossless | Any data loss could affect diagnosis |
| Text document or spreadsheet | Lossless | Even one wrong character could corrupt meaning |
| Photograph for web display | Lossy (JPEG) | Small quality loss is invisible; file size saving is large |
| Music streaming | Lossy (MP3/AAC) | Removed frequencies are inaudible; bandwidth saving is substantial |
| Simple diagram or logo | Lossless (PNG) | Flat colour areas compress well losslessly; lossy would blur edges |
| Video for streaming | Lossy (H.264/HEVC) | Lossless video at HD resolution is impractical in size |

How do you calculate compression ratio?

The compression ratio compares original size to compressed size:

Compression ratio = original size ÷ compressed size

Example: A bitmap image is 2,400 KB. After JPEG compression it is 240 KB.

Compression ratio = 2,400 ÷ 240 = 10, written as 10:1

The space saving is: (1 − 240/2,400) × 100 = 90%
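Both formulas are easy to wrap in small helper functions for checking answers. This is a minimal sketch; the function names are illustrative, and the sizes can be in any unit as long as both use the same one.

```python
def compression_ratio(original_size: float, compressed_size: float) -> float:
    """How many times smaller the compressed file is than the original."""
    return original_size / compressed_size


def space_saving_percent(original_size: float, compressed_size: float) -> float:
    """Percentage of the original size that compression removed."""
    return (1 - compressed_size / original_size) * 100


# The bitmap example: 2,400 KB compressed to 240 KB.
print(f"{compression_ratio(2400, 240):.0f}:1")      # 10:1
print(f"{space_saving_percent(2400, 240):.0f}%")    # 90%

# The RLE example: 13 characters compressed to 8.
print(f"{space_saving_percent(13, 8):.0f}%")        # 38%
```

A ratio greater than 1 means the file shrank; a ratio below 1 means "compression" actually made it bigger, which can happen when RLE is applied to data with no runs.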

Frequently asked questions

Can you decompress a lossy file to get the original back?

No. By definition, lossy compression permanently discards data. Once a JPEG has been saved at low quality, the discarded pixel information is gone. This is why professionals keep original RAW or lossless files and only export compressed versions for sharing. Repeatedly saving a JPEG at low quality (opening it, editing, resaving) degrades quality further each time.

What is the difference between PNG and JPEG?

PNG uses lossless compression; JPEG uses lossy compression. PNG is better for images with sharp edges, text, logos, or flat colour blocks because compression artefacts from JPEG blur these details. JPEG produces much smaller files for photographs where smooth colour gradients dominate. PNG also supports a transparent background channel (alpha channel); JPEG does not.

Is a ZIP file an example of lossless or lossy compression?

ZIP is lossless. Its usual compression method, DEFLATE, combines LZ77 (a dictionary-based method that replaces repeated sequences with references to earlier copies) with Huffman coding, so files shrink without losing any data. When you unzip a ZIP archive, every file inside is identical to the original. This is why ZIP is appropriate for compressing executable programs, documents, or spreadsheets: corrupting even a single bit in a program can make it unusable.
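The lossless round trip can be demonstrated with Python's built-in zlib module, which implements the same DEFLATE method that ZIP archives normally use (zlib wraps the data in its own container rather than the ZIP file format). The sample data here is made up and deliberately repetitive.

```python
import zlib

# Highly repetitive sample data, so there is plenty of redundancy to remove.
original = b"WWWWWWBBBBWWB" * 200

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original))          # 2600 bytes before compression
print(len(compressed))        # far fewer bytes after compression
print(restored == original)   # True: every byte comes back unchanged
```

The exact compressed size depends on the zlib version and compression level, but the decompressed data is always byte-for-byte identical to the input.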

Why does repeatedly saving a JPEG make it worse?

Each time a JPEG is saved, the compression algorithm runs again and discards a new set of fine details from whatever data remains. After many cycles of open-edit-save, the cumulative data loss becomes clearly visible as blocky, blurry artefacts. This is called "generation loss." It is why image editors work on lossless formats (PSD, TIFF, PNG) internally and only export to JPEG as a final step.

