Data compression reduces the number of bits needed to store or transmit a file. There are two fundamentally different approaches — lossy compression permanently discards some data for a smaller file, while lossless compression shrinks the file without losing a single bit, allowing perfect reconstruction.
Why does data compression matter?
Without compression, modern digital life would grind to a halt. A single uncompressed 1080p HD video frame (1920 × 1080 pixels at 3 bytes per pixel) requires roughly 6 MB of storage; a one-hour film at 24 frames per second would need more than 500 GB. Compression brings that down to a few gigabytes for streaming. The GCSE Computer Science specifications (AQA, OCR, Edexcel) all require students to explain compression methods, calculate file sizes before and after compression, and evaluate which method suits a given context.
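The storage figures above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes a 1920 × 1080 frame with 3 bytes (24 bits) per pixel and decimal units (1 MB = 1,000,000 bytes); the constant names are illustrative only.

```python
# Assumed parameters: 1920x1080 frame, 24-bit colour (3 bytes per pixel).
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 3
FPS = 24
SECONDS = 60 * 60  # one hour

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
film_bytes = frame_bytes * FPS * SECONDS

print(f"One frame: {frame_bytes / 1_000_000:.1f} MB")      # 6.2 MB
print(f"One hour:  {film_bytes / 1_000_000_000:.0f} GB")   # 537 GB
```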
What is lossless compression?
Lossless compression reduces file size using patterns and redundancy in the data, without permanently removing any information. When the file is decompressed, it is bit-for-bit identical to the original.
Run-length encoding (RLE)
RLE replaces consecutive repeated values with a count and the value itself.
Worked example — compress the pixel row: WWWWWWBBBBWWB
Instead of storing all 13 characters, RLE stores: 6W 4B 2W 1B
| Original | Compressed |
|---|---|
| WWWWWWBBBBWWB | 6W 4B 2W 1B |
| 13 characters | 8 characters (saves ~38%) |
RLE is highly effective for images with large blocks of identical colour — for example, simple diagrams, logos, or screenshots with flat backgrounds. It is less effective for photographs, where pixel values change constantly.
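The worked example can be reproduced in code. The sketch below is one simple way to write RLE in Python, not a standard file format; the function names `rle_encode` and `rle_decode` are made up for illustration.

```python
from itertools import groupby

def rle_encode(data: str) -> list[tuple[int, str]]:
    """Return a (count, value) pair for each run of identical characters."""
    return [(len(list(run)), value) for value, run in groupby(data)]

def rle_decode(pairs: list[tuple[int, str]]) -> str:
    """Rebuild the original string by repeating each value count times."""
    return "".join(value * count for count, value in pairs)

row = "WWWWWWBBBBWWB"
pairs = rle_encode(row)
encoded_text = " ".join(f"{count}{value}" for count, value in pairs)

print(encoded_text)              # 6W 4B 2W 1B
assert rle_decode(pairs) == row  # lossless: decoding restores every character
```

Running the same functions on a string with no repeats, such as "WBWBWB", produces six pairs for six characters, which shows why RLE can make constantly changing data larger rather than smaller.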
Huffman coding
Huffman coding assigns shorter binary codes to frequently occurring characters and longer codes to rare characters, reducing the total number of bits needed.
Worked example — encode the string AABACABA using Huffman coding:
First, count frequencies:
| Character | Frequency |
|---|---|
| A | 5 |
| B | 2 |
| C | 1 |
A Huffman tree assigns codes by merging the two lowest-frequency items repeatedly:
- Merge C (1) and B (2) → node (3)
- Merge node (3) and A (5) → root (8)
Resulting codes: A = 0, B = 10, C = 11
| Character | Fixed 2-bit code | Huffman code |
|---|---|---|
| A | 00 | 0 (1 bit) |
| B | 01 | 10 (2 bits) |
| C | 10 | 11 (2 bits) |
Encoding AABACABA with fixed 2-bit codes: 8 × 2 = 16 bits
Encoding with Huffman: 5×1 + 2×2 + 1×2 = 5 + 4 + 2 = 11 bits — a saving of 31%.
Huffman coding is used in ZIP archives and PNG images, and it forms part of the MP3 and JPEG standards.
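The merging process from the worked example can be sketched with a priority queue. This is an illustrative implementation, assuming Python's standard `heapq` module; the function name `huffman_codes` is hypothetical. The exact 0/1 patterns depend on which branch is labelled 0, so they may differ from the table above, but the code lengths and the total bit count are the same.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Build a prefix code by repeatedly merging the two least frequent nodes."""
    freq = Counter(text)
    # Each heap entry is (frequency, unique tie-breaker, {symbol: code so far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # Only one distinct symbol: give it a 1-bit code.
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # lowest frequency
        f2, _, right = heapq.heappop(heap)   # second lowest
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

message = "AABACABA"
codes = huffman_codes(message)
total_bits = sum(len(codes[ch]) for ch in message)

print(codes)
print(total_bits, "bits")  # 11 bits, compared with 16 for fixed 2-bit codes
```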
What is lossy compression?
Lossy compression permanently removes data judged to be imperceptible or less important. The decompressed file is not identical to the original — some quality is sacrificed for a much smaller file size.
How lossy compression works in JPEG images
JPEG (Joint Photographic Experts Group) compression:
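- Converts the image into separate brightness and colour channels, and typically stores the colour channels at reduced resolution while keeping full brightness detail (the human eye is more sensitive to brightness than colour).
- Splits each channel into 8×8 pixel blocks.
- Applies a mathematical transform (the Discrete Cosine Transform) to each block, separating smooth, low-frequency content from fine, high-frequency detail.
- Discards high-frequency fine detail at a level controlled by a "quality factor."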
At high JPEG quality (90%), the human eye cannot usually detect the difference from the original. At low quality (20–40%), obvious blocky artefacts ("compression artefacts") appear.
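The transform-then-discard idea can be illustrated on a small scale. The sketch below is a heavy simplification, not the JPEG algorithm itself: it applies a one-dimensional Discrete Cosine Transform to a single row of 8 values (JPEG uses a two-dimensional transform on 8×8 blocks), and it rounds every coefficient with one uniform step size, whereas real JPEG uses a table of different step sizes. The sample row, the `STEP` value, and the function names are all hypothetical.

```python
import math

N = 8  # one row of 8 samples stands in for an 8x8 block

def dct(samples):
    """Orthonormal 1-D DCT-II: pixel values -> frequency coefficients."""
    out = []
    for k in range(N):
        scale = math.sqrt((1 if k == 0 else 2) / N)
        total = sum(x * math.cos(math.pi * (n + 0.5) * k / N)
                    for n, x in enumerate(samples))
        out.append(scale * total)
    return out

def idct(coeffs):
    """Inverse of dct(): frequency coefficients -> pixel values."""
    out = []
    for n in range(N):
        total = sum(math.sqrt((1 if k == 0 else 2) / N) * c
                    * math.cos(math.pi * (n + 0.5) * k / N)
                    for k, c in enumerate(coeffs))
        out.append(total)
    return out

def quantise(coeffs, step):
    """Round each coefficient to a multiple of step; small ones become 0."""
    return [round(c / step) * step for c in coeffs]

row = [100, 102, 104, 106, 108, 110, 112, 114]  # hypothetical smooth brightness ramp
STEP = 10  # hypothetical stand-in for the quality setting: bigger step, more loss

kept = quantise(dct(row), STEP)
restored = idct(kept)

print("non-zero coefficients:", sum(1 for c in kept if c != 0), "of", N)
print("largest pixel error:", max(abs(a - b) for a, b in zip(row, restored)))
```

Most coefficients round to zero and need not be stored, which is where the size saving comes from. The restored values are close to the originals but not identical, which is the lossy part.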
How lossy compression works in MP3 audio
MP3 (MPEG Audio Layer III) removes sounds that human hearing is least sensitive to:
- Very high and very low frequencies beyond normal hearing range.
- Quiet sounds that occur at the same time as loud sounds (the loud sound "masks" the quiet one).
A high-quality MP3 (320 kbps) is nearly indistinguishable from a CD for most listeners; a low-bitrate MP3 (128 kbps) can sound noticeably thinner.
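The bitrates above translate directly into file sizes. This rough sketch assumes a constant bitrate, a hypothetical three-minute track, and decimal units (1 MB = 1,000,000 bytes); the CD figure uses the standard 44.1 kHz, 16-bit, stereo format. The function name `audio_size_mb` is illustrative.

```python
def audio_size_mb(bitrate_kbps: float, seconds: float) -> float:
    """File size in MB for audio stored at a constant bitrate."""
    bits = bitrate_kbps * 1000 * seconds
    return bits / 8 / 1_000_000

TRACK_SECONDS = 180                 # hypothetical three-minute track
CD_KBPS = 44_100 * 16 * 2 / 1000    # sample rate x bit depth x channels = 1411.2 kbps

for label, kbps in [("CD audio", CD_KBPS), ("MP3 320 kbps", 320), ("MP3 128 kbps", 128)]:
    print(f"{label}: {audio_size_mb(kbps, TRACK_SECONDS):.2f} MB")
```

Under these assumptions the uncompressed track is about 31.75 MB, the 320 kbps MP3 is 7.20 MB, and the 128 kbps MP3 is 2.88 MB.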
Lossy vs lossless: which to use?
| Scenario | Best method | Why |
|---|---|---|
| Medical scan (X-ray, MRI) | Lossless | Any data loss could affect diagnosis |
| Text document or spreadsheet | Lossless | Even one wrong character could corrupt meaning |
| Photograph for web display | Lossy (JPEG) | Small quality loss is invisible; file size saving is large |
| Music streaming | Lossy (MP3/AAC) | Removed sounds are largely inaudible; bandwidth saving is substantial |
| Simple diagram or logo | Lossless (PNG) | Flat colour areas compress well losslessly; lossy would blur edges |
| Video for streaming | Lossy (H.264/HEVC) | Lossless video at HD resolution is impractical in size |
How do you calculate compression ratio?
The compression ratio compares original size to compressed size:
Compression ratio = original size ÷ compressed size
Example: A bitmap image is 2,400 KB. After JPEG compression it is 240 KB. Compression ratio = 2,400 ÷ 240 = 10, written as 10:1
The space saving is: (1 − 240/2,400) × 100 = 90%
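Both formulas are easy to wrap in small helper functions. The sketch below uses hypothetical function names and checks them against the figures in this article, including the RLE and Huffman worked examples.

```python
def compression_ratio(original: float, compressed: float) -> float:
    """Original size divided by compressed size (10.0 means 10:1)."""
    return original / compressed

def space_saving_percent(original: float, compressed: float) -> float:
    """Percentage of the original size that compression removed."""
    return (1 - compressed / original) * 100

print(f"{compression_ratio(2400, 240):.0f}:1")     # 10:1
print(f"{space_saving_percent(2400, 240):.0f}%")   # 90%
print(f"{space_saving_percent(13, 8):.0f}%")       # RLE example: 38%
print(f"{space_saving_percent(16, 11):.0f}%")      # Huffman example: 31%
```

Both sizes must be in the same unit (both in KB, both in bits, and so on) for the results to be meaningful.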
Frequently asked questions
Can you decompress a lossy file to get the original back?
No. By definition, lossy compression permanently discards data. Once a JPEG has been saved at low quality, the discarded pixel information is gone. This is why professionals keep original RAW or lossless files and only export compressed versions for sharing. Repeatedly saving a JPEG at low quality (opening it, editing, resaving) degrades quality further each time.
What is the difference between PNG and JPEG?
PNG uses lossless compression; JPEG uses lossy compression. PNG is better for images with sharp edges, text, logos, or flat colour blocks because compression artefacts from JPEG blur these details. JPEG produces much smaller files for photographs where smooth colour gradients dominate. PNG also supports a transparent background channel (alpha channel); JPEG does not.
Is a ZIP file an example of lossless or lossy compression?
ZIP is lossless. Its standard compression method, DEFLATE, combines LZ77 (a dictionary-based method) with Huffman coding to compress files without losing any data. When you unzip a ZIP archive, every file inside is identical to the original. This is why ZIP is appropriate for compressing executable programs, documents, or spreadsheets, where corrupting even a single bit can make a file unusable.
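The lossless round trip can be seen with Python's built-in `zlib` module, which implements DEFLATE, the compression method used by most ZIP archives. This is a small demonstration on made-up repetitive data, not a full ZIP archive.

```python
import zlib

original = b"WWWWWWBBBBWWB" * 200   # hypothetical, highly repetitive sample data
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), "bytes ->", len(compressed), "bytes")
print("identical after round trip:", restored == original)
```

The compressed size depends on the data: repetitive input like this shrinks dramatically, while data that is already compressed or random may barely shrink at all.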
Why does repeatedly saving a JPEG make it worse?
Each time a JPEG is saved, the compression algorithm runs again and discards a new set of fine details from whatever data remains. After many cycles of open-edit-save, the cumulative data loss becomes clearly visible as blocky, blurry artefacts. This is called "generation loss." It is why image editors work on lossless formats (PSD, TIFF, PNG) internally and only export to JPEG as a final step.