2.2 Data Compression

Written by the Fiveable Content Team • Last updated September 2025
Verified for the 2026 exam

The need for hexadecimal representations of binary values brings up an important point about digital data: there's a lot of it, and it can quickly eat into the number of bytes you have available for storage (no pun intended). Large files like photos and videos are also notoriously difficult to send over text or email.

That's why data compression is needed. Data compression is the process (or set of processes) by which the size of shared or stored data is reduced; in essence, it reduces the number of bits needed to represent the data.

The amount you can shrink your file by depends on two things:

  1. the amount of redundancy, or repeated information, that you can remove from your original data
  2. the method you use to compress your file

Many data compression methods work by using symbols to shorten the data. (The quick sketch below shows how much more a compressor can shrink highly redundant data than data with little repetition.)
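Here's a minimal sketch using Python's built-in zlib module, a general-purpose lossless compressor that isn't itself part of the AP course. The repetitive input shrinks dramatically, while the random input has essentially no redundancy to remove:

    import os
    import zlib

    # A minimal sketch, not an AP requirement: zlib is a lossless compressor
    # built into Python. Redundant data shrinks a lot; random data has no
    # redundancy to remove, so it barely shrinks (and may even grow slightly).
    redundant = b"FIVEABLE! " * 100      # 1,000 bytes full of repeated patterns
    random_ish = os.urandom(1000)        # 1,000 bytes with essentially no redundancy

    for label, data in [("redundant", redundant), ("random", random_ish)]:
        packed = zlib.compress(data)
        saved = 100 * (1 - len(packed) / len(data))
        print(f"{label}: {len(data)} bytes -> {len(packed)} bytes ({saved:.0f}% smaller)")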

Data Compression Examples

A simple form of data compression is known as run-length encoding. It works by replacing repeating data, such as colors in an image or letters in a document, with a run that represents the number and value of the repeated data. For example, the string "FFFFFIIIIIIVVVVVVVEEEE" would be stored as 5F6I7V4E, greatly reducing the number of bytes needed to store it. 
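Here's a minimal run-length encoder and decoder sketched in Python. It stores the runs as text for readability; real encoders (such as those used for bitmaps) pack the counts and values as binary.

    def rle_encode(text):
        """Replace each run of a repeated character with its count
        followed by the character (e.g., 'FFFFF' -> '5F')."""
        encoded = []
        i = 0
        while i < len(text):
            run_char = text[i]
            run_length = 1
            while i + run_length < len(text) and text[i + run_length] == run_char:
                run_length += 1
            encoded.append(f"{run_length}{run_char}")
            i += run_length
        return "".join(encoded)

    def rle_decode(encoded):
        """Reverse the encoding above (assumes each count is followed by
        exactly one non-digit character)."""
        decoded = []
        count = ""
        for ch in encoded:
            if ch.isdigit():
                count += ch
            else:
                decoded.append(ch * int(count))
                count = ""
        return "".join(decoded)

    print(rle_encode("FFFFFIIIIIIVVVVVVVEEEE"))   # 5F6I7V4E
    print(rle_decode("5F6I7V4E"))                 # FFFFFIIIIIIVVVVVVVEEEE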

Run-length encoding is used to compress some simple images, such as bitmaps. It is also used by fax machines. 

Another method of data compression that replaces repeating data with symbols is known as the LZW compression algorithm. It's used to compress text and images, most notably in GIFs.
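The dictionary-building idea behind LZW fits in a short Python sketch. This is a simplified illustration of the algorithm, not the exact variable-width binary coding that GIF files use:

    def lzw_encode(text):
        """Simplified LZW: build a dictionary of substrings seen so far and
        emit the dictionary code for the longest known match each time."""
        dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
        next_code = len(dictionary)
        current = ""
        codes = []
        for ch in text:
            candidate = current + ch
            if candidate in dictionary:
                current = candidate                 # keep extending the match
            else:
                codes.append(dictionary[current])   # emit code for the known prefix
                dictionary[candidate] = next_code   # learn the new, longer string
                next_code += 1
                current = ch
        if current:
            codes.append(dictionary[current])
        return codes

    print(lzw_encode("ABABABABAB"))   # [0, 1, 2, 4, 3, 1] -- 6 codes for 10 characters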

Let's take a look at two of the most common compression types: lossless and lossy.

Lossless vs. Lossy Data Compression

Lossless Data Compression

Lossless compression algorithms allow you to reduce your file size without sacrificing any of the original data in the process. You can restore your original file exactly if you want. Run-length encoding and the LZW algorithm are both examples of lossless compression because they only shorten data to compress it, and all the information remains the same.

If your main concern is the quality of your file or if you need to be able to reconstruct your original file, lossless algorithms are usually the better option. 

This type of compression might be important in databases, where a difference between a compressed and an uncompressed file could skew the information being represented.

This concern about skew also applies to both medical and satellite imaging, where small differences in the data could have large impacts. Many software downloads also use lossless compression methods because the programs need to be recreated exactly on your computer in order to work.

Lossy Data Compression

In contrast, lossy compression algorithms sacrifice some data in order to achieve greater compression than you can achieve with a lossless method. They usually do this by removing details, such as replacing similar colors with the same one in a photo.

[Image: tacos before and after lossy compression (photo by Krisztian Tabori on Unsplash). Although the compressed version uses 62% less data than the original, the two images look practically identical.]
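As a toy illustration of how "replacing similar colors with the same one" works, here's a hypothetical sketch (not any real image format) that keeps only the top few bits of each 8-bit color value, so nearby colors collapse into one. The discarded bits can never be recovered, which is exactly what makes the method lossy.

    def quantize_channel(value, keep_bits=3):
        """Toy lossy step: keep only the top `keep_bits` bits of an 8-bit
        color channel, so nearby values collapse to the same one.
        The discarded low bits are gone for good -- that's what makes it lossy."""
        drop = 8 - keep_bits
        return (value >> drop) << drop

    pixels = [200, 203, 198, 57, 60, 255]          # hypothetical 8-bit channel values
    print([quantize_channel(p) for p in pixels])   # [192, 192, 192, 32, 32, 224]

Real lossy formats such as JPEG pair a quantization step like this with a lossless entropy-coding pass afterward, which is why the two approaches so often appear together.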

If your main concern is minimizing how big your file is or how long it'll take to send or receive it, go with a lossy method! A lot of lossy compression methods make changes that are barely detectable or even undetectable to your average viewer, and they can save you a lot of space. Lossy compression is commonly used for photos, audio, and video, especially for downloading purposes.

Although there are two main types of data compression, you don't have to choose just one. Indeed, many modern compression software systems use a combination of the two methods in some way.

Frequently Asked Questions

What's the difference between lossless and lossy compression?

Lossless vs. lossy is about whether you can perfectly reconstruct the original data after compression. Lossless compression (Huffman, Lempel–Ziv–Welch, run-length) reduces bits by removing redundancy but guarantees complete reconstruction—good when exact data matters (text, code, some medical images). Lossy compression (JPEG, MP3, perceptual coding) reduces bits much more by throwing away information that’s less noticeable to humans; you get an approximation of the original and may see artifacts at high compression. Key CED points: lossless = guaranteed reconstruction (EK DAT-1.D.4); lossy = approximation, smaller files, more artifacts (EK DAT-1.D.5–D.6). Choose lossless when quality/accuracy is critical; choose lossy when minimizing size or transmission time matters (EK DAT-1.D.7–D.8). For more AP-aligned review, see the Topic 2.2 study guide (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and practice problems (https://library.fiveable.me/practice/ap-computer-science-principles).

How do I know when to use lossless vs lossy compression for the AP exam?

Use lossless when you must be able to reconstruct the original data exactly (text, source code, medical images, legal records). That aligns with EK DAT-1.D.4 and EK DAT-1.D.7—quality/reconstruction is maximally important. Use lossy when file size or transmission time matters more than a perfect copy (photos for web, streaming audio/video). EK DAT-1.D.5–D.6 explain lossy gives much greater size reduction but only an approximation and can introduce artifacts (e.g., JPEG, MP3). On the exam, pick the option that matches the tradeoff described: if the question emphasizes preserving exact information, choose lossless; if it emphasizes smallest size or faster transmission, choose lossy (EK DAT-1.D.8). Remember common examples (PNG, Huffman/LZW/RLE = lossless; JPEG, MP3, perceptual coding = lossy). Review Topic 2.2 on Fiveable for quick reminders (study guide: https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and practice questions (https://library.fiveable.me/practice/ap-computer-science-principles).

I'm confused about how compression can reduce file size but still keep all the information - how does that work?

Compression works because files usually contain redundancy—repeated or predictable patterns—so you can represent those patterns with fewer bits without losing the actual information. Lossless algorithms (like run-length encoding, Huffman, or Lempel–Ziv–Welch) replace repeated or probable patterns with shorter codes and store a dictionary or codebook so the exact original bits can be reconstructed later (CED EK DAT-1.D.4, EK DAT-1.D.3). Lossy algorithms (like JPEG or MP3) go further by removing information that people rarely notice (perceptual coding), so you get much smaller files but only an approximation of the original (CED EK DAT-1.D.5–D.6). Which to use depends on the goal: if perfect reconstruction matters, pick lossless; if minimizing size or bitrate matters and small quality loss is acceptable, pick lossy (CED EK DAT-1.D.7–D.8). For AP review, check the Topic 2.2 study guide (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and practice problems (https://library.fiveable.me/practice/ap-computer-science-principles).

What's the syntax for calculating compression ratio in a programming problem?

Compression ratio is just a math formula you implement in code. Two common forms:

- Ratio = original_size / compressed_size (e.g., 10 MB / 2 MB = 5 → "5:1")
- Percent reduction = (1 - compressed_size / original_size) * 100 (e.g., (1 - 2/10) * 100 = 80% smaller)

Make sure sizes use the same units (bits or bytes). If you need integers, cast to float to avoid integer division. Examples:

- Python:

      ratio = original_size / compressed_size
      percent = (1 - compressed_size / original_size) * 100

- Java:

      double ratio = (double) originalSize / compressedSize;
      double percent = (1.0 - (double) compressedSize / originalSize) * 100.0;

This aligns with LO DAT-1.D (compare algorithms using size reduction). For AP-style problems, report the ratio (like 5:1) or percent reduction and state units. For more review, see the Topic 2.2 study guide (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and practice problems (https://library.fiveable.me/practice/ap-computer-science-principles).

Why would anyone choose lossy compression if it loses data?

Lossy compression is chosen because the benefits (much smaller files and faster transmission) often outweigh the cost of imperfect reconstruction. Perceptual coding (used in JPEG and MP3) removes bits people don’t notice—tiny color shifts or sounds outside human hearing—so files can be “multiple times” smaller than lossless. That matters for streaming, web images, mobile storage, and real-time video where bandwidth or storage is limited. AP CED alignment: EK DAT-1.D.5–.6 explain lossy gives big size reductions but only approximate the original; EK DAT-1.D.7–.8 say use lossless when exact reconstruction matters and lossy when minimizing size/time matters. Tradeoffs include artifacts (visible blockiness or audio distortion) and lower bitrate; choose based on acceptable quality for the task. For more on comparing algorithms and examples, see the Topic 2.2 study guide (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and practice questions (https://library.fiveable.me/practice/ap-computer-science-principles).

Can someone explain what redundancy means in data compression with a simple example?

Redundancy means “extra” or repeated bits in data that don’t add new information and can be removed or encoded more efficiently. Simple example: the text “AAAAABBBBCC” has lots of repeats. Run-length encoding (a lossless method) can compress it to “5A4B2C”—fewer bits but you can fully reconstruct the original. High redundancy → bigger compression ratio (more size reduction). Low redundancy (like random data) gives little or no compression because there’s less repeatable pattern; Shannon entropy measures that. For AP CSP, remember EK DAT-1.D.3: how much you can compress depends on redundancy and the algorithm used, and EK DAT-1.D.4/5: lossless vs. lossy tradeoffs. If you want more examples and practice problems for Topic 2.2, check the Fiveable study guide (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and the unit page (https://library.fiveable.me/ap-computer-science-principles/unit-2).

How do I write code to compare two compression algorithms?

Pick what you want to compare (e.g., Huffman vs. LZW or RLE vs. LZW) and follow a repeatable experiment plan:

1. Datasets: choose varied inputs—highly redundant text, low-redundancy text, simple images, photos, audio—so you test EK DAT-1.D.3.
2. Implement or use libraries for both algorithms (or pseudocode if required by the AP task). Make sure lossless algorithms can reconstruct exactly (EK DAT-1.D.4).
3. Measurements to record per file: original size (bits), compressed size (bits), compression ratio = original/compressed, runtime (ms) for compression/decompression, and for lossy methods measure fidelity (PSNR or visual/aural artifacts) per EK DAT-1.D.5–D.6.
4. Repeat runs and average times; test reconstruction equality for lossless.
5. Summarize tradeoffs: size vs. fidelity vs. time, and pick the best algorithm per context (quality-critical → lossless; size-critical → lossy) per EK DAT-1.D.7–D.8.
6. For AP-style reporting, include algorithms used, datasets, metrics, sample code snippets, and test results (graphs/tables) to justify your conclusion.

For step-by-step study and examples, see the Topic 2.2 study guide (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and practice problems (https://library.fiveable.me/practice/ap-computer-science-principles). Fiveable's study guides and practice questions are useful for exam-style framing.
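One way such an experiment might look in Python, using two lossless compressors that ship with the standard library (zlib, which implements DEFLATE, and bz2, which is built on the Burrows–Wheeler transform). The datasets here are just placeholders for your own test files:

    import bz2
    import time
    import zlib

    def measure(name, compress, decompress, data):
        """Time one compressor on one input and check it restores the data exactly."""
        start = time.perf_counter()
        packed = compress(data)
        elapsed_ms = (time.perf_counter() - start) * 1000
        assert decompress(packed) == data          # lossless: exact reconstruction
        ratio = len(data) / len(packed)
        print(f"{name}: {len(data)} -> {len(packed)} bytes, "
              f"ratio {ratio:.2f}:1, {elapsed_ms:.2f} ms")

    # Placeholder test inputs: vary redundancy to see how each algorithm behaves.
    datasets = {
        "highly redundant": b"FIVEABLE! " * 500,
        "english-ish text": b"the quick brown fox jumps over the lazy dog " * 100,
    }

    for label, data in datasets.items():
        print(f"-- {label} --")
        measure("zlib (DEFLATE)", zlib.compress, zlib.decompress, data)
        measure("bz2 (BWT-based)", bz2.compress, bz2.decompress, data)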

What are some real examples of lossless compression algorithms I should know for the test?

Good question—for the AP exam you should know several common lossless algorithms and where they're used:

- Run-Length Encoding (RLE)—simple, good for repeated-symbol data (like simple bitmaps).
- Huffman coding—variable-length prefix codes based on symbol frequency (important for entropy/redundancy ideas).
- Arithmetic coding—like Huffman but can get closer to Shannon entropy limits.
- Lempel-Ziv family (LZ77, LZ78) and LZW—dictionary-based, used in GIF; compresses streaming data well.
- DEFLATE—combines LZ77 + Huffman (used in ZIP and PNG), a practical example to cite.
- Burrows–Wheeler Transform (BWT)—used inside bzip2 as a block transform before other lossless steps.

On the exam, be ready to compare lossless vs. lossy (EK DAT-1.D.4–D.6) and pick lossless when exact reconstruction matters. For a quick review, check the Topic 2.2 study guide (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and practice problems (https://library.fiveable.me/practice/ap-computer-science-principles).

I don't understand how fewer bits can still mean the same amount of information - isn't that contradictory?

It’s not contradictory—“fewer bits” can still carry the same information because compression removes redundancy or changes representation, not the actual meaning. Lossless algorithms (like Huffman, LZW, or run-length) find patterns and replace repeated or predictable parts with shorter codes so you can reconstruct the original exactly (EK DAT-1.D.4, EK DAT-1.D.2). Lossy compression (JPEG, MP3) removes information that’s perceptually unimportant, so files shrink more but you only get an approximation back (EK DAT-1.D.5–D.6). Think of texting: “see you later” → “cu l8r”—fewer characters, same message because common patterns and sounds are encoded more efficiently. The amount you can shrink data depends on redundancy and the algorithm (EK DAT-1.D.3). For AP exam focus, compare lossless vs. lossy tradeoffs (EK DAT-1.D.7–D.8) and know examples/terms like entropy, compression ratio, and bitrate. Review the Topic 2.2 study guide on Fiveable (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and try practice problems (https://library.fiveable.me/practice/ap-computer-science-principles).

When would you choose lossy over lossless compression in an AP exam question?

Choose lossy when minimizing file size or transmission time is the main goal and an approximate reconstruction is acceptable—typically for photos, audio, or video where perceptual coding removes details people won’t notice (think JPEG, MP3). CED cues: EK DAT-1.D.6 and EK DAT-1.D.8 say lossy usually yields much greater size reduction; EK DAT-1.D.5 warns you only get an approximation and may introduce artifacts. Choose lossless when you must perfectly reconstruct the original data (text, executable files, some scientific images) per EK DAT-1.D.4 and EK DAT-1.D.7. On the AP exam (LO DAT-1.D), justify your choice by comparing compression ratio, bitrate, and acceptable quality trade-offs (artifacts vs. storage/transmission gains). Use keywords like perceptual coding, redundancy, and artifacts in your explanation. For a quick refresher, check the Topic 2.2 study guide (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and practice questions (https://library.fiveable.me/practice/ap-computer-science-principles).

What's the difference between compression ratio and compression rate?

Compression ratio and compression rate are different measures:

- Compression ratio describes how much smaller data becomes. It's usually given as original_size : compressed_size (or sometimes as compressed_size/original_size). Example: a 10 MB file compressed to 2 MB has a ratio of 10:2 (often reported as 5:1)—you've reduced size by 5×. This links directly to redundancy and algorithm choice (lossless vs. lossy) in the CED (EK DAT-1.D.1–D.6).
- Compression rate refers to speed or the data rate after/while compressing. It can mean (a) how fast an algorithm compresses/decompresses (e.g., MB/s) or (b) the resulting bitrate for streaming media (e.g., kbps). In AP terms, think of "bitrate" when discussing lossy formats like MP3 or JPEG (keywords in the CED).

On the exam, compare algorithms using both: ratio for storage savings and rate for transmission/time trade-offs (LO DAT-1.D). For a quick review, see the Topic 2.2 study guide (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and Unit 2 overview (https://library.fiveable.me/ap-computer-science-principles/unit-2). For extra practice, check Fiveable's practice problems (https://library.fiveable.me/practice/ap-computer-science-principles).

How do I trace through a simple compression algorithm step by step?

Pick a simple lossless algorithm like run-length encoding (RLE) and walk every step with the data in front of you. Example: input = AAABBBCC

1. Start at the first character: A. Count repeats: 3 As → emit "3A".
2. Move to the next new character: B. Count repeats: 3 Bs → emit "3B".
3. Next: C. Count repeats: 2 Cs → emit "2C".
4. End. Encoded output = 3A3B2C.

When you trace, write the current symbol, running count, and output after each change. That shows how redundancy (long runs) makes RLE effective (fewer symbols → smaller size). Compare by computing the compression ratio or bits before/after. For AP, practice tracing both lossless (RLE, Huffman, LZW) and lossy examples, and explain tradeoffs (EK DAT-1.D.3–DAT-1.D.6). For more examples and practice problems, check the Topic 2.2 study guide (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and thousands of practice questions (https://library.fiveable.me/practice/ap-computer-science-principles).
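If you'd like the computer to print the trace for you, here's a small sketch (plain Python, nothing AP-specific) that logs the symbol, run length, and output after each step:

    def rle_trace(text):
        """Print the state after each run so you can follow the trace by hand."""
        output = ""
        i = 0
        while i < len(text):
            char = text[i]
            count = 1
            while i + count < len(text) and text[i + count] == char:
                count += 1
            output += f"{count}{char}"
            print(f"symbol={char!r}  run length={count}  output so far={output!r}")
            i += count
        return output

    rle_trace("AAABBBCC")
    # symbol='A'  run length=3  output so far='3A'
    # symbol='B'  run length=3  output so far='3A3B'
    # symbol='C'  run length=2  output so far='3A3B2C'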

Why does my compression algorithm sometimes make files bigger instead of smaller?

Short answer: your compressor can make files bigger when the data has little redundancy (high entropy) or when the compression method adds more overhead than it removes. Why: lossless algorithms (Huffman, LZW, run-length encoding) exploit redundancy—repeated patterns or predictable symbols—to cut bits (EK DAT-1.D.3, DAT-1.D.4). If the input is already random or already compressed (JPEG, MP3, PNG), there’s no redundancy to remove, so added headers, dictionaries or codebooks make the file larger. Also small files can grow because fixed metadata and block/dictionary startup costs dominate. Lossy algorithms can shrink more, but they change data (EK DAT-1.D.5–D.7) and aren’t an option when exact reconstruction is required. For AP review, remember to compare algorithms by redundancy, overhead, and whether lossless vs lossy is acceptable (CED Topic 2.2). More practice: see the Topic 2.2 study guide (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and extra problems (https://library.fiveable.me/practice/ap-computer-science-principles).
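You can see this with Python's built-in zlib standing in for "your compressor": a tiny input, random bytes, and already-compressed data all come out larger, because the fixed header and checksum overhead outweighs any savings.

    import os
    import zlib

    # Hedged demo: zlib adds a small header and checksum, so inputs with no
    # redundancy (random bytes, already-compressed data) come out larger.
    tiny = b"hi"
    random_bytes = os.urandom(10_000)
    already_compressed = zlib.compress(b"FIVEABLE! " * 1000)

    for label, data in [("tiny file", tiny),
                        ("random bytes", random_bytes),
                        ("already-compressed data", already_compressed)]:
        packed = zlib.compress(data)
        print(f"{label}: {len(data)} bytes -> {len(packed)} bytes")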

What are the trade-offs between lossless and lossy compression that I need to memorize?

Memorize these core trade-offs so you can pick the right algorithm on the exam (LO DAT-1.D):

- Guarantee vs. size: Lossless = perfect reconstruction (EK DAT-1.D.4) but smaller reductions; lossy = only an approximation (EK DAT-1.D.5) and usually much higher compression (EK DAT-1.D.6).
- Quality vs. bandwidth/storage: Lossless keeps full quality (choose when reconstruction is crucial: text, code, legal/medical files—EK DAT-1.D.7). Lossy sacrifices some quality for much smaller files and faster transmission (good for photos, audio, video—EK DAT-1.D.8).
- Artifacts vs. fidelity: Lossy can introduce perceptual artifacts (JPEG, MP3) and may remove data you can't get back; lossless (PNG, ZIP) preserves entropy and redundancy so you can fully restore the original bits.
- Choose based on redundancy and algorithm type (dictionary-based, RLE, Huffman, perceptual coding) and the required compression ratio/bitrate (EK DAT-1.D.3).

Review Fiveable's Topic 2.2 study guide (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and practice questions (https://library.fiveable.me/practice/ap-computer-science-principles) to drill examples.

Can you give me examples of when quality vs file size matters for choosing compression types?

Choose lossless when you must reconstruct the exact original data; choose lossy when smaller size or faster transmission matters and small errors are acceptable. Examples:

- Photos for professional printing or medical images → lossless (PNG, TIFF, or raw): quality is critical, no artifacts allowed (EK DAT-1.D.4, EK DAT-1.D.7).
- Web photos, social media, thumbnails → lossy (JPEG): big size reduction with perceptual coding; some artifacts are okay to speed page loads (EK DAT-1.D.5, EK DAT-1.D.8).
- Music masters or archival audio → lossless (FLAC, WAV): exact audio needed.
- Streaming music/podcasts → lossy (MP3, AAC): much smaller bitrate, acceptable approximation for listeners.
- Text, code, spreadsheets → lossless (ZIP, LZW): any change breaks usability.

On the AP exam you may be asked to compare algorithms by trade-offs (compression ratio, redundancy, artifacts). For more review see the Topic 2.2 study guide (https://library.fiveable.me/ap-computer-science-principles/unit-2/data-compression/study-guide/21yLa92Ec2potY7nGQfu) and practice problems (https://library.fiveable.me/practice/ap-computer-science-principles).