2.1 N-grams and language models

๐ŸคŸ๐ŸผNatural Language Processing
Unit 2 Review

2.1 N-grams and language models

Written by the Fiveable Content Team • Last updated September 2025

N-grams are the building blocks of language models, helping predict word sequences. They capture local context by analyzing contiguous word or character groups, with higher-order n-grams providing more context but requiring more data.

Language models use n-grams to estimate word probabilities based on previous words. This approach simplifies modeling by assuming a word's likelihood depends only on its recent history, making it computationally feasible for various NLP tasks.

N-grams for Language Modeling

Understanding N-grams

  • N-grams represent contiguous sequences of n items (words or characters) extracted from a text corpus
    • n is a positive integer that determines the length of the sequence
    • Unigrams (1-grams) represent single items (words)
    • Bigrams (2-grams) represent pairs of adjacent items, such as a two-word phrase
    • Trigrams (3-grams) represent triples of adjacent items, such as a short sentence fragment
  • N-grams capture the local context and statistical dependencies between words or characters in a language
    • They provide information about the likelihood of words or characters appearing together
    • Higher-order n-grams (larger n) capture more context but require more training data
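
As a concrete illustration, here is a minimal Python sketch that extracts unigrams, bigrams, and trigrams from a whitespace-tokenized sentence; the example sentence and the helper name extract_ngrams are illustrative choices, not part of any particular library.

```python
def extract_ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace tokenization of a toy sentence
tokens = "the cat sat on the mat".split()

print(extract_ngrams(tokens, 1))  # unigrams: ('the',), ('cat',), ('sat',), ...
print(extract_ngrams(tokens, 2))  # bigrams: ('the', 'cat'), ('cat', 'sat'), ...
print(extract_ngrams(tokens, 3))  # trigrams: ('the', 'cat', 'sat'), ...
```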

Role of N-grams in Language Modeling

  • Language models assign probabilities to sequences of words or characters
    • They enable the prediction of the likelihood of a given sequence occurring in a language
    • Language models are used in various NLP tasks (machine translation, speech recognition, text generation)
  • N-gram language models estimate the probability of a word or character given the previous n-1 words or characters
    • The estimation is based on the frequency counts of n-grams in the training corpus
    • The Markov assumption simplifies language modeling by assuming that the probability of a word depends only on the previous n-1 words, rather than the entire history
    • This assumption reduces the complexity of the model and makes it computationally tractable
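
In symbols, the Markov assumption replaces the full history with only the most recent words: a bigram model approximates $P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$, and an n-gram model more generally uses $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$.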

Training N-gram Models

Counting and Probability Estimation

  • Training an n-gram language model involves counting the frequencies of n-grams in a text corpus
    • The counts are normalized to obtain probability estimates
    • The maximum likelihood estimation (MLE) is a simple method for estimating n-gram probabilities
      • MLE divides the count of an n-gram by the count of its prefix (n-1)-gram
      • Example: $P(w_n | w_{n-1}) = \frac{count(w_{n-1}, w_n)}{count(w_{n-1})}$
  • Smoothing techniques address the sparsity problem in n-gram models
    • Some n-grams may have zero or low counts in the training data, leading to unreliable probability estimates
    • Add-one smoothing (Laplace smoothing) adds a small constant (usually 1) to the count of each n-gram to avoid zero probabilities
    • Good-Turing smoothing re-estimates the count of n-grams that occur c times using the number of n-gram types that occur c+1 times, shifting probability mass toward rare and unseen events
    • Kneser-Ney smoothing considers the contexts in which n-grams occur and assigns higher probabilities to n-grams that appear in diverse contexts
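
To make the counting and smoothing steps concrete, the following Python sketch estimates bigram probabilities with MLE and with add-one (Laplace) smoothing over a tiny toy corpus; the corpus and function names are illustrative assumptions, not a reference implementation.

```python
from collections import Counter

# Toy training corpus, already tokenized (illustrative only)
corpus = "the cat sat on the mat the cat ran".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)
vocab_size = len(unigram_counts)

def mle_prob(prev, word):
    """Maximum likelihood estimate: count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def laplace_prob(prev, word):
    """Add-one smoothing: (count + 1) / (count(prev) + V) avoids zero probabilities."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

print(mle_prob("the", "cat"))      # 2/3: "the" is followed by "cat" in 2 of its 3 occurrences
print(mle_prob("the", "dog"))      # 0.0: an unseen bigram gets zero probability under MLE
print(laplace_prob("the", "dog"))  # small but non-zero once add-one smoothing is applied
```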

Handling Unseen N-grams

  • Backoff and interpolation methods combine probabilities from lower-order n-gram models when higher-order n-grams are not available or have low counts
    • Backoff methods recursively fall back to lower-order models until a non-zero probability is found
    • Interpolation methods linearly combine the probabilities from models of different orders (a short sketch follows this list)
  • Smoothing techniques also help in handling unseen n-grams by redistributing probability mass from seen n-grams to unseen ones
    • This allows the model to assign non-zero probabilities to unseen sequences
    • Smoothing helps in generalization and prevents the model from overfitting to the training data
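
As a small illustration of interpolation, the sketch below mixes a bigram estimate with a unigram estimate using a hand-picked weight; the toy corpus and the weight value are assumptions made for illustration, and in practice the weight would be tuned on held-out data.

```python
from collections import Counter

# Toy corpus and counts (illustrative; in practice these come from a large training set)
corpus = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)
total_tokens = len(corpus)

def interpolated_prob(prev, word, lam=0.7):
    """Linear interpolation: lam * P_bigram(word | prev) + (1 - lam) * P_unigram(word)."""
    p_bigram = bigram_counts[(prev, word)] / unigram_counts[prev] if unigram_counts[prev] else 0.0
    p_unigram = unigram_counts[word] / total_tokens
    return lam * p_bigram + (1 - lam) * p_unigram

print(interpolated_prob("the", "cat"))  # mixes the bigram estimate 2/3 with the unigram estimate 2/9
print(interpolated_prob("the", "dog"))  # still 0 here because "dog" is unseen even as a unigram
```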

Applying N-gram Models

Text Generation

  • N-gram language models can be used to generate text by iteratively predicting the next word or character based on the previous n-1 words or characters
    • The generation process starts with a seed or prompt and continues by sampling words or characters from the conditional probability distribution given by the n-gram model
    • The generated text follows the statistical patterns learned from the training corpus
    • The quality of the generated text depends on the order of the n-gram model and the diversity of the training data
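
The generation loop sketched below samples each next word from a bigram model's conditional distribution until an end-of-sentence marker or a length limit is reached; the toy corpus, the <s> and </s> markers, and the function names are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

# Toy corpus with sentence-boundary markers (illustrative)
sentences = [
    "<s> the cat sat on the mat </s>",
    "<s> the cat ran </s>",
    "<s> the dog sat </s>",
]
tokens = " ".join(sentences).split()

# Conditional counts: for each previous word, how often each next word follows it
next_counts = defaultdict(Counter)
for prev, word in zip(tokens, tokens[1:]):
    next_counts[prev][word] += 1

def generate(seed="<s>", max_len=10):
    """Iteratively sample the next word from P(word | previous word)."""
    out = [seed]
    while out[-1] != "</s>" and len(out) < max_len:
        candidates = next_counts[out[-1]]
        words, counts = zip(*candidates.items())
        out.append(random.choices(words, weights=counts, k=1)[0])
    return " ".join(out)

print(generate())  # e.g. "<s> the dog sat on the mat </s>"
```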

Sequence Probability Estimation

  • The probability of a sequence of words or characters can be estimated by multiplying the conditional probabilities of each word or character given its preceding context
    • The chain rule of probability allows the joint probability of a sequence to be decomposed into a product of conditional probabilities
    • Example: $P(w_1, w_2, ..., w_n) = P(w_1) \times P(w_2 | w_1) \times ... \times P(w_n | w_{n-1}, ..., w_1)$
  • N-gram models can be used to estimate the likelihood of a given sentence or to compare the probabilities of different sentences
    • This is useful in tasks like language identification, spell checking, and speech recognition
    • The probability estimates can also be used as features in other NLP models
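
The sketch below scores a sentence with a bigram model by summing log-probabilities rather than multiplying raw probabilities, which avoids numerical underflow on longer sequences; the toy counts and the use of add-one smoothing are illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus with sentence-boundary markers (illustrative)
tokens = "<s> the cat sat on the mat </s> <s> the cat ran </s>".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)
vocab_size = len(unigram_counts)

def log_prob_sentence(words):
    """Sum of log P(w_i | w_{i-1}) with add-one smoothing, following the chain rule."""
    seq = ["<s>"] + words + ["</s>"]
    total = 0.0
    for prev, word in zip(seq, seq[1:]):
        p = (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)
        total += math.log(p)
    return total

print(log_prob_sentence("the cat sat".split()))    # higher (less negative) log-probability
print(log_prob_sentence("sat the on cat".split())) # lower log-probability for an implausible word order
```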

Evaluating N-gram Model Performance

Perplexity

  • Perplexity is a commonly used metric to evaluate the quality of a language model in terms of its ability to predict sequences
    • Perplexity measures how surprised or confused a language model is by a given test set, with lower perplexity indicating better performance
    • It is calculated as the exponential of the average negative log-likelihood of a test set under the language model
  • To compute perplexity, the test set is processed item by item, and the log-probability of each word or character given its preceding n-1 items is calculated using the trained language model
    • The average negative log-likelihood is then obtained by summing these log-probabilities, negating the result, and dividing by the total number of words or characters in the test set
    • Perplexity provides an intuitive measure of the language model's uncertainty, with lower perplexity indicating better predictive power
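
Written out for a bigram model, the perplexity of a test set $W = w_1 w_2 \ldots w_N$ is $PP(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_{i-1})\right)$, which is equivalent to the inverse probability of the test set normalized by its length, $P(w_1, \ldots, w_N)^{-1/N}$.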

Other Evaluation Metrics

  • Cross-entropy measures the average number of bits needed to encode each word or character using the model's probability distribution
    • Lower cross-entropy indicates better performance and a closer match between the model's predictions and the actual data (a formula is given after this list)
  • The quality of generated text can be assessed through human evaluation
    • Metrics like coherence, fluency, and diversity of the generated samples can be rated by human annotators
    • Human evaluation provides a subjective assessment of the model's performance in generating meaningful and coherent text
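
In symbols, the cross-entropy of an n-gram model on a test set of $N$ words is $H = -\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$, and perplexity is simply $2^H$, so the two metrics always rank models in the same order.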

Limitations of N-gram Models

  • N-gram language models have limitations in capturing long-range dependencies
    • They rely on the Markov assumption and consider only the local context of n-1 words or characters
    • Long-range dependencies and global coherence are not effectively captured by n-gram models
  • The number of parameters in an n-gram model grows exponentially with increasing n
    • Higher-order models require a large amount of training data to estimate the probabilities reliably
    • The sparsity problem becomes more severe for higher-order models, as many n-grams may not be observed in the training data
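
As a rough worked example (assuming, purely for illustration, a vocabulary of 10,000 word types), there are up to $10{,}000^2 = 10^8$ possible bigrams and up to $10{,}000^3 = 10^{12}$ possible trigrams, so most higher-order n-grams will never be observed even in a very large training corpus.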