Fiveable

Computational Biology Unit 6 Review

6.2 Protein sequence analysis and motif discovery

Written by the Fiveable Content Team • Last updated September 2025

Protein sequence analysis and motif discovery are core tools in proteomics. They reveal conserved patterns and functional regions within proteins, which in turn show how proteins evolved, what they do, and how they interact.

By analyzing sequences and identifying motifs, researchers can predict protein structures, functions, and potential modifications. These predictions support drug discovery, the study of disease mechanisms, and the development of targeted therapies.

Sequence Alignment and Homology Searching

Pairwise and Multiple Sequence Alignment

  • Sequence alignment identifies regions of similarity between two or more protein sequences, indicating potential functional, structural, or evolutionary relationships
  • Pairwise sequence alignment compares two sequences at a time, finding the best-matching alignment between them
    • Needleman-Wunsch algorithm performs global alignment, aligning entire sequences from end to end
    • Smith-Waterman algorithm performs local alignment, identifying the best-matching subsequences within the sequences
  • Multiple sequence alignment (MSA) simultaneously aligns three or more sequences
    • Identifies conserved regions, motifs, and evolutionary relationships across a set of related proteins
    • Popular MSA algorithms include ClustalW, MUSCLE, and T-Coffee, each with different strategies for optimizing the alignment
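
The difference between global and local alignment is easiest to see in the dynamic programming itself. The sketch below computes only the optimal scores (no traceback) and assumes a toy scoring scheme of match = 1, mismatch = -1, and a linear gap penalty of -2. Those values and the function names are illustrative assumptions; real protein alignments typically use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score: both sequences are aligned end to end."""
    # First row: aligning an empty prefix of `a` against prefixes of `b` costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: prefix of `a` against empty prefix of `b`
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Local alignment score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere, which makes it local.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(needleman_wunsch_score("ACD", "AD"))
print(smith_waterman_score("MKWVTF", "GGKWVTGG"))
```

The only structural differences are the initialization (gap penalties versus zeros), the floor at zero in each cell, and where the final score is read (bottom-right corner versus the best cell anywhere).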

Homology Searching with BLAST

  • Homology searching finds sequences in a database that are related to a query sequence
    • Helps identify potential homologs (sequences with shared ancestry), including orthologs (homologs separated by a speciation event, typically found in different species) and paralogs (homologs that arose by gene duplication, whether within one genome or across species)
  • BLAST (Basic Local Alignment Search Tool) is a widely used algorithm for homology searching
    • Compares a query sequence against a database of sequences, returning statistically significant matches based on sequence similarity
    • Breaks the query sequence into short words, finds database words that match them exactly or score above a threshold against them, and extends these seed hits in both directions to build a local alignment
  • Interpreting BLAST results involves understanding key parameters:
    • E-value: the expected number of matches that would occur by chance in a database of the given size (lower E-values indicate more significant matches)
    • Bit score: the raw alignment score normalized by the statistical parameters of the scoring system, so that scores can be compared across searches that use different scoring matrices (higher bit scores indicate better alignments)
    • Percent identity: the proportion of identical residues in the aligned region (higher percent identity suggests closer evolutionary relationships)
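
The link between raw score, bit score, and E-value can be made concrete with the Karlin-Altschul formulas: the bit score is S' = (λS − ln K) / ln 2, and the E-value is E = m · n · 2^(−S'), where m and n are the query and database lengths. The sketch below is a simplification: BLAST uses effective (edge-corrected) lengths, and the λ and K values shown are illustrative assumptions, since the real values depend on the scoring matrix and gap penalties.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score to bits: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits scoring at least `bits`: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameters only (assumed, not taken from a real search).
lam, k = 0.267, 0.041
bits = bit_score(raw_score=120, lam=lam, k=k)
print(round(bits, 1))
print(e_value(bits, query_len=300, db_len=5_000_000))
print(e_value(bits, query_len=300, db_len=10_000_000))  # doubling the database doubles E
```

This is why the same alignment can be significant in a small database and unremarkable in a large one: the bit score stays fixed, but the E-value scales with the size of the search space.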

Conserved Domains and Motifs in Proteins

Identifying Conserved Domains

  • Conserved domains are regions of a protein sequence that have remained relatively unchanged throughout evolution, often due to their functional or structural importance
  • Domain databases contain curated alignments and models of known protein domains, facilitating their identification in query sequences
    • Pfam: a comprehensive database of protein families and domains, based on hidden Markov models (HMMs)
    • SMART: a database of signaling and extracellular domains, as well as repeats and low-complexity regions
    • CDD: the Conserved Domain Database, integrating domain information from multiple sources, including Pfam and SMART
  • Sequence logos visualize conservation patterns in a multiple sequence alignment
    • Display the relative frequency and conservation of amino acids at each position in the alignment
    • Help identify conserved regions and potential functional sites (catalytic residues, binding sites, or structurally important positions)
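
The total height of each stack in a sequence logo is commonly computed as log2(20) minus the Shannon entropy of the column, in bits. The sketch below assumes an alignment given as equal-length strings with "-" for gaps, ignores gaps when counting, and omits the small-sample correction that logo tools usually apply; the function name is hypothetical.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=20):
    """Return the logo stack height (bits) for each column of an alignment."""
    heights = []
    for column in zip(*alignment):  # iterate over columns, not sequences
        residues = [r for r in column if r != "-"]
        if not residues:
            heights.append(0.0)  # an all-gap column carries no information
            continue
        total = len(residues)
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in Counter(residues).values())
        heights.append(math.log2(alphabet_size) - entropy)
    return heights


alignment = ["ACD", "AED", "AFD", "AGD"]
print([round(h, 2) for h in column_information(alignment)])
```

In this toy alignment the first and last columns are invariant and reach the maximum of log2(20) bits, while the middle column, with four different residues, scores two bits lower.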

Short Linear Motifs and Regular Expressions

  • Short linear motifs (SLiMs) are short, conserved patterns of amino acids that often serve as recognition sites for protein-protein interactions, post-translational modifications, or subcellular localization signals
    • Typically 3-10 amino acids in length, with a few highly conserved positions that define the motif's function
    • Examples include the PDZ-binding motif (S/T-X-V/L), the nuclear localization signal (K-K/R-X-K/R), and the SUMO-interacting motif (V/I-X-V/I-V/I)
  • Regular expressions define and search for specific sequence patterns or motifs in proteins
    • Use wildcards, position-specific constraints, and repetitions to describe a motif of interest
    • Example: [RK]-X-[ST]-X-[LIVMF] describes a potential phosphorylation site, where X represents any amino acid
  • Motif databases catalog known short linear motifs and their associated functions
    • ELM (Eukaryotic Linear Motif): a database of experimentally validated motifs, organized by functional categories (ligand-binding, targeting, cleavage, etc.)
    • MiniMotif: a database of minimotifs (short functional motifs), including their sequence patterns, binding partners, and biological roles
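
A pattern written in the hyphen-separated style used above can be turned into a regular expression and scanned against a sequence with Python's standard `re` module. The sketch below assumes a simplified syntax in which elements are separated by hyphens and X means any residue; it does not handle repetition counts or anchors, and the function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Convert e.g. '[RK]-X-[ST]-X-[LIVMF]' into a compiled regular expression."""
    parts = ["." if p.upper() == "X" else p for p in pattern.split("-")]
    # Wrapping the pattern in a lookahead lets overlapping matches be reported.
    return re.compile("(?=(" + "".join(parts) + "))")


def find_motif(sequence, pattern):
    """Return (1-based start position, matched residues) for every occurrence."""
    regex = pattern_to_regex(pattern)
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


print(find_motif("GGRASALGG", "[RK]-X-[ST]-X-[LIVMF]"))
print(find_motif("KRSTLLL", "[RK]-X-[ST]-X-[LIVMF]"))
```

Because such short patterns match by chance quite often, a hit is best treated as a candidate site to check against conservation, structural context, or experimental evidence.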

Predicting Post-Translational Modifications

Common Types of Post-Translational Modifications

  • Post-translational modifications (PTMs) are chemical modifications of amino acids that occur after protein synthesis, often regulating protein function, localization, and interactions
  • Phosphorylation: the addition of a phosphate group to serine, threonine, or tyrosine residues
    • Regulates protein activity, protein-protein interactions, and signaling cascades
    • Prediction tools: NetPhos (predicts phosphorylation sites using neural networks), GPS (Group-based Prediction System, uses a hierarchical algorithm to predict kinase-specific phosphorylation sites)
  • Glycosylation: the attachment of carbohydrate molecules to specific amino acids
    • Affects protein folding, stability, and recognition
    • N-linked glycosylation occurs on asparagine residues in the sequence context Asn-X-Ser/Thr (where X is any amino acid except proline)
    • O-linked glycosylation occurs on serine or threonine residues, with no specific sequence motif
    • Prediction tools: NetNGlyc (predicts N-glycosylation sites), NetOGlyc (predicts O-glycosylation sites)
  • Ubiquitination: the covalent attachment of ubiquitin (a small regulatory protein) to lysine residues
    • Targets proteins for degradation by the proteasome, regulates protein turnover and signaling
    • Prediction tools: UbPred (predicts ubiquitination sites using a random forest classifier), BDM-PUB (predicts ubiquitination sites using a Bayesian discriminant method)
  • Methylation and acetylation: methylation occurs mainly on lysine and arginine residues, while acetylation occurs mainly on lysine residues and protein N-termini
    • Regulate gene expression, protein-protein interactions, and protein stability
    • Prediction tools: GPS-MSP (predicts methylation sites), ASEB (predicts acetyltransferase-specific lysine acetylation sites using an enrichment-based method)
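
Of the modifications above, N-linked glycosylation has the most clearly defined sequence context, so it makes a good example of a simple motif scan. The sketch below finds every Asn-X-Ser/Thr sequon where X is not proline and reports the 1-based position of the asparagine. The function name is hypothetical, and a sequon is only a candidate: whether it is actually glycosylated depends on folding, accessibility, and cellular location, which is what dedicated predictors try to model.

```python
import re

# N, then any residue except P, then S or T; the lookahead allows overlapping sequons.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_n_glycosylation_sequons(sequence):
    """Return 1-based positions of Asn residues in Asn-X-Ser/Thr sequons (X != Pro)."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence)]


# N2 (NGS) matches, N6 (NPT) is excluded by the proline, N9 (NNT) and N10 (NTS) overlap.
print(find_n_glycosylation_sequons("MNGSANPTNNTS"))
```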

Functional Site Prediction

  • Functional site prediction tools aim to identify regions of a protein that are important for its function, such as catalytic sites, ligand-binding sites, or protein-protein interaction interfaces
  • ConSurf: a tool that uses evolutionary conservation to predict functionally important regions in a protein
    • Aligns the query sequence with homologs from different species and calculates a conservation score for each position
    • Highly conserved positions are likely to be functionally important, while variable positions are less likely to be essential
  • FunFOLD: a tool that combines sequence conservation and structural information to predict functional sites
    • Uses a template-based approach, comparing the query protein to structurally similar proteins with known functional sites
    • Transfers the functional site information from the templates to the query protein, based on sequence and structural similarity

De Novo Motif Discovery in Proteins

Principles and Algorithms for De Novo Motif Discovery

  • De novo motif discovery identifies novel, overrepresented sequence patterns in a set of proteins without relying on prior knowledge of known motifs
  • Motif discovery algorithms search for patterns that occur more frequently in a set of sequences than would be expected by chance, often using statistical measures to assess significance
  • MEME (Multiple EM for Motif Elicitation) suite: a widely used tool for de novo motif discovery
    • Uses an expectation-maximization algorithm to identify overrepresented sequence patterns in a set of proteins
    • Iteratively refines the motif model by optimizing the likelihood of the data given the motif
  • Gibbs sampling: a probabilistic approach for de novo motif discovery
    • Uses a sampling method to identify overrepresented patterns by iteratively updating the motif model and the alignment of sequences to the motif
    • Tools like AlignACE (designed for DNA sequences) and GibbsCluster (designed for peptide data) implement this approach
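
The expectation-maximization idea can be sketched in a few lines under strong simplifying assumptions: exactly one motif occurrence per sequence, a fixed motif width, a uniform background, and a single starting point given as a seed k-mer. This is a toy illustration and not MEME's implementation, which tries many starting points, supports other occurrence models, and reports statistical significance. All names here are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def em_motif_search(sequences, seed, iterations=20, pseudocount=0.1):
    """Toy EM for one motif occurrence per sequence; returns (starts, pwm).

    Assumes iterations >= 1 and sequences made of the 20 standard residues,
    each at least as long as the seed.
    """
    width = len(seed)
    background = 1.0 / len(AMINO_ACIDS)
    # Initial model: each seed residue gets half the probability mass at its position.
    pwm = [{aa: (0.5 if aa == s else 0.5 / (len(AMINO_ACIDS) - 1))
            for aa in AMINO_ACIDS} for s in seed]
    for _ in range(iterations):
        # E-step: posterior probability of each start position in each sequence.
        posteriors = []
        for seq in sequences:
            weights = []
            for start in range(len(seq) - width + 1):
                w = 1.0
                for offset in range(width):
                    w *= pwm[offset][seq[start + offset]] / background
                weights.append(w)
            total = sum(weights)
            posteriors.append([w / total for w in weights])
        # M-step: re-estimate the motif model from posterior-weighted residue counts.
        counts = [{aa: pseudocount for aa in AMINO_ACIDS} for _ in range(width)]
        for seq, post in zip(sequences, posteriors):
            for start, p in enumerate(post):
                for offset in range(width):
                    counts[offset][seq[start + offset]] += p
        pwm = [{aa: c[aa] / sum(c.values()) for aa in AMINO_ACIDS} for c in counts]
    # Report the most probable start (0-based) in each sequence.
    starts = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    return starts, pwm


seqs = ["AAGHCWYLKE", "MHCWYDDSAG", "TTPLKHCWYQ", "GSHCWYAAEE"]
starts, pwm = em_motif_search(seqs, seed="HCWY")
print(starts)
```

A Gibbs sampler uses the same model but, instead of weighting every position by its posterior, samples one position per sequence at each step, which helps it escape poor local optima.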

Challenges and Evaluation of De Novo Motif Discovery

  • Challenges in de novo motif discovery include:
    • Distinguishing true motifs from random patterns that occur by chance
    • Determining the appropriate motif length and number of occurrences in the dataset
    • Assessing the biological significance of discovered motifs
  • Evaluating the quality of discovered motifs often involves measures such as:
    • Information content (IC): quantifies the conservation of a motif as the relative entropy, in bits, between the observed amino acid frequencies at each position and the background frequencies, summed over all positions
    • E-value: assesses the statistical significance of a motif's occurrence in the dataset, representing the expected number of motifs with a similar score that would occur by chance
  • Experimental validation is crucial to confirm the biological relevance of discovered motifs
    • Site-directed mutagenesis: introduces specific mutations in the motif to assess its functional importance
    • Binding assays: test the ability of the motif to interact with its predicted binding partners
    • Functional studies: investigate the role of the motif in the protein's cellular function or localization
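
The information content measure listed above can be computed directly from a motif's position-specific frequencies. The sketch below assumes the motif is given as a list of dictionaries mapping residues to frequencies, with a background dictionary covering the same residues; the names and the toy frequencies are illustrative assumptions.

```python
import math


def motif_information_content(pwm, background):
    """Sum over positions of sum_a p(a) * log2(p(a) / q(a)), in bits."""
    total = 0.0
    for column in pwm:
        for residue, p in column.items():
            if p > 0:  # a zero-frequency residue contributes nothing
                total += p * math.log2(p / background[residue])
    return total


# Toy two-letter example: position 1 is invariant, position 2 mirrors the background.
background = {"A": 0.25, "C": 0.75}
pwm = [{"A": 1.0, "C": 0.0}, {"A": 0.25, "C": 0.75}]
print(motif_information_content(pwm, background))
```

A position that simply mirrors the background contributes zero bits, while a position fixed on a rare residue contributes the most, which is why a conserved tryptophan or cysteine weighs more than a conserved leucine.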