💻Computational Biology Unit 6 Review

6.2 Protein sequence analysis and motif discovery

Written by the Fiveable Content Team • Last updated September 2025

Protein sequence analysis and motif discovery are core methods in proteomics. They reveal conserved patterns and functional regions within proteins, which in turn inform hypotheses about protein evolution, function, and interactions.

By analyzing sequences and identifying motifs, researchers can predict protein structures, functions, and likely modification sites. These predictions support drug discovery, the study of disease mechanisms, and the development of targeted therapies.

Sequence Alignment and Homology Searching

Pairwise and Multiple Sequence Alignment

Sequence alignment identifies regions of similarity between two or more protein sequences, indicating potential functional, structural, or evolutionary relationships
Pairwise sequence alignment compares two sequences at a time, finding the best-matching alignment between them
- Needleman-Wunsch algorithm performs global alignment, aligning entire sequences from end to end
- Smith-Waterman algorithm performs local alignment, identifying the best-matching subsequences within the sequences
Multiple sequence alignment (MSA) simultaneously aligns three or more sequences
- Identifies conserved regions, motifs, and evolutionary relationships across a set of related proteins
- Popular MSA algorithms include ClustalW, MUSCLE, and T-Coffee, each with different strategies for optimizing the alignment
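
The difference between global and local pairwise alignment can be sketched with one dynamic-programming routine. This is a minimal illustration, not a production aligner: the function name `alignment_score`, the toy match/mismatch scores, and the linear gap penalty are all assumptions made for the example, and real protein alignments normally use a substitution matrix such as BLOSUM62 with affine gap penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) alignment score.

    Scoring values are illustrative assumptions, not standard defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment pays for gaps at the ends of the sequences.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + substitution,  # align two residues
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
            if local:
                # Local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell

    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else score[-1][-1]
```

With `local=False` the score covers both sequences end to end; with `local=True` it reports only the best-matching subsequence pair, so unrelated flanking residues do not lower the score.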

Homology Searching with BLAST

Homology searching finds sequences in a database that are related to a query sequence
- Helps identify potential homologs (sequences with shared ancestry), which include orthologs (homologs that diverged through a speciation event and so are found in different species) and paralogs (homologs that arose by gene duplication, often within the same genome)
BLAST (Basic Local Alignment Search Tool) is a widely used algorithm for homology searching
- Compares a query sequence against a database of sequences, returning statistically significant matches based on sequence similarity
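- Breaks the query sequence into short segments (words), finds database matches to those words (for proteins, any word scoring above a threshold under the substitution matrix, not only exact matches), and extends the matches into longer local alignments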
Interpreting BLAST results involves understanding key parameters:
- E-value: the expected number of matches that would occur by chance in a database of the given size (lower E-values indicate more significant matches)
- Bit score: the raw alignment score normalized by the statistical parameters of the scoring system, so that scores are comparable across searches that use different scoring matrices (higher bit scores indicate better alignments)
- Percent identity: the proportion of identical residues in the aligned region (higher percent identity suggests closer evolutionary relationships)
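
The three parameters above are related by simple formulas, sketched here under the Karlin-Altschul model. The function names are hypothetical, and the statistical parameters `lam` and `k` must be supplied by the caller because they depend on the scoring matrix and gap penalties. Real BLAST also uses effective (edge-corrected) query and database lengths, which this sketch ignores.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)

def percent_identity(aligned_a, aligned_b):
    """Identical residues as a percentage of alignment columns (gap columns included)."""
    pairs = list(zip(aligned_a, aligned_b))
    identical = sum(1 for x, y in pairs if x == y and x != "-")
    return 100.0 * identical / len(pairs)
```

Because the E-value scales with the search space (query length times database length), the same alignment becomes less significant when the database grows, while its bit score stays the same.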

Conserved Domains and Motifs in Proteins

Identifying Conserved Domains

Conserved domains are regions of a protein sequence that have remained relatively unchanged throughout evolution, often due to their functional or structural importance
Domain databases contain curated alignments and models of known protein domains, facilitating their identification in query sequences
- Pfam: a comprehensive database of protein families and domains, based on hidden Markov models (HMMs)
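- SMART (Simple Modular Architecture Research Tool): a database focused on domains found in signaling, extracellular, and chromatin-associated proteins, which also annotates repeats and low-complexity regions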
- CDD: the Conserved Domain Database, integrating domain information from multiple sources, including Pfam and SMART
Sequence logos visualize conservation patterns in a multiple sequence alignment
- Display the relative frequency and conservation of amino acids at each position in the alignment
- Help identify conserved regions and potential functional sites (catalytic residues, binding sites, or structurally important positions)
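
The letter heights in a sequence logo can be computed directly from an alignment. The sketch below assumes equal-length aligned sequences with `-` as the gap character and a 20-letter alphabet; the function name `logo_heights` is hypothetical, and the small-sample correction that logo software usually applies is omitted.

```python
import math
from collections import Counter

def logo_heights(alignment, alphabet_size=20):
    """Per-column letter heights (bits) for a sequence logo.

    Column height = log2(alphabet_size) - Shannon entropy of the column;
    each letter's height is its frequency times the column height.
    """
    columns = []
    for c in range(len(alignment[0])):
        residues = [seq[c] for seq in alignment if seq[c] != "-"]
        if not residues:
            columns.append({})  # all-gap column carries no information
            continue
        counts = Counter(residues)
        freqs = {aa: n / len(residues) for aa, n in counts.items()}
        entropy = -sum(p * math.log2(p) for p in freqs.values())
        info = math.log2(alphabet_size) - entropy
        columns.append({aa: p * info for aa, p in freqs.items()})
    return columns
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, while a column split evenly between two residues carries one bit less, shared equally between the two letters.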

Short Linear Motifs and Regular Expressions

Short linear motifs (SLiMs) are short, conserved patterns of amino acids that often serve as recognition sites for protein-protein interactions, post-translational modifications, or subcellular localization signals
- Typically 3-10 amino acids in length, with a few highly conserved positions that define the motif's function
- Examples include the C-terminal PDZ-binding motif (S/T-X-V/L), the monopartite nuclear localization signal (K-K/R-X-K/R), and the SUMO-interacting motif (V/I-X-V/I-V/I), where X is any amino acid
Regular expressions define and search for specific sequence patterns or motifs in proteins
- Use wildcards, position-specific constraints, and repetitions to describe a motif of interest
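- Example: the PROSITE-style pattern [RK]-X-[ST]-X-[LIVMF], where X represents any amino acid and brackets list the residues allowed at a position, describes a candidate phosphorylation site; the equivalent regular expression is [RK].[ST].[LIVMF]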
Motif databases catalog known short linear motifs and their associated functions
- ELM (Eukaryotic Linear Motif): a database of experimentally validated motifs, organized by functional categories (ligand-binding, targeting, cleavage, etc.)
- MiniMotif: a database of minimotifs (short functional motifs), including their sequence patterns, binding partners, and biological roles
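
A dash-separated pattern of the kind shown above can be translated into a regular expression and scanned against a sequence. This is a minimal sketch: the helper names are hypothetical, only bracketed residue sets and X wildcards are handled, and the test sequence is made up. A match marks a candidate site only, since short patterns occur frequently by chance.

```python
import re

def pattern_to_regex(pattern):
    """Convert a pattern such as [RK]-X-[ST]-X-[LIVMF] into a compiled regex."""
    parts = []
    for element in pattern.split("-"):
        # X (or x) stands for any residue; everything else is passed through.
        parts.append("." if element.upper() == "X" else element)
    # The lookahead wrapper lets overlapping occurrences all be reported.
    return re.compile("(?=(" + "".join(parts) + "))")

def find_motif(pattern, sequence):
    """Return (1-based start position, matched residues) for every occurrence."""
    regex = pattern_to_regex(pattern)
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# Usage on a made-up sequence:
hits = find_motif("[RK]-X-[ST]-X-[LIVMF]", "MARASALKGTDV")
```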

Predicting Post-Translational Modifications

Common Types of Post-Translational Modifications

Post-translational modifications (PTMs) are chemical modifications of amino acids that occur after protein synthesis, often regulating protein function, localization, and interactions
Phosphorylation: the addition of a phosphate group to serine, threonine, or tyrosine residues
- Regulates protein activity, protein-protein interactions, and signaling cascades
- Prediction tools: NetPhos (predicts phosphorylation sites using neural networks), GPS (Group-based Prediction System, uses a hierarchical algorithm to predict kinase-specific phosphorylation sites)
Glycosylation: the attachment of carbohydrate molecules to specific amino acids
- Affects protein folding, stability, and recognition
- N-linked glycosylation occurs on asparagine residues in the sequence context Asn-X-Ser/Thr (where X is any amino acid except proline)
- O-linked glycosylation occurs on serine or threonine residues, with no specific sequence motif
- Prediction tools: NetNGlyc (predicts N-glycosylation sites), NetOGlyc (predicts O-glycosylation sites)
Ubiquitination: the covalent attachment of ubiquitin (a small regulatory protein) to lysine residues
- Targets proteins for degradation by the proteasome, regulates protein turnover and signaling
- Prediction tools: UbPred (predicts ubiquitination sites using a random forest classifier), BDM-PUB (predicts ubiquitination sites with a Bayesian discriminant method)
Methylation and acetylation: methylation occurs mainly on lysine and arginine residues, while acetylation occurs mainly on lysine residues and protein N-termini
- Regulate gene expression, protein-protein interactions, and protein stability
- Prediction tools: GPS-MSP (Methyl-group Specific Predictor, predicts methylation sites), ASEB (Acetylation Set Enrichment Based method, predicts acetyltransferase-specific lysine acetylation sites)
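
Of the modifications above, N-linked glycosylation has the clearest sequence rule, so its Asn-X-Ser/Thr context can be scanned with a short pattern. The sketch below only finds candidate sequons; whether a site is actually glycosylated also depends on factors such as accessibility and cellular compartment, which is why dedicated predictors exist. The function name and example sequence are made up.

```python
import re

# N, then any residue except proline, then S or T.
# The lookahead lets overlapping sequons (e.g. N-N-T-S) all be found.
SEQUON = re.compile(r"(?=(N[^P][ST]))")

def find_n_glyc_sequons(sequence):
    """Return (1-based position of the Asn, matched tripeptide) for each sequon."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(sequence.upper())]

# Usage on a made-up sequence: N-P-S is skipped because X is proline.
candidates = find_n_glyc_sequons("MNGTNPSANNTS")
```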

Functional Site Prediction

Functional site prediction tools aim to identify regions of a protein that are important for its function, such as catalytic sites, ligand-binding sites, or protein-protein interaction interfaces
ConSurf: a tool that uses evolutionary conservation to predict functionally important regions in a protein
- Aligns the query sequence with homologs from different species and calculates a conservation score for each position
- Highly conserved positions are likely to be functionally important, while variable positions are less likely to be essential
FunFOLD: a tool that combines sequence conservation and structural information to predict functional sites
- Uses a template-based approach, comparing the query protein to structurally similar proteins with known functional sites
- Transfers the functional site information from the templates to the query protein, based on sequence and structural similarity

De Novo Motif Discovery in Proteins

Principles and Algorithms for De Novo Motif Discovery

De novo motif discovery identifies novel, overrepresented sequence patterns in a set of proteins without relying on prior knowledge of known motifs
Motif discovery algorithms search for patterns that occur more frequently in a set of sequences than would be expected by chance, often using statistical measures to assess significance
MEME (Multiple EM for Motif Elicitation): the core tool of the MEME Suite, widely used for de novo motif discovery
- Uses an expectation-maximization algorithm to identify overrepresented sequence patterns in a set of proteins
- Iteratively refines the motif model by optimizing the likelihood of the data given the motif
Gibbs sampling: a probabilistic approach for de novo motif discovery
- Uses a sampling method to identify overrepresented patterns by iteratively updating the motif model and the alignment of sequences to the motif
- Tools such as the Gibbs Motif Sampler and GibbsCluster implement this approach for proteins and peptides; AlignACE applies it to DNA sequences
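
The iterative update at the heart of Gibbs sampling can be sketched as a toy site sampler. It is not the algorithm of any particular tool. The sketch assumes exactly one motif occurrence per sequence, a fixed motif width, sequences written in the 20 standard amino acids, and a uniform background (so window weights are proportional to the motif-model probability). The function name and all parameter defaults are hypothetical.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def gibbs_motif_search(seqs, width, iterations=200, pseudocount=1.0, seed=0):
    """Toy Gibbs site sampler; returns one 0-based motif start per sequence."""
    rng = random.Random(seed)
    # Start from a random motif position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]

    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build the motif model from the current sites of all OTHER sequences.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
            total = len(others) + pseudocount * len(AMINO_ACIDS)
            profile = []
            for col in range(width):
                counts = Counter(site[col] for site in others)
                profile.append({aa: (counts[aa] + pseudocount) / total
                                for aa in AMINO_ACIDS})

            # Weight every window of the held-out sequence under the model.
            weights = []
            for p in range(len(seq) - width + 1):
                w = 1.0
                for col in range(width):
                    w *= profile[col][seq[p + col]]
                weights.append(w)

            # Sample (rather than maximize) the new start, proportional to weight.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Sampling a new position instead of always taking the best one is what lets the procedure escape poor starting alignments. In practice, several runs with different seeds are compared, because any single run can settle on a chance pattern.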

Challenges and Evaluation of De Novo Motif Discovery

Challenges in de novo motif discovery include:
- Distinguishing true motifs from random patterns that occur by chance
- Determining the appropriate motif length and number of occurrences in the dataset
- Assessing the biological significance of discovered motifs
Evaluating the quality of discovered motifs often involves measures such as:
- Information content (IC): quantifies the conservation of a motif, in bits, as the relative entropy between the observed amino acid frequencies at each position and the background frequencies
- E-value: assesses the statistical significance of a motif's occurrence in the dataset, representing the expected number of motifs with a similar score that would occur by chance
Experimental validation is crucial to confirm the biological relevance of discovered motifs
- Site-directed mutagenesis: introduces specific mutations in the motif to assess its functional importance
- Binding assays: test the ability of the motif to interact with its predicted binding partners
- Functional studies: investigate the role of the motif in the protein's cellular function or localization
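
The information content measure listed above can be computed per motif position as a relative entropy against background frequencies. In this sketch the function name is hypothetical, the aligned motif sites are assumed to have equal length, the caller supplies the background frequencies, and no pseudocounts or small-sample corrections are applied.

```python
import math

def motif_information_content(sites, background):
    """Relative entropy (bits) of each motif column versus background frequencies.

    sites: equal-length aligned motif occurrences.
    background: dict mapping each residue to its background frequency.
    """
    per_column = []
    for col in range(len(sites[0])):
        column = [site[col] for site in sites]
        ic = 0.0
        for aa in set(column):
            p = column.count(aa) / len(column)
            # Residues enriched over background contribute the most information.
            ic += p * math.log2(p / background[aa])
        per_column.append(ic)
    return per_column
```

Summing the per-column values gives the total information content of the motif. A conserved residue that is rare in the background scores higher than an equally conserved residue that is common everywhere.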
