💻Computational Biology Unit 2 Review

2.3 Data formats and parsing (FASTA, FASTQ, GenBank, PDB, etc.)

Written by the Fiveable Content Team • Last updated September 2025

Biological data is stored in a handful of text formats, each built for a specific kind of information. FASTA holds nucleotide or protein sequences, FASTQ adds per-base quality scores, GenBank combines a sequence with its annotations, and PDB records the 3D coordinates of macromolecular structures. Knowing how each format is laid out is the first step in working with biological data.

Parsing these files lets researchers extract the identifiers, sequences, quality scores, and annotations an analysis needs. Python libraries such as Biopython handle the common formats, so genetic data can be read and manipulated without writing a parser from scratch. This skill underpins most computational biology and bioinformatics work.

Biological Data File Formats

Common File Formats in Biological Databases

Recognize common file formats used in biological databases
- FASTA represents nucleotide or amino acid sequences with a header line starting with ">" (DNA, RNA, protein sequences)
- FASTQ stores biological sequences and their corresponding quality scores, commonly used for high-throughput sequencing data (Illumina, PacBio)
- GenBank format is used by NCBI to store annotated nucleotide sequences, including features such as genes, regulatory elements, and translations
- PDB (Protein Data Bank) format stores 3D structural information of biological macromolecules (proteins, nucleic acids)
- Other common formats include
  - SAM/BAM (Sequence Alignment/Map format)
  - VCF (Variant Call Format)
  - GFF (General Feature Format)
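Most of these formats can be told apart by how their first line begins. The sketch below is a rough heuristic under that assumption, not a validator; the function name `guess_format` is hypothetical, and real PDB files can open with records other than the ones checked here.

```python
def guess_format(text: str) -> str:
    """Guess a file format from its first non-blank line (heuristic only)."""
    first = next((line for line in text.splitlines() if line.strip()), "")
    if first.startswith(">"):
        return "fasta"
    if first.startswith("@"):
        # SAM header lines also start with "@", followed by a two-letter
        # record type and a tab (for example "@HD\tVN:1.6").
        if first[1:3] in {"HD", "SQ", "RG", "PG", "CO"} and first[3:4] == "\t":
            return "sam"
        return "fastq"
    if first.startswith("LOCUS"):
        return "genbank"
    if first.startswith(("HEADER", "ATOM", "HETATM")):
        return "pdb"
    if first.startswith("##fileformat=VCF"):
        return "vcf"
    if first.startswith("##gff-version"):
        return "gff"
    return "unknown"


print(guess_format(">seq1 example\nACGT\n"))
print(guess_format("@read1\nACGT\n+\nIIII\n"))
```

Checking only the first line keeps the function cheap, but it means a truncated or concatenated file can be misclassified.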

FASTA, FASTQ, GenBank, and PDB Structure

FASTA Format Structure

FASTA format consists of two main components
- Header line starting with ">" followed by the sequence identifier and optional description
- Subsequent lines containing the sequence data (nucleotides or amino acids)

Example FASTA record:

>sequence_identifier optional description
ATGCTAGCTACGATCGATCGATCGATCGTAGCTAGCATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
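The two-part layout above can be parsed with a short loop. This is a minimal sketch using only the Python standard library; the function names `read_fasta` and `split_header` are hypothetical, and it assumes the identifier ends at the first space in the header.

```python
import io
from typing import Iterator, TextIO, Tuple


def split_header(header: str) -> Tuple[str, str]:
    """Split 'identifier optional description' at the first space."""
    identifier, _, description = header.partition(" ")
    return identifier, description.strip()


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


example = """>sequence_identifier optional description
ATGCTAGCTACGATCGATCGATCGATCGTAGCTAGCATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
"""
records = list(read_fasta(io.StringIO(example)))
identifier, description, sequence = records[0]
print(identifier, len(sequence))
```

Because the function is a generator, large files can be processed one record at a time instead of being loaded into memory.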

FASTQ Format Structure

FASTQ format includes four lines per record
- Header starting with "@" containing sequence identifier and optional description
- Sequence line containing the raw sequence data (nucleotides)
- Separator line starting with a "+" sign, optionally followed by a repeat of the sequence identifier
- Quality score line containing one ASCII character per base, so it must be exactly as long as the sequence line

Example FASTQ record:

@sequence_identifier optional description
ATGCTAGCTACGATCGATCGATCGATCGTAGCTAGCATCG
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**
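Each quality character encodes a Phred score as an ASCII code. The sketch below reads four-line records and decodes the scores, assuming the Phred+33 offset (used by Sanger and current Illumina output); the offset is an assumption about the input, and `read_fastq` is a hypothetical name.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (header, sequence, phred_scores) for each four-line record."""
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        # Phred score = ASCII code of the character minus the offset.
        scores = [ord(char) - offset for char in quality]
        yield header[1:], sequence, scores


example = "@read1\nACGT\n+\n!5I+\n"
for name, sequence, scores in read_fastq(io.StringIO(example)):
    print(name, sequence, scores)
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 20 means roughly a 1 in 100 chance that the base call is wrong.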

GenBank Format Structure

GenBank format is divided into fields, each starting with a specific keyword followed by the corresponding information
- LOCUS field summarizes the record: locus name, sequence length, molecule type, topology (linear or circular), GenBank division, and modification date
- DEFINITION field gives a concise description of the sequence
- ACCESSION field lists the unique identifier assigned to the sequence by GenBank
- FEATURES field contains annotations of the sequence, such as genes, coding regions, and regulatory elements

Example GenBank record snippet:

LOCUS       SCU49845                5028 bp    DNA     linear   PLN 23-MAR-2010
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds.
ACCESSION   U49845
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
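The keyword layout makes the header easy to split: in the flat-file format the keyword occupies the first 12 columns and the value follows, with continuation lines indented. A minimal sketch under that assumption (the name `parse_genbank_header` is hypothetical, and sub-keywords such as ORGANISM are simply folded into the preceding field):

```python
from typing import Dict


def parse_genbank_header(text: str) -> Dict[str, str]:
    """Collect top-level keyword fields that appear before FEATURES."""
    fields: Dict[str, str] = {}
    current = None
    for line in text.splitlines():
        if line.startswith(("FEATURES", "ORIGIN", "//")):
            break  # the annotation table and sequence are not handled here
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword and not line.startswith(" "):
            current = keyword
            fields[current] = value
        elif current is not None and value:
            fields[current] += " " + value  # indented continuation line
    return fields


snippet = """\
LOCUS       SCU49845                5028 bp    DNA     linear   PLN 23-MAR-2010
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds.
ACCESSION   U49845
FEATURES             Location/Qualifiers
     source          1..5028
"""
fields = parse_genbank_header(snippet)
locus = fields["LOCUS"].split()
print(fields["ACCESSION"], int(locus[1]), locus[3])
```

Parsing the FEATURES table itself takes more care, because qualifiers can span several lines and locations can be compound; an established parser is usually the safer choice there.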

PDB Format Structure

PDB format is a series of fixed-width records, each identified by a keyword in its first six columns
- HEADER record gives the macromolecule's classification, the deposition date, and the four-character PDB ID (the experimental method and resolution are stored separately, in EXPDTA and REMARK 2 records)
- TITLE record provides a descriptive title for the structure
- ATOM records contain the 3D coordinates and additional information (occupancy, temperature factor, element) for each atom in the structure

Example PDB record snippet:

HEADER    TRANSFERASE                             22-NOV-91   1TRP
TITLE     REFINEMENT OF INDOLE-3-GLYCEROPHOSPHATE SYNTHASE FROM YEAST
ATOM      1  N   TRP A   1      17.047  14.099   3.625  1.00 13.79           N
ATOM      2  CA  TRP A   1      16.967  12.784   4.338  1.00 10.80           C
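Because ATOM records are column-based, fields are read by position rather than by splitting on whitespace. The sketch below slices one line using the column ranges of the legacy PDB format; the `Atom` type and `parse_atom_line` function are hypothetical names, and only a subset of the fields is extracted.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    element: str


def parse_atom_line(line: str) -> Atom:
    """Slice one ATOM/HETATM record by fixed columns (0-based slices)."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
        element=line[76:78].strip(),   # columns 77-78
    )


line = "ATOM      1  N   TRP A   1      17.047  14.099   3.625  1.00 13.79           N"
atom = parse_atom_line(line)
print(atom.res_name, atom.chain_id, atom.x, atom.y, atom.z)
```

Slicing by column matters because adjacent fields can run together with no space between them, for example when a coordinate is large and negative.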

Data Extraction from File Formats

Parsing Biological Data Files

Use programming languages to read and parse biological data files
- Python, R, and Perl are commonly used for parsing biological data
- Utilize built-in functions or libraries to handle specific file formats and simplify data extraction
  - Biopython library in Python
  - Bioconductor packages in R
Implement custom parsing algorithms to extract relevant information from the files
- Sequence identifiers
- Raw sequences (nucleotides or amino acids)
- Quality scores (in FASTQ files)
- Annotation details (in GenBank files)
Handle edge cases and potential formatting inconsistencies in the input files to ensure robust parsing
- Missing or malformed headers
- Inconsistent line breaks or delimiters
- Incomplete or corrupted records
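One way to handle these cases is to report problems with line numbers instead of failing on the first bad line. The sketch below checks a FASTA file for the issues listed above; the name `check_fasta` and the `ACGTN` alphabet are assumptions chosen for illustration, and a protein file would need a different alphabet.

```python
from typing import List, Set, Tuple

DNA_ALPHABET = set("ACGTN")  # assumed alphabet for this example


def check_fasta(text: str, alphabet: Set[str] = DNA_ALPHABET) -> List[Tuple[int, str]]:
    """Return (line_number, message) pairs describing formatting problems."""
    problems = []
    seen_header = False
    header_line = 0
    residues = 0
    # splitlines() copes with both "\n" and "\r\n" line endings.
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header and residues == 0:
                problems.append((header_line, "record has no sequence"))
            if not line[1:].strip():
                problems.append((number, "header has no identifier"))
            seen_header, header_line, residues = True, number, 0
        elif not seen_header:
            problems.append((number, "sequence data before first header"))
        else:
            bad = set(line.upper()) - alphabet
            if bad:
                problems.append((number, "unexpected characters: " + "".join(sorted(bad))))
            residues += len(line)
    if seen_header and residues == 0:
        problems.append((header_line, "record has no sequence"))
    return problems


messy = "ACGT\n>\nACGT\r\n>seq2\n>seq3\nACXT\n"
for number, message in check_fasta(messy):
    print(number, message)
```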

Data Conversion and Quality Control

Convert data between different file formats as needed for downstream analyses or compatibility with specific tools
- Convert FASTQ to FASTA format by keeping the identifier and sequence and discarding the quality scores
- Convert GenBank to FASTA format by extracting the nucleotide sequence from the ORIGIN section
- Convert PDB to FASTA format by extracting the amino acid sequence of each chain
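The FASTQ-to-FASTA case is the simplest of these conversions. A minimal sketch, assuming well-formed four-line records (the name `fastq_to_fasta` and the default line width of 60 are choices made for this example):

```python
def fastq_to_fasta(fastq_text: str, width: int = 60) -> str:
    """Keep each record's header and sequence; drop '+' and quality lines."""
    lines = [line for line in fastq_text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input is not a whole number of 4-line records")
    out = []
    for start in range(0, len(lines), 4):
        header, sequence = lines[start], lines[start + 1]
        if not header.startswith("@"):
            raise ValueError("expected '@' header, got: " + header)
        out.append(">" + header[1:])
        # Wrap the sequence to a fixed line width, as FASTA files usually do.
        out.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(out) + "\n"


print(fastq_to_fasta("@read1 sample\nACGTAC\n+\nIIIIII\n", width=4), end="")
```

The conversion is one-way: once the quality line is dropped, the FASTQ record cannot be rebuilt from the FASTA output.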
Perform quality control checks on the parsed data
- Filter low-quality sequences based on quality score thresholds (FASTQ)
- Trim adapters or contaminating sequences
- Remove duplicate sequences
- Validate the integrity and completeness of the parsed data
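Several of these checks can be combined in a single pass over the reads. The sketch below drops incomplete records, reads whose mean Phred score falls below a threshold, and exact duplicate sequences. The threshold of 20, the Phred+33 offset, and the names `mean_quality` and `filter_reads` are assumptions for illustration; production tools typically use more refined criteria, such as sliding-window trimming.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (identifier, sequence, quality string)


def mean_quality(quality: str, offset: int = 33) -> float:
    """Average Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(reads: Iterable[Read], min_mean_quality: float = 20.0) -> Iterator[Read]:
    """Yield reads that are complete, good enough, and not seen before."""
    seen = set()
    for identifier, sequence, quality in reads:
        if not sequence or len(sequence) != len(quality):
            continue  # incomplete or corrupted record
        if mean_quality(quality) < min_mean_quality:
            continue  # low-quality read
        if sequence in seen:
            continue  # exact duplicate of an earlier read
        seen.add(sequence)
        yield identifier, sequence, quality


reads = [
    ("r1", "ACGT", "IIII"),  # mean score 40, kept
    ("r2", "ACGT", "IIII"),  # duplicate of r1
    ("r3", "TTTT", "!!!!"),  # mean score 0
    ("r4", "GG", "I"),       # quality line too short
]
kept = [identifier for identifier, _, _ in filter_reads(reads)]
print(kept)
```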

Data Manipulation and Analysis

Data Integration and Computational Tasks

Integrate data from multiple file formats to gain a comprehensive understanding of the biological system under study
- Combine sequence data (FASTA) with quality scores (FASTQ) and annotations (GenBank)
- Integrate structural information (PDB) with functional annotations (GenBank)
Use the extracted data for various computational tasks
- Sequence alignment (pairwise or multiple sequence alignment)
- Variant calling (identifying genetic variations)
- Structure prediction (predicting protein 3D structures)
- Data visualization (generating plots, graphs, or interactive visualizations)
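As one concrete instance of the first task, pairwise global alignment can be scored with dynamic programming in the style of Needleman-Wunsch. This sketch computes only the optimal score, not the alignment itself, and the scoring values (match +1, mismatch -1, linear gap -2) are arbitrary assumptions; real tools use substitution matrices and affine gap penalties.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score using two rows of the DP table."""
    # Row 0: aligning an empty prefix of `a` against each prefix of `b`.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # prefix of `a` against an empty prefix of `b`
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,               # gap in b
                current[j - 1] + gap,            # gap in a
            ))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Keeping only two rows reduces memory from O(len(a) * len(b)) to O(len(b)), at the cost of not being able to trace back the alignment.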

Statistical Analysis and Machine Learning

Apply statistical methods to analyze and interpret the data obtained from the parsed files
- Calculate sequence similarity scores or distances
- Perform statistical tests to identify significant differences or associations
- Conduct enrichment analyses to identify overrepresented functional categories or motifs
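A simple alignment-free distance illustrates the first item: compare the sets of k-mers (substrings of length k) in two sequences. The sketch below uses the Jaccard distance; the choice of k = 3 and the function names are assumptions for this example, and it ignores how often each k-mer occurs.

```python
from typing import Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Set of all overlapping substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_distance(a: str, b: str, k: int = 3) -> float:
    """1 - |shared k-mers| / |all k-mers|; 0.0 means identical k-mer sets."""
    kmers_a, kmers_b = kmers(a, k), kmers(b, k)
    union = kmers_a | kmers_b
    if not union:
        return 0.0  # both sequences are shorter than k
    return 1.0 - len(kmers_a & kmers_b) / len(union)


print(jaccard_distance("ACGTACGT", "ACGTACGT"))
print(jaccard_distance("AAAA", "CCCC"))
```

Distances like this avoid the cost of alignment, which makes them practical for comparing many sequences, though they are less sensitive than alignment-based scores.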
Utilize machine learning techniques to extract insights and make predictions based on the parsed data
- Train classifiers to predict protein functions or subcellular localization
- Develop predictive models for disease diagnosis or drug response based on genetic variations
- Apply clustering algorithms to identify patterns or groups within the data
