11.3 MapReduce and HDFS

Written by the Fiveable Content Team • Last updated September 2025

MapReduce is a powerful programming model for processing big data across distributed systems. It breaks down complex tasks into manageable chunks, allowing for efficient parallel processing of large datasets on clusters of computers.

HDFS is the backbone of Hadoop's data storage system. It's designed to handle massive amounts of data, spread across multiple machines, while ensuring fault tolerance and high availability. This distributed approach allows for scalable and reliable data management.

MapReduce Programming Model

MapReduce programming model components

  • Distributed programming model processes large datasets across multiple nodes in a cluster
  • Enables parallel processing of big data, improving performance and scalability
  • Map phase processes input data, generating intermediate key-value pairs
    • Divides input data into splits distributed across multiple nodes
    • Each map task processes a single split, producing intermediate key-value pairs (word count, log analysis)
  • Reduce phase aggregates and processes intermediate key-value pairs
    • Sorts and groups intermediate key-value pairs by key
    • Each reduce task processes the grouped values for its keys, producing the final output (sum, average)
  • Shuffle and sort phase transfers intermediate key-value pairs from map tasks to reduce tasks
    • Sorts intermediate key-value pairs by key before sending them to reduce tasks (a minimal single-process sketch of these phases follows this list)
  • Relies on master-slave architecture
    • Master node manages task distribution and monitors progress
      • JobTracker coordinates MapReduce job execution, assigning tasks to slave nodes
    • Slave nodes execute the map and reduce tasks assigned by the master
      • TaskTracker runs on each slave node, executing assigned tasks (data processing, result generation)
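
To see how these phases fit together, here is a minimal single-process Java sketch (not an actual Hadoop job) that mimics map, shuffle-and-sort, and reduce for a hypothetical log-analysis task: counting requests per HTTP status code. The input splits, log line format, and class name are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Single-process illustration of the MapReduce phases:
// count requests per HTTP status code from pre-split log lines.
public class MiniMapReducePhases {
    public static void main(String[] args) {
        // Input splits, as the framework would hand them to separate map tasks
        List<String> splits = List.of(
                "GET /index 200\nGET /missing 404",
                "POST /login 200\nGET /index 200");

        // Map phase: each split is processed independently,
        // emitting intermediate <statusCode, 1> pairs
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String split : splits) {
            for (String line : split.split("\n")) {
                String statusCode = line.split(" ")[2];
                intermediate.add(Map.entry(statusCode, 1));
            }
        }

        // Shuffle and sort: group intermediate pairs by key, sorted by key
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Reduce phase: aggregate the grouped values for each key
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int total = entry.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(entry.getKey() + "\t" + total);   // e.g. 200  3
        }
    }
}
```

In a real cluster the map and reduce steps run as separate tasks on different nodes and the shuffle moves data over the network, but the data flow is the same.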

Writing MapReduce programs

  • Typically written in Java, though other languages such as Python can be used via Hadoop Streaming
  • Define mapper function
    • Processes input key-value pairs, generating intermediate key-value pairs
    • Implements the map() method, which takes a key-value pair as input and emits intermediate key-value pairs
  • Define reducer function
    • Processes grouped intermediate key-value pairs, generating the final output
    • Implements the reduce() method, which takes a key and a list of values as input and emits final key-value pairs
  • Configure MapReduce job
    • Specify the input and output paths and formats, along with the mapper and reducer classes
    • Set any additional configuration parameters the job requires (number of reducers, memory allocation)
  • Word count example (a Java sketch of the full job follows this list)
    • Mapper emits <word, 1> for each word in the input text
    • Reducer sums the counts for each word, emitting <word, total_count>
    • Final output contains each unique word and its total count in the input dataset
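
A minimal sketch of this word count job against the standard Hadoop MapReduce Java API might look like the following; the class names follow the widely used Hadoop word count example, and the input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits <word, 1> for each word in the input text
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word, emitting <word, total_count>
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Job configuration: mapper and reducer classes, output types, input/output paths
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a jar, the job could be submitted with something like hadoop jar wordcount.jar WordCount /input/path /output/path (paths illustrative); each reducer then writes its <word, total_count> pairs to a part file in the output directory.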

Hadoop Distributed File System (HDFS)

Architecture of Hadoop Distributed File System

  • Distributed file system stores and manages large datasets across multiple cluster nodes
  • Fault-tolerant, automatically replicating data across nodes to ensure data availability and reliability
  • Scalable, allowing nodes to be added to increase storage capacity and performance
  • High-throughput, optimized for batch processing and large data transfers
  • NameNode (master node) manages the file system namespace and regulates client access to files
    • Maintains the file system tree and metadata for all files and directories
    • Coordinates data replication and distribution across the cluster (block placement, replication factor); a client-side metadata sketch follows this list
  • DataNodes (slave nodes) store the actual data in the form of blocks
    • Files are divided into blocks (default 128 MB) distributed across DataNodes
    • Serve read and write requests from clients (data retrieval, block storage)
  • Secondary NameNode performs periodic checkpoints of the NameNode's metadata, merging the edit log into the file system image to limit its growth and reduce NameNode restart time
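
This architecture is visible to client code mainly through the metadata the NameNode serves. Below is a minimal sketch using the Hadoop FileSystem Java client API that asks for a file's block size, replication factor, and block locations; the path /data/sales.csv is a hypothetical example, and the configuration is assumed to point at an HDFS cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: ask the NameNode for a file's block-level metadata.
public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle to HDFS
        Path file = new Path("/data/sales.csv");    // illustrative path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());    // default 128 MB
        System.out.println("Replication: " + status.getReplication());  // default 3

        // Block locations come from NameNode metadata; the blocks themselves live on DataNodes
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```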

Data management in HDFS

  • Data is stored as fixed-size blocks (default 128 MB) distributed across DataNodes
    • Each block is stored as a separate file on the DataNode's local file system
    • Blocks are replicated across DataNodes (default replication factor of 3) for fault tolerance
  • Data replication is determined by the NameNode based on the replication factor and cluster topology
    • Replicas are placed on different racks, ensuring data availability during rack failures
    • DataNodes send heartbeats and block reports to the NameNode to maintain replication status
  • Data access goes through the NameNode, which provides the file system namespace and metadata (a client read/write sketch follows this list)
    • Read operations
      1. Client requests block locations from NameNode
      2. Client reads data directly from the nearest DataNode holding the required block
    • Write operations
      1. Client requests NameNode to allocate new block for writing
      2. Client writes data to DataNodes in a pipeline fashion
      3. DataNodes replicate the data, ensuring the desired replication factor is maintained
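
Both flows are hidden behind ordinary client calls: the NameNode lookups, block allocation, write pipeline, and replication described above happen inside fs.open() and fs.create(). A minimal sketch, again using the Hadoop FileSystem Java API with an illustrative path and payload:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: write a file to HDFS, then read it back.
public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/example.txt");   // illustrative path

        // Write: the NameNode allocates blocks, data flows to DataNodes in a pipeline
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations, data is read directly from DataNodes
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```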