11.3 MapReduce and HDFS

Written by the Fiveable Content Team • Last updated September 2025

MapReduce is a powerful programming model for processing big data across distributed systems. It breaks down complex tasks into manageable chunks, allowing for efficient parallel processing of large datasets on clusters of computers.

HDFS is the backbone of Hadoop's data storage system. It's designed to handle massive amounts of data, spread across multiple machines, while ensuring fault tolerance and high availability. This distributed approach allows for scalable and reliable data management.

MapReduce Programming Model

MapReduce programming model components

  • Distributed programming model processes large datasets across multiple nodes in a cluster
  • Enables parallel processing of big data, improving performance and scalability
  • Map phase processes input data, generating intermediate key-value pairs
    • Divides input data into splits distributed across multiple nodes
    • Each map task processes a single split, producing intermediate key-value pairs (word count, log analysis)
  • Reduce phase aggregates and processes intermediate key-value pairs
    • Sorts and groups intermediate key-value pairs by key
    • Each reduce task processes the grouped values for its keys, producing the final output (sum, average)
  • Shuffle and sort phase transfers intermediate key-value pairs from map tasks to reduce tasks
    • Sorts intermediate key-value pairs by key before sending them to reduce tasks (a minimal single-process sketch of these phases follows this list)
  • Relies on master-slave architecture
    • Master node manages task distribution and monitors progress
      • JobTracker coordinates MapReduce job execution, assigning tasks to slave nodes
    • Slave nodes execute the map and reduce tasks assigned by the master
      • TaskTracker runs on each slave node, executing assigned tasks (data processing, result generation)
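
To see how these phases fit together, here is a minimal single-process Java sketch (not an actual Hadoop job) that mimics map, shuffle-and-sort, and reduce for a hypothetical log-analysis task: counting requests per HTTP status code. The input splits, log line format, and class name are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Single-process illustration of the MapReduce phases:
// count requests per HTTP status code from pre-split log lines.
public class MiniMapReducePhases {
    public static void main(String[] args) {
        // Input splits, as the framework would hand them to separate map tasks
        List<String> splits = List.of(
                "GET /index 200\nGET /missing 404",
                "POST /login 200\nGET /index 200");

        // Map phase: each split is processed independently,
        // emitting intermediate <statusCode, 1> pairs
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String split : splits) {
            for (String line : split.split("\n")) {
                String statusCode = line.split(" ")[2];
                intermediate.add(Map.entry(statusCode, 1));
            }
        }

        // Shuffle and sort: group intermediate pairs by key, sorted by key
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Reduce phase: aggregate the grouped values for each key
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int total = entry.getValue().stream().mapToInt(Integer::intValue).sum();
            System.out.println(entry.getKey() + "\t" + total);   // e.g. 200  3
        }
    }
}
```

In a real cluster the map and reduce steps run as separate tasks on different nodes and the shuffle moves data over the network, but the data flow is the same.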

Writing MapReduce programs

  • Typically written in Java, though other languages such as Python can be used via Hadoop Streaming
  • Define mapper function
    • Processes input key-value pairs, generating intermediate key-value pairs
    • Implements the map() method, which takes a key-value pair as input and emits intermediate key-value pairs
  • Define reducer function
    • Processes grouped intermediate key-value pairs, generating the final output
    • Implements the reduce() method, which takes a key and a list of values as input and emits final key-value pairs
  • Configure MapReduce job
    • Specify the input and output paths and formats, along with the mapper and reducer classes
    • Set any additional configuration parameters the job requires (number of reducers, memory allocation)
  • Word count example (a Java sketch of the full job follows this list)
    • Mapper emits <word, 1> for each word in the input text
    • Reducer sums the counts for each word, emitting <word, total_count>
    • Final output contains each unique word and its total count in the input dataset
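
A minimal sketch of this word count job against the standard Hadoop MapReduce Java API might look like the following; the class names follow the widely used Hadoop word count example, and the input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits <word, 1> for each word in the input text
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word, emitting <word, total_count>
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Job configuration: mapper and reducer classes, output types, input/output paths
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a jar, the job could be submitted with something like hadoop jar wordcount.jar WordCount /input/path /output/path (paths illustrative); each reducer then writes its <word, total_count> pairs to a part file in the output directory.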

Hadoop Distributed File System (HDFS)

Architecture of Hadoop Distributed File System

  • Distributed file system stores and manages large datasets across multiple cluster nodes
  • Fault-tolerant, automatically replicating data across nodes to ensure data availability and reliability
  • Scalable, allowing nodes to be added to increase storage capacity and performance
  • High-throughput, optimized for batch processing and large data transfers
  • NameNode (master node) manages the file system namespace and regulates client access to files
    • Maintains the file system tree and metadata for all files and directories
    • Coordinates data replication and distribution across the cluster (block placement, replication factor); a client-side metadata sketch follows this list
  • DataNodes (slave nodes) store the actual data in the form of blocks
    • Files are divided into blocks (default 128 MB) distributed across DataNodes
    • Serve read and write requests from clients (data retrieval, block storage)
  • Secondary NameNode performs periodic checkpoints of the NameNode's metadata, merging the edit log into the file system image to limit its growth and reduce NameNode restart time
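
This architecture is visible to client code mainly through the metadata the NameNode serves. Below is a minimal sketch using the Hadoop FileSystem Java client API that asks for a file's block size, replication factor, and block locations; the path /data/sales.csv is a hypothetical example, and the configuration is assumed to point at an HDFS cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: ask the NameNode for a file's block-level metadata.
public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle to HDFS
        Path file = new Path("/data/sales.csv");    // illustrative path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());    // default 128 MB
        System.out.println("Replication: " + status.getReplication());  // default 3

        // Block locations come from NameNode metadata; the blocks themselves live on DataNodes
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```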

Data management in HDFS

  • Data is stored as fixed-size blocks (default 128 MB) distributed across DataNodes
    • Each block is stored as a separate file on the DataNode's local file system
    • Blocks are replicated across DataNodes (default replication factor of 3) for fault tolerance
  • Data replication is determined by the NameNode based on the replication factor and cluster topology
    • Replicas are placed on different racks, ensuring data availability during rack failures
    • DataNodes send heartbeats and block reports to the NameNode to maintain replication status
  • Data access goes through the NameNode, which provides the file system namespace and metadata (a client read/write sketch follows this list)
    • Read operations
      1. Client requests block locations from NameNode
      2. Client reads data directly from the nearest DataNode holding the required block
    • Write operations
      1. Client requests NameNode to allocate new block for writing
      2. Client writes data to DataNodes in a pipeline fashion
      3. DataNodes replicate the data, ensuring the desired replication factor is maintained
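
Both flows are hidden behind ordinary client calls: the NameNode lookups, block allocation, write pipeline, and replication described above happen inside fs.open() and fs.create(). A minimal sketch, again using the Hadoop FileSystem Java API with an illustrative path and payload:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: write a file to HDFS, then read it back.
public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/example.txt");   // illustrative path

        // Write: the NameNode allocates blocks, data flows to DataNodes in a pipeline
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the NameNode returns block locations, data is read directly from DataNodes
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```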