🎲Data Science Statistics Unit 20 Review

20.2 Data Manipulation and Cleaning

🎲Data Science Statistics
Unit 20 Review

20.2 Data Manipulation and Cleaning

Written by the Fiveable Content Team • Last updated September 2025

🎲Data Science Statistics

Unit & Topic Study Guides

20.1 Introduction to R or Python for Statistical Analysis

20.2 Data Manipulation and Cleaning

20.3 Advanced Data Visualization Techniques

20.4 Reproducible Research and Reporting

Data manipulation and cleaning are crucial skills in data science. These techniques transform raw, messy data into a structured format suitable for analysis. From handling missing values to creating new features, these methods ensure your data is ready for statistical modeling and visualization.

Understanding data preprocessing empowers you to extract meaningful insights from complex datasets. By mastering these techniques, you'll be able to tackle real-world data challenges and prepare high-quality inputs for your statistical analyses and visualizations.

Data Preprocessing

Data Wrangling and Transformation

Data wrangling involves cleaning and organizing raw data into a usable format
Data preprocessing prepares raw data for analysis through cleaning, formatting, and structuring
Data transformation modifies data structure or values to improve analysis (logarithmic transformations)
Data filtering selects specific subsets of data based on defined criteria
- Removes irrelevant or low-quality data points
- Uses logical operators to create filter conditions (df[df['column'] > 5])
Data aggregation combines multiple data points into summary statistics
- Calculates metrics like mean, median, or sum for groups of data
- Often uses groupby() function in pandas to aggregate by categories

Advanced Data Manipulation Techniques

Pivot tables restructure data to create summary views
- Transforms long-format data into wide-format for analysis
- Uses pivot_table() function in pandas
Window functions perform calculations across a set of rows related to the current row
- Includes rolling averages, cumulative sums, and ranking
- Implemented using rolling(), expanding(), or window() functions
Data binning groups continuous data into discrete intervals
- Simplifies analysis and visualization of numerical data
- Can use pd.cut() or pd.qcut() for equal-width or equal-frequency binning

Handling Missing Data and Outliers

Identifying and Addressing Missing Data

Missing data occurs when values are not present for some variables or observations
Types of missing data include Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR)
Strategies for handling missing data:
- Deletion methods remove rows or columns with missing values
  - Listwise deletion removes entire rows with any missing values
  - Pairwise deletion removes specific data points only when needed for analysis
- Imputation methods fill in missing values with estimates
  - Mean/median imputation replaces missing values with central tendency measures
  - Regression imputation predicts missing values based on other variables
  - Multiple imputation creates several plausible imputed datasets

Detecting and Managing Outliers

Outliers represent data points significantly different from other observations
Methods for detecting outliers:
- Z-score identifies values beyond a certain number of standard deviations from the mean
- Interquartile Range (IQR) method flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
- Local Outlier Factor (LOF) algorithm detects outliers in multidimensional datasets
Strategies for handling outliers:
- Trimming removes extreme values from the dataset
- Winsorization caps extreme values at a specified percentile
- Transformation applies mathematical functions to reduce the impact of outliers (log transformation)

Feature Engineering and Scaling

Feature Creation and Selection

Feature engineering creates new variables from existing data to improve model performance
- Interaction terms combine two or more features (multiplying 'height' and 'weight')
- Polynomial features generate new features by raising existing ones to powers
- Domain-specific features incorporate expert knowledge into the dataset
Feature selection techniques choose the most relevant variables for analysis
- Filter methods use statistical tests to select features (correlation coefficients)
- Wrapper methods use model performance to evaluate feature subsets (recursive feature elimination)
- Embedded methods perform feature selection as part of the model training process (Lasso regression)

Data Scaling and Normalization

Data normalization scales features to a common range, typically between 0 and 1
- Min-Max scaling: $x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}$
- Improves convergence speed for gradient-based algorithms
Data standardization scales features to have zero mean and unit variance
- Z-score standardization: $x_{standardized} = \frac{x - \mu}{\sigma}$
- Useful when features have different units or scales
Data encoding transforms categorical variables into numerical format
- One-hot encoding creates binary columns for each category
- Label encoding assigns a unique integer to each category
- Ordinal encoding preserves the order of categorical variables

Data Restructuring

Data Merging and Joining

Data merging combines multiple datasets based on common keys or columns
Types of joins in data merging:
- Inner join retains only rows with matching keys in both datasets
- Outer join keeps all rows from both datasets, filling in missing values
- Left join keeps all rows from the left dataset and matching rows from the right
- Right join keeps all rows from the right dataset and matching rows from the left
Concatenation appends datasets vertically (along rows) or horizontally (along columns)
- Vertical concatenation combines datasets with the same columns (pd.concat([df1, df2], axis=0))
- Horizontal concatenation combines datasets with the same index (pd.concat([df1, df2], axis=1))

Data Reshaping and Pivoting

Data reshaping transforms the structure of a dataset without changing its content
Long format organizes data with one observation per row and multiple columns for variables
Wide format organizes data with multiple observations per row and separate columns for each variable
Melting converts wide format to long format
- Uses pd.melt() function to unpivot a DataFrame
- Specifies ID variables and value variables for restructuring
Pivoting converts long format to wide format
- Uses pivot() or pivot_table() functions to reshape data
- Allows for aggregation of multiple values in the same cell

🎲Data Science Statistics Unit 20 Review

20.2 Data Manipulation and Cleaning

🎲Data Science Statistics
Unit 20 Review

20.2 Data Manipulation and Cleaning

Unit & Topic Study Guides

Data Preprocessing

Data Wrangling and Transformation

Advanced Data Manipulation Techniques

Handling Missing Data and Outliers

Identifying and Addressing Missing Data

Detecting and Managing Outliers

Feature Engineering and Scaling

Feature Creation and Selection

Data Scaling and Normalization

Data Restructuring

Data Merging and Joining

Data Reshaping and Pivoting

history

social science

english & capstone

arts

science

math & computer science

world languages

high school exams

honors classes

college classes

Study Content & Tools

Company

Resources

history

social science

english & capstone

arts

science

math & computer science

world languages

high school exams

honors classes

college classes