Data manipulation and cleaning are crucial skills in data science. These techniques transform raw, messy data into a structured format suitable for analysis. From handling missing values to creating new features, these methods ensure your data is ready for statistical modeling and visualization.
Understanding data preprocessing empowers you to extract meaningful insights from complex datasets. By mastering these techniques, you'll be able to tackle real-world data challenges and prepare high-quality inputs for your statistical analyses and visualizations.
Data Preprocessing
Data Wrangling and Transformation
- Data wrangling involves cleaning and organizing raw data into a usable format
- Data preprocessing prepares raw data for analysis through cleaning, formatting, and structuring
- Data transformation modifies data structure or values to improve analysis (logarithmic transformations)
- Data filtering selects specific subsets of data based on defined criteria
- Removes irrelevant or low-quality data points
- Uses logical operators to create filter conditions (
df[df['column'] > 5]
)
- Data aggregation combines multiple data points into summary statistics
- Calculates metrics like mean, median, or sum for groups of data
- Often uses
groupby()
function in pandas to aggregate by categories
Advanced Data Manipulation Techniques
- Pivot tables restructure data to create summary views
- Transforms long-format data into wide-format for analysis
- Uses
pivot_table()
function in pandas
- Window functions perform calculations across a set of rows related to the current row
- Includes rolling averages, cumulative sums, and ranking
- Implemented using
rolling()
,expanding()
, orwindow()
functions
- Data binning groups continuous data into discrete intervals
- Simplifies analysis and visualization of numerical data
- Can use
pd.cut()
orpd.qcut()
for equal-width or equal-frequency binning
Handling Missing Data and Outliers
Identifying and Addressing Missing Data
- Missing data occurs when values are not present for some variables or observations
- Types of missing data include Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR)
- Strategies for handling missing data:
- Deletion methods remove rows or columns with missing values
- Listwise deletion removes entire rows with any missing values
- Pairwise deletion removes specific data points only when needed for analysis
- Imputation methods fill in missing values with estimates
- Mean/median imputation replaces missing values with central tendency measures
- Regression imputation predicts missing values based on other variables
- Multiple imputation creates several plausible imputed datasets
- Deletion methods remove rows or columns with missing values
Detecting and Managing Outliers
- Outliers represent data points significantly different from other observations
- Methods for detecting outliers:
- Z-score identifies values beyond a certain number of standard deviations from the mean
- Interquartile Range (IQR) method flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
- Local Outlier Factor (LOF) algorithm detects outliers in multidimensional datasets
- Strategies for handling outliers:
- Trimming removes extreme values from the dataset
- Winsorization caps extreme values at a specified percentile
- Transformation applies mathematical functions to reduce the impact of outliers (log transformation)
Feature Engineering and Scaling
Feature Creation and Selection
- Feature engineering creates new variables from existing data to improve model performance
- Interaction terms combine two or more features (multiplying 'height' and 'weight')
- Polynomial features generate new features by raising existing ones to powers
- Domain-specific features incorporate expert knowledge into the dataset
- Feature selection techniques choose the most relevant variables for analysis
- Filter methods use statistical tests to select features (correlation coefficients)
- Wrapper methods use model performance to evaluate feature subsets (recursive feature elimination)
- Embedded methods perform feature selection as part of the model training process (Lasso regression)
Data Scaling and Normalization
- Data normalization scales features to a common range, typically between 0 and 1
- Min-Max scaling:
- Improves convergence speed for gradient-based algorithms
- Data standardization scales features to have zero mean and unit variance
- Z-score standardization:
- Useful when features have different units or scales
- Data encoding transforms categorical variables into numerical format
- One-hot encoding creates binary columns for each category
- Label encoding assigns a unique integer to each category
- Ordinal encoding preserves the order of categorical variables
Data Restructuring
Data Merging and Joining
- Data merging combines multiple datasets based on common keys or columns
- Types of joins in data merging:
- Inner join retains only rows with matching keys in both datasets
- Outer join keeps all rows from both datasets, filling in missing values
- Left join keeps all rows from the left dataset and matching rows from the right
- Right join keeps all rows from the right dataset and matching rows from the left
- Concatenation appends datasets vertically (along rows) or horizontally (along columns)
- Vertical concatenation combines datasets with the same columns (
pd.concat([df1, df2], axis=0)
) - Horizontal concatenation combines datasets with the same index (
pd.concat([df1, df2], axis=1)
)
- Vertical concatenation combines datasets with the same columns (
Data Reshaping and Pivoting
- Data reshaping transforms the structure of a dataset without changing its content
- Long format organizes data with one observation per row and multiple columns for variables
- Wide format organizes data with multiple observations per row and separate columns for each variable
- Melting converts wide format to long format
- Uses
pd.melt()
function to unpivot a DataFrame - Specifies ID variables and value variables for restructuring
- Uses
- Pivoting converts long format to wide format
- Uses
pivot()
orpivot_table()
functions to reshape data - Allows for aggregation of multiple values in the same cell
- Uses