🐛Biostatistics Unit 13 Review

13.2 Data manipulation and visualization using R packages

Written by the Fiveable Content Team • Last updated September 2025

R packages such as dplyr and ggplot2 streamline data manipulation and visualization. They provide consistent functions for cleaning messy datasets, building graphs, and exploring patterns in the data. In biostatistics, that consistency lets you spend more time on analysis and less on writing boilerplate code.

With these packages you can clean, transform, and visualize complex biological datasets in a few readable lines. This guide covers dplyr for manipulation, ggplot2 for plotting, tidyr for reshaping, and the merge and join functions for combining datasets.

Data Manipulation with dplyr

Core Functions for Data Transformation

The dplyr package provides a set of functions for data manipulation and transformation in R, enabling users to efficiently clean, filter, and reshape datasets
The select() function allows users to choose specific columns from a dataset (e.g., select(data, column1, column2))
The filter() function enables subsetting rows based on specified conditions (e.g., filter(data, column1 > 5))
The mutate() function is used to create new columns or modify existing ones by applying transformations or calculations to the data (e.g., mutate(data, new_column = column1 + column2))
The arrange() function sorts the rows of a dataset based on one or more columns, in ascending or descending order (e.g., arrange(data, column1, desc(column2)))
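
The four verbs above are usually chained with the pipe. The following is a minimal sketch, assuming a small made-up table of mice; the data frame `mice` and its column names are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical example data: one row per mouse
mice <- tibble(
  id        = c("m1", "m2", "m3", "m4"),
  strain    = c("A", "A", "B", "B"),
  cage      = c(1, 1, 2, 2),
  weight_g  = c(21.5, 24.0, 19.8, 26.1),
  length_cm = c(8.0, 8.5, 7.9, 9.0)
)

result <- mice %>%
  select(id, strain, weight_g, length_cm) %>%  # keep four columns, drop cage
  filter(weight_g > 20) %>%                    # keep rows heavier than 20 g
  mutate(ratio = weight_g / length_cm) %>%     # add a derived column
  arrange(strain, desc(weight_g))              # sort by strain, heaviest first

print(result)
```

Each verb takes a data frame as its first argument and returns a new one, which is why the steps compose cleanly in a pipeline.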

Grouping and Distinct Operations

The group_by() function is used to split a dataset into groups based on one or more variables, allowing for group-wise operations using the summarize() function (e.g., group_by(data, column1) %>% summarize(mean_column2 = mean(column2)))
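The distinct() function keeps only unique rows. When columns are specified, uniqueness is judged on those columns and only those columns are returned unless .keep_all = TRUE is set (e.g., distinct(data, column1, column2))
The sample_n() and sample_frac() functions enable random sampling of rows from a dataset; in dplyr 1.0.0 and later they are superseded by slice_sample(), though they still work
- sample_n(data, 100) selects 100 random rows (equivalent to slice_sample(data, n = 100))
- sample_frac(data, 0.1) selects a random 10% of the rows (equivalent to slice_sample(data, prop = 0.1))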
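
As a rough illustration of grouping, de-duplication, and sampling, here is a short sketch on a made-up table; the name `obs` and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical measurements: several per treatment group
obs <- tibble(
  treatment = c("ctrl", "ctrl", "drug", "drug", "drug"),
  response  = c(1.0, 3.0, 4.0, 6.0, 8.0)
)

# Group-wise summary: one output row per treatment
by_trt <- obs %>%
  group_by(treatment) %>%
  summarize(mean_response = mean(response), n = n())

# Unique treatment values; other columns are dropped by default
trts <- distinct(obs, treatment)

# Random sample of 3 rows; set.seed() makes the draw reproducible
set.seed(1)
sampled <- sample_n(obs, 3)

print(by_trt)
```

Setting a seed before sampling is worth the extra line in any analysis that needs to be reproduced later.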

Data Visualization with ggplot2

Building Blocks of ggplot2

ggplot2 is a powerful and flexible package for creating high-quality visualizations in R, based on the Grammar of Graphics
The ggplot() function is the foundation of every plot: it takes a dataset and aesthetic mappings (aes()) as arguments to define the plot's basic structure (e.g., ggplot(data, aes(x = column1, y = column2)))
Geometries (geom_()) are added to the plot to represent the data, such as points (geom_point()), lines (geom_line()), bars (geom_bar()), or boxplots (geom_boxplot())
Scales (scale_()) are used to control the mapping of data values to visual properties, such as colors (scale_color_()) or sizes (scale_size_())
Facets (facet_wrap() and facet_grid()) allow for the creation of small multiples, displaying subsets of the data in separate panels based on one or more categorical variables (e.g., facet_wrap(~ category))
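
The layers described above are combined with `+`. A minimal sketch, assuming a made-up table of expression levels (the data frame `expr`, its columns, and the colour choices are all hypothetical):

```r
library(ggplot2)

# Hypothetical data: expression level over time for two genes
expr <- data.frame(
  time  = rep(1:4, times = 2),
  level = c(1.0, 1.8, 2.9, 4.1, 2.0, 2.1, 1.9, 2.2),
  gene  = rep(c("geneA", "geneB"), each = 4)
)

p <- ggplot(expr, aes(x = time, y = level, color = gene)) +  # data + mappings
  geom_point() +                                             # one point per row
  geom_line() +                                              # connect points within each gene
  scale_color_manual(values = c(geneA = "steelblue", geneB = "firebrick")) +
  facet_wrap(~ gene)                                         # one panel per gene

print(p)
```

Because the plot is an ordinary R object, it can be stored in a variable, extended with further layers, and printed when ready.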

Customizing and Annotating Plots

Themes (theme_()) and manual theme adjustments (theme()) enable customization of the plot's appearance, including background, text, and legend settings (e.g., theme_minimal() or theme(legend.position = "bottom"))
Labels (labs()), titles (ggtitle()), and annotations (annotate()) are used to add informative text elements to the plot, enhancing its readability and interpretation
- labs(x = "X-axis label", y = "Y-axis label") sets axis labels
- ggtitle("Plot Title") adds a title to the plot
- annotate("text", x = 1, y = 2, label = "Annotation") adds custom text annotations to specific coordinates
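
Putting the labelling and theme functions together might look like the sketch below. The data frame `dose`, the label text, and the annotation coordinates are hypothetical and chosen only for illustration.

```r
library(ggplot2)

# Hypothetical dose-response data
dose <- data.frame(
  dose_mg  = c(0, 5, 10, 20),
  response = c(0.10, 0.35, 0.62, 0.80)
)

p <- ggplot(dose, aes(x = dose_mg, y = response)) +
  geom_point() +
  labs(x = "Dose (mg)", y = "Response (fraction)") +              # axis labels
  ggtitle("Dose-response curve") +                                # plot title
  annotate("text", x = 10, y = 0.70, label = "midpoint dose") +   # text at data coordinates
  theme_minimal() +                                               # complete theme first
  theme(legend.position = "bottom")                               # then manual tweaks

print(p)
```

The order matters for themes: a complete theme such as theme_minimal() resets theme settings, so manual theme() adjustments are best added after it.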

Data Tidying with tidyr

Reshaping Data with pivot_longer() and pivot_wider()

The tidyr package provides functions for tidying and reshaping data, making it easier to work with in R and compatible with other tidyverse packages
The pivot_longer() function is used to convert wide-format data into long-format, where each row represents a single observation, and columns represent variables (e.g., pivot_longer(data, cols = c("column1", "column2"), names_to = "variable", values_to = "value"))
The pivot_wider() function is used to convert long-format data into wide-format, where each row represents a unique combination of key variables, and columns represent measured variables (e.g., pivot_wider(data, names_from = "variable", values_from = "value"))
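
A round trip between the two layouts can be sketched as follows, assuming a small made-up table with one column per day; the names `wide`, `day1`, `day2`, and `measurement` are hypothetical.

```r
library(tidyr)

# Hypothetical wide data: one column per time point
wide <- tibble(
  subject = c("s1", "s2"),
  day1    = c(5.1, 4.8),
  day2    = c(5.6, 5.0)
)

# Wide to long: column names move into "day", cell values into "measurement"
long <- pivot_longer(
  wide,
  cols      = c(day1, day2),
  names_to  = "day",
  values_to = "measurement"
)

# Long back to wide: one column per distinct value of day
wide_again <- pivot_wider(
  long,
  names_from  = day,
  values_from = measurement
)

print(long)
```

The long layout is usually the one ggplot2 and grouped dplyr summaries expect, while the wide layout is often easier to read in a report table.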

Handling Missing Values and Separating Columns

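The separate() function splits a single column into multiple columns based on a specified separator or regular expression (e.g., separate(data, column, into = c("new_column1", "new_column2"), sep = "_")). Since tidyr 1.3.0 it is superseded by separate_wider_delim() and related functions, but it remains available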
The unite() function combines multiple columns into a single column (e.g., unite(data, "new_column", column1, column2, sep = "_"))
The drop_na() function removes rows with missing values (NA) from a dataset, either for specific columns or the entire dataset (e.g., drop_na(data, column1))
The replace_na() function replaces missing values with a specified value. For a data frame, the replacements are given as a named list with one value per column, and each value should match its column's type (e.g., replace_na(data, list(column1 = 0, column2 = "Unknown")))
The fill() function fills missing values in a column with the nearest non-missing value. By default it carries the last non-missing value downward, and the .direction argument can reverse this; it is useful for time series or grouped data (e.g., fill(data, column1))
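
The functions in this section can be tried side by side on one small table. This is only a sketch; the data frame `samples` and its columns are hypothetical.

```r
library(tidyr)

# Hypothetical sample sheet with a combined ID column and some gaps
samples <- tibble(
  sample_id = c("liver_01", "liver_02", "kidney_01", "kidney_02"),
  batch     = c("b1", NA, "b2", NA),
  count     = c(12, NA, 7, 3)
)

# Split "tissue_replicate" into two columns at the underscore
parts <- separate(samples, sample_id, into = c("tissue", "replicate"), sep = "_")

# Recombine the two columns into one
joined <- unite(parts, "sample_id", tissue, replicate, sep = "_")

# Drop rows where count is NA
complete <- drop_na(samples, count)

# Replace NA counts with 0 (named list: one replacement per column)
zeroed <- replace_na(samples, list(count = 0))

# Carry the last non-missing batch label downward
filled <- fill(samples, batch)

print(filled)
```

Whether to drop, replace, or fill a missing value is an analysis decision: replacing a missing count with 0, for instance, is only appropriate when missing really means "none observed".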

Combining Datasets in R

Merging Datasets with merge()

R provides several functions for combining multiple datasets based on common variables or keys, allowing for efficient data integration and analysis
The base R merge() function combines two data frames by matching rows on one or more common columns. By default it returns only the rows that have a match in both inputs (an inner join), with the columns of both
- The by argument specifies the common column(s) to match on, while the all, all.x, and all.y arguments control the inclusion of unmatched rows from either or both datasets (e.g., merge(data1, data2, by = "common_column", all.x = TRUE))
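
The effect of the all arguments is easiest to see on two tiny tables. A minimal sketch using base R only; the data frames `patients` and `lab_results` and their columns are hypothetical.

```r
# Hypothetical tables sharing a patient_id key
patients <- data.frame(
  patient_id = c(1, 2, 3),
  age        = c(34, 51, 29)
)
lab_results <- data.frame(
  patient_id = c(2, 3, 4),
  glucose    = c(5.4, 6.1, 4.9)
)

# Default: only rows with a match in both tables
inner <- merge(patients, lab_results, by = "patient_id")

# all.x = TRUE: keep every patient, with NA where no lab result exists
left <- merge(patients, lab_results, by = "patient_id", all.x = TRUE)

# all = TRUE: keep every row from both tables
outer <- merge(patients, lab_results, by = "patient_id", all = TRUE)

print(left)
```

Comparing nrow() before and after a merge is a quick check that no rows were lost or duplicated unexpectedly.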

Joining Datasets with dplyr

The dplyr package offers join functions that combine datasets based on common keys, with different types of joins available depending on the desired output
- inner_join() returns only the rows that have matching keys in both datasets (e.g., inner_join(data1, data2, by = "key_column"))
- left_join() and right_join() return all rows from the left or right dataset, respectively, and any matched rows from the other dataset (e.g., left_join(data1, data2, by = "key_column"))
- full_join() returns all rows from both datasets, with NA values filled in for unmatched rows (e.g., full_join(data1, data2, by = "key_column"))
- semi_join() and anti_join() return rows from the left dataset that have (semi) or do not have (anti) a match in the right dataset, without including columns from the right dataset (e.g., semi_join(data1, data2, by = "key_column"))
When combining datasets, it is essential to ensure that the common columns have the same data type and format to avoid issues with matching and merging
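
The join types listed above can be compared on the same pair of small tables. This is a sketch under the assumption of a shared numeric key; the tibbles `patients` and `lab_results` are hypothetical.

```r
library(dplyr)

# Hypothetical tables sharing a patient_id key
patients    <- tibble(patient_id = c(1, 2, 3), age = c(34, 51, 29))
lab_results <- tibble(patient_id = c(2, 3, 4), glucose = c(5.4, 6.1, 4.9))

# Check that the key columns share a type before joining
stopifnot(class(patients$patient_id) == class(lab_results$patient_id))

matched  <- inner_join(patients, lab_results, by = "patient_id")  # patients 2 and 3
all_left <- left_join(patients, lab_results, by = "patient_id")   # 1, 2, 3; glucose is NA for 1
everyone <- full_join(patients, lab_results, by = "patient_id")   # 1, 2, 3, 4
has_labs <- semi_join(patients, lab_results, by = "patient_id")   # 2 and 3, patient columns only
no_labs  <- anti_join(patients, lab_results, by = "patient_id")   # 1 only

print(everyone)
```

anti_join() is a handy diagnostic before a real join: it lists the records that would fail to match.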
