Biostatistics Unit 13 Review

13.2 Data manipulation and visualization using R packages

Written by the Fiveable Content Team • Last updated September 2025

R packages such as dplyr and ggplot2 cover most day-to-day data manipulation and visualization. They let you clean messy datasets, build publication-quality graphs, and explore patterns with short, readable code. That makes them core tools for biostatistics: you spend your time on the analysis rather than on low-level code.

With these packages you can clean, transform, and visualize complex biological datasets in a consistent, reproducible way. The same skills carry over directly to real-world research problems.

Data Manipulation with dplyr

Core Functions for Data Transformation

  • The dplyr package provides a set of functions for data manipulation and transformation in R, enabling users to efficiently clean, filter, and reshape datasets
  • The select() function allows users to choose specific columns from a dataset (e.g., select(data, column1, column2))
  • The filter() function enables subsetting rows based on specified conditions (e.g., filter(data, column1 > 5))
  • The mutate() function is used to create new columns or modify existing ones by applying transformations or calculations to the data (e.g., mutate(data, new_column = column1 + column2))
  • The arrange() function sorts the rows of a dataset based on one or more columns, in ascending or descending order (e.g., arrange(data, column1, desc(column2)))
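The four verbs above are usually chained with the pipe. Here is a minimal sketch, assuming a small made-up data frame called patients; its column names (id, age, weight_kg, height_m) are hypothetical and chosen only for illustration:

```r
library(dplyr)

# Hypothetical example data; names and values are illustrative only
patients <- data.frame(
  id        = 1:5,
  age       = c(34, 51, 47, 29, 62),
  weight_kg = c(70, 82, 95, 58, 77),
  height_m  = c(1.75, 1.80, 1.68, 1.62, 1.71)
)

result <- patients %>%
  select(id, age, weight_kg, height_m) %>%   # keep only the columns needed
  filter(age > 30) %>%                       # keep rows where age exceeds 30
  mutate(bmi = weight_kg / height_m^2) %>%   # add a derived column
  arrange(desc(bmi))                         # sort from highest to lowest BMI

print(result)
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations are applied.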

Grouping and Distinct Operations

  • The group_by() function is used to split a dataset into groups based on one or more variables, allowing for group-wise operations using the summarize() function (e.g., group_by(data, column1) %>% summarize(mean_column2 = mean(column2)))
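  • The distinct() function keeps only unique rows; when columns are specified, uniqueness is judged on those columns and only they are returned unless .keep_all = TRUE is set (e.g., distinct(data, column1, column2))
  • The sample_n() and sample_frac() functions enable random sampling of rows from a dataset; since dplyr 1.0.0 both are superseded by slice_sample(), though they still work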
    • sample_n(data, 100) selects 100 random rows
    • sample_frac(data, 0.1) selects a random 10% of the rows
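As a rough illustration of grouping, de-duplication, and sampling, the sketch below uses a made-up data frame named trial; the column names (treatment, site, response) are assumptions for the example, not part of any real dataset:

```r
library(dplyr)

# Hypothetical measurements; names and values are illustrative only
trial <- data.frame(
  treatment = c("A", "A", "B", "B", "B", "A"),
  site      = c("x", "x", "y", "y", "z", "x"),
  response  = c(1.2, 1.8, 2.5, 2.9, 3.1, 1.5)
)

# Group-wise summary: one output row per treatment
by_treatment <- trial %>%
  group_by(treatment) %>%
  summarize(mean_response = mean(response), n = n())

# Unique treatment/site combinations (only these two columns are returned)
combos <- distinct(trial, treatment, site)

# Random sample of 3 rows; set.seed() makes the draw reproducible
set.seed(42)
subsample <- sample_n(trial, 3)
```

Calling set.seed() before sampling is a common habit in analysis scripts, since it lets someone else reproduce the same random subset.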

Data Visualization with ggplot2

Building Blocks of ggplot2

  • ggplot2 is a powerful and flexible package for creating high-quality visualizations in R, based on the Grammar of Graphics
  • The ggplot() function is the starting point for every plot: it takes a dataset and aesthetic mappings (aes()) as arguments to define the plot's basic structure (e.g., ggplot(data, aes(x = column1, y = column2)))
  • Geometries (the geom_*() functions) are added to the plot to represent the data, such as points (geom_point()), lines (geom_line()), bars (geom_bar()), or boxplots (geom_boxplot())
  • Scales (the scale_*() functions) control the mapping of data values to visual properties, such as colors (scale_color_*()) or sizes (scale_size_*())
  • Facets (facet_wrap() and facet_grid()) allow for the creation of small multiples, displaying subsets of the data in separate panels based on one or more categorical variables (e.g., facet_wrap(~ category))
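The layers described above are combined with +. The following is a minimal sketch on a made-up data frame named growth; the column names (dose, response, strain) and the colors are assumptions picked for the example:

```r
library(ggplot2)

# Hypothetical data; names and values are illustrative only
growth <- data.frame(
  dose     = c(1, 2, 3, 1, 2, 3),
  response = c(2.1, 3.9, 6.2, 1.8, 3.1, 4.4),
  strain   = c("wild", "wild", "wild", "mutant", "mutant", "mutant")
)

p <- ggplot(growth, aes(x = dose, y = response, color = strain)) +
  geom_point() +          # geometry: one point per row
  geom_line() +           # geometry: connect the points within each strain
  scale_color_manual(     # scale: choose the color for each strain
    values = c(wild = "steelblue", mutant = "firebrick")
  ) +
  facet_wrap(~ strain)    # facets: one panel per strain

print(p)
```

Because the plot is stored in p, further layers can be added later with p + ... without rebuilding it from scratch.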

Customizing and Annotating Plots

  • Complete themes (the theme_*() functions) and manual theme adjustments (theme()) enable customization of the plot's appearance, including background, text, and legend settings (e.g., theme_minimal() or theme(legend.position = "bottom"))
  • Labels (labs()), titles (ggtitle()), and annotations (annotate()) are used to add informative text elements to the plot, enhancing its readability and interpretation
    • labs(x = "X-axis label", y = "Y-axis label") sets axis labels
    • ggtitle("Plot Title") adds a title to the plot
    • annotate("text", x = 1, y = 2, label = "Annotation") adds custom text annotations to specific coordinates
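Put together, the customization functions might look like the sketch below. The data frame df, its columns, and all label text are made up for illustration:

```r
library(ggplot2)

# Hypothetical data; names and values are illustrative only
df <- data.frame(
  x     = c(1, 2, 3, 4),
  y     = c(2, 4, 3, 5),
  group = c("a", "a", "b", "b")
)

p <- ggplot(df, aes(x = x, y = y, color = group)) +
  geom_point() +
  theme_minimal() +                     # complete built-in theme
  theme(legend.position = "bottom") +   # manual adjustment applied on top of it
  labs(x = "Time (weeks)", y = "Response", color = "Group") +
  ggtitle("Response over time") +
  annotate("text", x = 1, y = 2, label = "Baseline", vjust = -1)

print(p)
```

Order matters here: a complete theme such as theme_minimal() replaces earlier theme settings, so manual theme() tweaks are placed after it.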

Data Tidying with tidyr

Reshaping Data with pivot_longer() and pivot_wider()

  • The tidyr package provides functions for tidying and reshaping data, making it easier to work with in R and compatible with other tidyverse packages
  • The pivot_longer() function is used to convert wide-format data into long-format, where each row represents a single observation, and columns represent variables (e.g., pivot_longer(data, cols = c("column1", "column2"), names_to = "variable", values_to = "value"))
  • The pivot_wider() function is used to convert long-format data into wide-format, where each row represents a unique combination of key variables, and columns represent measured variables (e.g., pivot_wider(data, names_from = "variable", values_from = "value"))
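A small round trip can make the two directions concrete. This sketch assumes a made-up data frame named wide with one row per subject and one column per visit; all names are hypothetical:

```r
library(tidyr)

# Hypothetical wide data: one row per subject, one column per visit
wide <- data.frame(
  subject = c("s1", "s2"),
  visit1  = c(5.1, 6.3),
  visit2  = c(5.8, 6.0)
)

# Wide to long: one row per subject-visit measurement
long <- pivot_longer(
  wide,
  cols      = c("visit1", "visit2"),
  names_to  = "visit",
  values_to = "value"
)

# Long back to wide: recovers the original column layout
wide_again <- pivot_wider(long, names_from = "visit", values_from = "value")

print(long)
print(wide_again)
```

The long form is typically what ggplot2 and group_by() expect, while the wide form is often easier to read in a report table.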

Handling Missing Values and Separating Columns

  • The separate() function splits a single column into multiple columns based on a specified separator or regular expression (e.g., separate(data, column, into = c("new_column1", "new_column2"), sep = "_"))
  • The unite() function combines multiple columns into a single column (e.g., unite(data, "new_column", column1, column2, sep = "_"))
  • The drop_na() function removes rows with missing values (NA) from a dataset, either for specific columns or the entire dataset (e.g., drop_na(data, column1))
  • The replace_na() function replaces missing values with a specified value; for a data frame, supply a named list giving one replacement per column, matching that column's type (e.g., replace_na(data, list(column1 = 0, column2 = "Unknown")))
  • The fill() function is used to fill in missing values in a column with the last non-missing value, useful for carrying forward values in time series or grouped data (e.g., fill(data, column1))
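The functions in this list can be tried on a toy table. The sketch below assumes a made-up data frame named samples with a compound ID column and a few missing values; every name in it is hypothetical:

```r
library(tidyr)

# Hypothetical data with a compound ID column and some missing values
samples <- data.frame(
  sample_id = c("liver_01", "liver_02", "kidney_01", "kidney_02"),
  batch     = c("b1", NA, "b2", NA),
  count     = c(10, NA, 7, 3)
)

# Split "liver_01" into tissue = "liver" and replicate = "01"
parts <- separate(samples, sample_id, into = c("tissue", "replicate"), sep = "_")

# Recombine the two pieces into a single column
rejoined <- unite(parts, "sample_id", tissue, replicate, sep = "_")

# Drop rows where count is missing
complete_counts <- drop_na(samples, count)

# Alternatively, replace missing counts with 0 instead of dropping them
zero_filled <- replace_na(samples, list(count = 0))

# Carry the last non-missing batch label downward
batch_filled <- fill(samples, batch)
```

Whether to drop, replace, or fill a missing value is an analysis decision; the three calls give different datasets, so the choice should be stated in the methods.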

Combining Datasets in R

Merging Datasets with merge()

  • R provides several functions for combining multiple datasets based on common variables or keys, allowing for efficient data integration and analysis
  • The merge() function combines two datasets by matching rows on one or more common columns; by default it returns only rows with a match in both inputs, along with the columns from both datasets
    • The by argument specifies the common column(s) to match on, while the all, all.x, and all.y arguments control the inclusion of unmatched rows from either or both datasets (e.g., merge(data1, data2, by = "common_column", all.x = TRUE))
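To see how all.x and all change the result, here is a minimal base R sketch. The two tables (demographics, lab_results) and their columns are made up for the example:

```r
# Hypothetical tables sharing a patient_id column
demographics <- data.frame(patient_id = c(1, 2, 3), age = c(34, 51, 47))
lab_results  <- data.frame(patient_id = c(2, 3, 4), glucose = c(5.4, 6.1, 4.9))

# Default: keep only patient_ids present in both tables
matched <- merge(demographics, lab_results, by = "patient_id")

# all.x = TRUE: keep every row of the first table, NA where no lab result exists
keep_left <- merge(demographics, lab_results, by = "patient_id", all.x = TRUE)

# all = TRUE: keep every row from both tables
keep_all <- merge(demographics, lab_results, by = "patient_id", all = TRUE)

print(keep_all)
```

Comparing the row counts of the three results is a quick check that the merge kept or dropped the rows you expected.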

Joining Datasets with dplyr

  • The dplyr package offers join functions that combine datasets based on common keys, with different types of joins available depending on the desired output
    • inner_join() returns only the rows that have matching keys in both datasets (e.g., inner_join(data1, data2, by = "key_column"))
    • left_join() and right_join() return all rows from the left or right dataset, respectively, and any matched rows from the other dataset (e.g., left_join(data1, data2, by = "key_column"))
    • full_join() returns all rows from both datasets, with NA values filled in for unmatched rows (e.g., full_join(data1, data2, by = "key_column"))
    • semi_join() and anti_join() return rows from the left dataset that have (semi) or do not have (anti) a match in the right dataset, without including columns from the right dataset (e.g., semi_join(data1, data2, by = "key_column"))
  • When combining datasets, it is essential to ensure that the common columns have the same data type and format to avoid issues with matching and merging
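The join types can be compared side by side on two small tables. This sketch reuses the idea of a shared patient_id key; the table and column names are hypothetical. The last step illustrates the data type point by converting both keys to character before joining, since dplyr refuses to join a numeric key to a character key:

```r
library(dplyr)

# Hypothetical tables sharing a patient_id key
demographics <- data.frame(patient_id = c(1, 2, 3), age = c(34, 51, 47))
lab_results  <- data.frame(patient_id = c(2, 3, 4), glucose = c(5.4, 6.1, 4.9))

inner <- inner_join(demographics, lab_results, by = "patient_id")  # ids 2, 3
left  <- left_join(demographics, lab_results, by = "patient_id")   # ids 1, 2, 3
full  <- full_join(demographics, lab_results, by = "patient_id")   # ids 1, 2, 3, 4
semi  <- semi_join(demographics, lab_results, by = "patient_id")   # ids 2, 3; left columns only
anti  <- anti_join(demographics, lab_results, by = "patient_id")   # id 1; left columns only

# Suppose one table stored the key as text: convert both keys to a common type first
demo_chr <- mutate(demographics, patient_id = as.character(patient_id))
lab_chr  <- mutate(lab_results, patient_id = as.character(patient_id))
fixed    <- inner_join(demo_chr, lab_chr, by = "patient_id")
```

An anti_join() against the other table is also a handy diagnostic: it lists exactly which keys failed to match.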