Fiveable

💻 Advanced R Programming Unit 4 Review

4.4 Data manipulation with dplyr

Written by the Fiveable Content Team • Last updated September 2025

The dplyr package gives R a small, consistent set of functions for data manipulation. Each function does one job: select columns, filter rows, arrange rows, create variables, or summarize values. Together they let you reshape a data frame into the form an analysis needs.

With dplyr, you can chain these operations together using the pipe operator to build a data pipeline that reads top to bottom, one step per function call. Code written this way is easier to read and maintain, and the same handful of functions covers most everyday data tasks.

Data manipulation with dplyr

Selecting and filtering data

  • Select columns (variables) from a data frame using the select() function
    • Specify column names or positions to subset the data frame
    • Rename columns using the syntax new_name = old_name
  • Subset rows (observations) from a data frame based on logical conditions using the filter() function
    • Combine multiple conditions using Boolean operators (&, |, !)
    • Example: filter(df, age > 18 & city == "New York")
  • Remove duplicate rows from a data frame using the distinct() function
    • Specify columns to consider for uniqueness or apply to the entire data frame
    • Example: distinct(df, id, name)
  • Select rows by their integer indices using the slice() function
    • Similar to base R subsetting with square brackets
    • Example: slice(df, 1:10) selects the first 10 rows
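The four functions above can be tried on a small data frame. The following is a minimal sketch; the data frame df and its columns (id, name, age, city) are hypothetical and exist only for illustration.

```r
library(dplyr)

# Hypothetical toy data; column names are illustrative only
df <- tibble(
  id   = c(1, 2, 2, 3, 4),
  name = c("Ana", "Ben", "Ben", "Cho", "Dev"),
  age  = c(17, 25, 25, 31, 19),
  city = c("Boston", "New York", "New York", "New York", "Boston")
)

# Keep two columns and rename one in the same call (new_name = old_name)
picked <- select(df, id, full_name = name)

# Keep only rows where both conditions are TRUE
adults_ny <- filter(df, age > 18 & city == "New York")

# Keep one row per id/name combination; other columns are dropped by default
unique_people <- distinct(df, id, name)

# Keep rows by position
first_two <- slice(df, 1:2)
```

Note that distinct(df, id, name) returns only the id and name columns; passing .keep_all = TRUE keeps the remaining columns as well.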

Arranging and sorting data

  • Sort the rows of a data frame based on one or more columns using the arrange() function
    • By default, sorts in ascending order
    • Use desc() to sort in descending order
    • Example: arrange(df, desc(age), name)
  • Combine arrange() with other dplyr functions for more complex sorting
    • Example: df %>% filter(city == "New York") %>% arrange(desc(salary))
    • Sorts the filtered data frame by salary in descending order
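As a rough illustration of the sorting rules above, the sketch below sorts a hypothetical data frame (the columns name, age, city, and salary are made up for this example) first by descending age with name as a tie-breaker, then by salary within a filtered subset.

```r
library(dplyr)

# Hypothetical toy data for illustration
df <- tibble(
  name   = c("Ana", "Ben", "Cho", "Dev"),
  age    = c(30, 25, 30, 41),
  city   = c("New York", "Boston", "New York", "New York"),
  salary = c(70000, 52000, 88000, 61000)
)

# Oldest first; rows with equal age are ordered by name (ascending)
by_age <- arrange(df, desc(age), name)

# Filter first, then sort the remaining rows by salary, highest first
ny_by_salary <- df %>%
  filter(city == "New York") %>%
  arrange(desc(salary))
```

Here Ana and Cho share the same age, so the second sort key (name) decides their relative order.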

Creating and summarizing variables

Creating and modifying variables

  • Create new columns or modify existing columns using the mutate() function
    • Perform calculations, apply functions, or use conditional logic to define new values
    • Example: mutate(df, new_col = old_col * 2, is_adult = age >= 18)
  • Use the transmute() function to create new columns and drop all other columns
    • Similar to mutate() but keeps only the newly created or modified columns
    • Example: transmute(df, double_age = age * 2)
  • Apply functions to multiple columns using the across() function within mutate()
    • Use column names or selection helpers (starts_with(), ends_with(), contains())
    • Example: mutate(df, across(starts_with("score_"), ~ .x / 100))
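The difference between mutate(), transmute(), and across() is easiest to see side by side. This is a minimal sketch on a hypothetical data frame; the columns age, score_math, and score_art are assumptions made for the example.

```r
library(dplyr)

# Hypothetical toy data for illustration
df <- tibble(
  age        = c(16, 22, 40),
  score_math = c(80, 95, 60),
  score_art  = c(70, 50, 100)
)

# mutate() adds the new columns and keeps all existing ones
with_flags <- mutate(df, double_age = age * 2, is_adult = age >= 18)

# transmute() returns only the columns it creates
only_new <- transmute(df, double_age = age * 2)

# across() applies one function to every column matched by the helper;
# .x stands for the current column inside the lambda
rescaled <- mutate(df, across(starts_with("score_"), ~ .x / 100))
```

By default across() overwrites the matched columns in place, so rescaled still has the columns age, score_math, and score_art, with the two score columns divided by 100.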

Summarizing data

  • Calculate summary statistics for one or more columns using the summarize() function
    • Returns a new data frame with one row per summarized group
    • Example: summarize(df, mean_age = mean(age), max_score = max(score))
  • Use the across() function within summarize() to apply functions to multiple columns
    • Example: summarize(df, across(starts_with("score_"), mean))
  • Count the number of rows in each group using the count() function
    • Shortcut for group_by() followed by summarize(n = n())
    • Example: count(df, city) counts the number of rows for each unique city
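The summary functions above can be sketched as follows. The data frame and its columns (city, age, score_math, score_art) are hypothetical; the last two lines check the claim that count() matches a group_by() plus summarize() pair.

```r
library(dplyr)

# Hypothetical toy data for illustration
df <- tibble(
  city       = c("Boston", "Boston", "Austin"),
  age        = c(20, 30, 40),
  score_math = c(80, 90, 70),
  score_art  = c(60, 80, 100)
)

# No grouping: the whole data frame collapses to a single row
overall <- summarize(df, mean_age = mean(age), max_score = max(score_math))

# One mean per matched column
score_means <- summarize(df, across(starts_with("score_"), mean))

# count() and its longer equivalent
per_city    <- count(df, city)
same_counts <- df %>% group_by(city) %>% summarize(n = n())
```

Both per_city and same_counts contain one row per city with the row count stored in a column named n.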

Grouped operations in dplyr

Grouping data

  • Split a data frame into groups based on one or more columns using the group_by() function
    • Subsequent operations (summarize(), mutate()) will be applied independently to each group
    • Example: group_by(df, city, gender)
  • Remove the grouping structure from a data frame using the ungroup() function
    • Subsequent operations are applied to the entire data frame as a whole
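    • Example: df %>% group_by(city) %>% mutate(mean_age = mean(age)) %>% ungroup()
    • summarize() already drops the last grouping level, so ungroup() matters most after a grouped mutate() or when grouping by several columns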
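A short sketch of how grouping changes what later functions compute. The data frame and the column age_vs_city are hypothetical names chosen for this example.

```r
library(dplyr)

# Hypothetical toy data for illustration
df <- tibble(
  city   = c("Boston", "Boston", "Austin", "Austin"),
  gender = c("F", "M", "F", "F"),
  age    = c(20, 30, 40, 50)
)

# group_by() does not change the rows; it only records the grouping
grouped <- group_by(df, city, gender)
grouping_columns <- group_vars(grouped)

# Inside a grouped mutate(), mean(age) is computed separately per city
centered <- df %>%
  group_by(city) %>%
  mutate(age_vs_city = age - mean(age)) %>%
  ungroup()  # later steps now see one ungrouped data frame again
```

Without the group_by() step, mean(age) would be the mean over all four rows, and every age_vs_city value would be measured against that single overall mean.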

Group-wise operations

  • Count the number of rows in the current group using the n() function within summarize()
    • Example: df %>% group_by(city) %>% summarize(group_size = n())
  • Count the number of unique values in a column for the current group using the n_distinct() function within summarize()
    • Example: df %>% group_by(gender) %>% summarize(unique_cities = n_distinct(city))
  • Return the first, last, or nth value of a column for each group using first(), last(), or nth() within summarize()
    • Example: df %>% group_by(city) %>% summarize(first_name = first(name), last_score = last(score))
    • On an ungrouped data frame, these functions treat the whole data frame as a single group
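The group-wise helpers above can be combined in a single summarize() call. In this sketch the data frame, the grouping column team, and the output column names are all hypothetical.

```r
library(dplyr)

# Hypothetical toy data for illustration
df <- tibble(
  team  = c("a", "a", "a", "b", "b"),
  name  = c("Ana", "Ben", "Cho", "Dev", "Eli"),
  city  = c("Boston", "Boston", "Austin", "Austin", "Austin"),
  score = c(10, 20, 30, 40, 50)
)

per_team <- df %>%
  group_by(team) %>%
  summarize(
    group_size    = n(),               # rows in the current group
    unique_cities = n_distinct(city),  # distinct city values in the group
    first_name    = first(name),       # value from the group's first row
    second_name   = nth(name, 2),      # value from the group's second row
    last_score    = last(score)        # value from the group's last row
  )
```

first(), last(), and nth() depend on row order within each group, so it is usually worth calling arrange() beforehand when the order matters.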

Efficient data pipelines in dplyr

Chaining functions with the pipe operator

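  • Use the pipe operator (%>%) to chain multiple dplyr functions together; it comes from the magrittr package and is available once dplyr is loaded
    • Creates a pipeline that reads in the order the steps are executed
    • Passes the result of the previous function as the first argument to the next function
    • R 4.1 and later also provide a native pipe (|>) that can be used the same way in these examples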
    • Example: df %>% filter(age > 18) %>% group_by(city) %>% summarize(mean_income = mean(income))
  • Break down complex data manipulations into a series of smaller, more manageable steps using the pipe operator
    • Improves code readability and maintainability
    • Example: df %>% select(id, name, age) %>% filter(age >= 18) %>% mutate(adult = TRUE)
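To make the "first argument" rule concrete, the sketch below writes the same computation twice, once as a pipeline and once as nested calls. The data frame and its columns (city, age, income) are hypothetical.

```r
library(dplyr)

# Hypothetical toy data for illustration
df <- tibble(
  city   = c("Boston", "Boston", "Austin", "Austin"),
  age    = c(17, 30, 45, 22),
  income = c(0, 50000, 80000, 40000)
)

# Piped: read top to bottom, one step per line
piped <- df %>%
  filter(age > 18) %>%
  group_by(city) %>%
  summarize(mean_income = mean(income))

# Nested: the same calls, read from the inside out
nested <- summarize(
  group_by(filter(df, age > 18), city),
  mean_income = mean(income)
)
```

Both versions produce the same values; the piped form simply lists the steps in the order they run.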

Avoiding intermediate variables

  • Use the pipe operator to avoid creating intermediate variables
    • Leads to cleaner and more concise code
    • Example: Instead of filtered_df <- filter(df, age > 18); summarized_df <- summarize(filtered_df, mean_age = mean(age)), use df %>% filter(age > 18) %>% summarize(mean_age = mean(age))
  • Ensure that the output of each step in the pipeline is compatible with the input expected by the next function
    • Pay attention to the structure and column names of the data frame at each step
    • Example: df %>% select(id, name) %>% group_by(id) %>% summarize(name_count = n()) works because select() keeps id, so the column is still available to group_by()
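The compatibility point above can be sketched with one pipeline that works and one that fails. The data frame is hypothetical, and the .data pronoun is used in the failing step only so that the lookup is restricted to columns of the data frame.

```r
library(dplyr)

# Hypothetical toy data for illustration
df <- tibble(
  id   = c(1, 1, 2),
  name = c("Ana", "Ana", "Ben"),
  age  = c(20, 20, 35)
)

# Works: id survives select(), so group_by(id) can find it
name_counts <- df %>%
  select(id, name) %>%
  group_by(id) %>%
  summarize(name_count = n())

# Fails: age was dropped by select(), so the later filter() cannot use it.
# tryCatch() captures the error so the script keeps running.
attempt <- tryCatch(
  df %>% select(id, name) %>% filter(.data$age >= 18),
  error = function(e) e
)
pipeline_failed <- inherits(attempt, "error")
```

Moving filter() ahead of select(), or keeping age in the select() call, would resolve the failure in the second pipeline.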