Add-on Functions
Most of the functions we have learned in this and the previous workshop come from the {dplyr} package. The big five that you’ll probably use most frequently are select(), filter(), mutate(), group_by(), and summarize(). But there are a few other {dplyr} functions that are useful, especially in conjunction with those.
rename()
rename() does what it sounds like: it renames a column using the format new_name = old_name. Note that neither name is in quotes.
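As a minimal sketch of that format, the code below renames one column. The penguins_toy data frame and its values are invented stand-ins for the penguins dataset, and bill_length is just an illustrative new name.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data (values invented for illustration)
penguins_toy <- tibble(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, 46.5, 49.0)
)

# new_name = old_name, with neither name in quotes
renamed <- penguins_toy %>%
  rename(bill_length = bill_length_mm)

names(renamed)
```

The other columns are left untouched and keep their original order.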
arrange()
If you want to sort your rows from lowest to highest value or vice versa, use the function arrange(). By default, arrange() puts the smallest value first. For example, arrange(penguins, bill_length_mm) orders the penguins dataset by bill_length_mm.
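Here is a small sketch of the default ascending sort. It assumes a made-up penguins_toy data frame in place of the real penguins data so that the result is easy to check by eye.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data (values invented for illustration)
penguins_toy <- tibble(
  species = c("Gentoo", "Adelie", "Chinstrap", "Adelie"),
  bill_length_mm = c(46.5, 39.1, 49.0, 41.3)
)

# Default order: smallest bill_length_mm first
sorted_asc <- penguins_toy %>%
  arrange(bill_length_mm)

sorted_asc
```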
If you’d like your data to go in the opposite direction and start with the highest value, wrap your variable in the desc() function, as in arrange(penguins, desc(bill_length_mm)).
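The descending version can be sketched the same way. Again, penguins_toy is an invented stand-in for the penguins data.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data (values invented for illustration)
penguins_toy <- tibble(
  species = c("Gentoo", "Adelie", "Chinstrap", "Adelie"),
  bill_length_mm = c(46.5, 39.1, 49.0, 41.3)
)

# desc() reverses the sort so the largest bill_length_mm comes first
sorted_desc <- penguins_toy %>%
  arrange(desc(bill_length_mm))

sorted_desc
```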
You can also arrange by multiple variables by listing more of them inside the (). This doesn’t work so well for the penguins, since their measurements are mostly unique values, so picture a small dataset with columns x and y that contains repeated values.
The data is first arranged by column x; where x has repeated values, those rows are then arranged by column y.
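The two-level ordering described above can be sketched with a tiny made-up data frame. The name xy and its values are assumptions chosen so that x contains ties.

```r
library(dplyr)

# Hypothetical data with repeated values in x (invented for illustration)
xy <- tibble(
  x = c(2, 1, 2, 1, 3),
  y = c(5, 4, 1, 2, 3)
)

# Sort by x first; ties in x are then broken by y
sorted_xy <- xy %>%
  arrange(x, y)

sorted_xy
```

In the result, the two rows with x = 1 appear first, ordered by their y values, followed by the two rows with x = 2.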
slice()
slice() is a function that keeps only the rows you specify by position. For example, slice(penguins, 5:7) returns rows 5, 6, and 7. (Run just penguins if you want to verify these are the correct rows.)
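A minimal sketch of slicing by position follows. The penguins_toy data frame is an invented stand-in with an id column added so you can see which rows come back.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data (values invented for illustration)
penguins_toy <- tibble(
  id = 1:8,
  bill_length_mm = c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1)
)

# Keep rows 5, 6, and 7 by position
rows_5_to_7 <- penguins_toy %>%
  slice(5:7)

rows_5_to_7
```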
slice() has a number of sister functions that work similarly: slice_head() and slice_tail() return rows from the beginning or the end of the dataset, respectively. For these functions, you specify the number of rows you want returned with n =. See if you can figure out the code to slice the first three rows of the data.
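To show the n = argument without giving away the exercise, here is a sketch of slice_tail() only. The penguins_toy data frame is an invented stand-in for the penguins data.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data (values invented for illustration)
penguins_toy <- tibble(
  id = 1:8,
  bill_length_mm = c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1)
)

# Keep the last two rows of the data
last_two <- penguins_toy %>%
  slice_tail(n = 2)

last_two
```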
In a similar fashion, you can use slice_min() or slice_max() to keep a specified number of rows with the lowest or highest values in a column.
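As a rough sketch, the code below pulls the two smallest and the two largest bill lengths from an invented penguins_toy data frame. Note that by default these functions keep ties, so they may return more than n rows when values are tied.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data (values invented for illustration)
penguins_toy <- tibble(
  id = 1:8,
  bill_length_mm = c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1)
)

# The two rows with the smallest bill_length_mm
lowest_two <- penguins_toy %>%
  slice_min(bill_length_mm, n = 2)

# The two rows with the largest bill_length_mm
highest_two <- penguins_toy %>%
  slice_max(bill_length_mm, n = 2)

lowest_two
highest_two
```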
slice_sample() is also useful for extracting a specified number of random rows.
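A minimal sketch of random sampling is below. The penguins_toy data frame is invented, and the set.seed() call is an addition here so that the same rows are drawn each time you run it.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data (values invented for illustration)
penguins_toy <- tibble(
  id = 1:8,
  bill_length_mm = c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1)
)

# Fix the random seed so the draw is reproducible
set.seed(42)

# Draw three rows at random, without replacement (the default)
random_three <- penguins_toy %>%
  slice_sample(n = 3)

random_three
```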
distinct()
This function will subset your data to keep only unique rows. If you leave the () empty, it removes any row that duplicates an earlier row across all columns.
For example, in a dataset where row 4 is an exact copy of an earlier row, distinct() reduces the dataset by one row by removing the duplicated row 4.
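The row-4 case can be sketched with a small made-up data frame. The name dup and its values are assumptions, built so that row 4 is an exact copy of row 1.

```r
library(dplyr)

# Hypothetical data in which row 4 repeats row 1 (invented for illustration)
dup <- tibble(
  x = c(1, 2, 3, 1, 3),
  y = c(1, 2, 2, 1, 2),
  z = c("a", "b", "c", "a", "d")
)

# With no columns listed, distinct() drops rows duplicated across all columns
unique_rows <- dup %>%
  distinct()

unique_rows
```

The first occurrence of each row is kept, so the result has four rows instead of five.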
You can also find distinct values from a single column or from multiple columns. Running distinct() on a single column gives the same values as running unique() from base R, although distinct() returns a data frame rather than a vector. If you list multiple columns, it returns every distinct combination of values that occurs across those columns.
You can see that the pair (3, 2) is only listed once even though there are two (3, 2) combinations in the original dataset.
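Here is a sketch of both cases using a made-up data frame. The name dup and its values are assumptions, chosen so that the pair (3, 2) occurs twice.

```r
library(dplyr)

# Hypothetical data containing the pair (3, 2) twice (invented for illustration)
dup <- tibble(
  x = c(1, 2, 3, 1, 3),
  y = c(1, 2, 2, 1, 2),
  z = c("a", "b", "c", "a", "d")
)

# Single column: a one-column data frame of the unique x values
distinct_x <- dup %>%
  distinct(x)

# Base R equivalent, which returns a plain vector instead
unique_x <- unique(dup$x)

# Multiple columns: each combination of x and y appears once
distinct_xy <- dup %>%
  distinct(x, y)

distinct_xy
```

By default only the listed columns are returned, so z is dropped from both results.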