Summarize by Group
group_by()
summarize() on its own isn’t much more useful than summary(), but combined with group_by() it gives a much more refined look at our data. Together they let us break the data down by the values in one or more columns and compare summary statistics across those groups.
Using group_by() alone doesn’t look like it does anything, but it adds hidden structure that tells summarize() which rows belong to each unique value of the grouping column. Just to demonstrate, we’ll group our data by species. Here’s our code with just group_by():
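A minimal sketch of that step, assuming the data is the penguins table from the palmerpenguins package and that the earlier filter dropped rows with a missing body_mass_g. The names penguins_clean and by_species are placeholders of mine, not names from the original code.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

# Assumed stand-in for the earlier filter: drop rows with no body mass
penguins_clean <- penguins %>%
  filter(!is.na(body_mass_g))

# group_by() on its own: same rows, plus hidden grouping metadata
by_species <- penguins_clean %>%
  group_by(species)

by_species              # prints the same rows; the header notes the grouping
group_vars(by_species)  # names of the grouping columns
n_groups(by_species)    # number of distinct groups
```

The printed tibble header should include a "Groups:" line, which is the only visible sign that the grouping is there.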
Pretty pointless: it just shows all the data, minus the rows we filtered out. But if we group our data before running summarize(), the statistics are calculated separately for each group within our specified variable. So to find mass separated by species, we would do:
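Something along these lines should work, again assuming the palmerpenguins data and a filter on missing body_mass_g. The summary column names (mean_mass_g, sd_mass_g, n) are illustrative choices, not the ones used in the original code.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

# Assumed stand-in for the earlier filter
penguins_clean <- penguins %>%
  filter(!is.na(body_mass_g))

# One row of summary statistics per species
mass_by_species <- penguins_clean %>%
  group_by(species) %>%
  summarize(
    mean_mass_g = mean(body_mass_g),
    sd_mass_g   = sd(body_mass_g),
    n           = n()
  )

mass_by_species
```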
Now we see that each unique value within our specified variable counts as its own group and summary statistics are calculated on the group level.
You can also group by multiple factors. Here we’ll add another layer to group_by() by specifying both species and sex:
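A sketch of grouping by two columns, under the same assumptions as before (palmerpenguins data, rows with missing body_mass_g removed). The output column names are placeholders.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

# Assumed stand-in for the earlier filter
penguins_clean <- penguins %>%
  filter(!is.na(body_mass_g))

# One row per species/sex combination, including rows where sex is NA
mass_by_species_sex <- penguins_clean %>%
  group_by(species, sex) %>%
  summarize(
    mean_mass_g = mean(body_mass_g),
    n           = n()
  )

mass_by_species_sex
```

With more than one grouping column, summarize() by default removes only the last grouping level, so this result is still grouped by species.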
You can also use conditional statements within group_by() to avoid first having to mutate to create a column for the condition. Here’s an example where we want to compare heavier and lighter penguins, so we group them by body_mass_g > 4000, which splits the data into a TRUE group and a FALSE group:
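One way this could look, assuming the palmerpenguins data. Naming the condition (heavy = ...) is my addition to keep the output readable; leaving the name off gives a column titled with the expression itself. The summary columns are illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

# Assumed stand-in for the earlier filter
penguins_clean <- penguins %>%
  filter(!is.na(body_mass_g))

# Group on a logical condition computed inside group_by()
size_groups <- penguins_clean %>%
  group_by(heavy = body_mass_g > 4000) %>%
  summarize(
    n               = n(),
    mean_mass_g     = mean(body_mass_g),
    mean_flipper_mm = mean(flipper_length_mm, na.rm = TRUE)
  )

size_groups
```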
Mutating by Group
You can also use group_by() before mutate() to create new column values for individual groups. Let’s say we wanted to add a column that shows the number of penguins within that group. We can first group by the factors that we want separated and then create the new column. (I’m adding select() on the end here to move the column to be easier to see.)
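A sketch of a grouped mutate(), assuming the palmerpenguins data. The column name n_in_group and the table name penguins_counted are hypothetical.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

# Assumed stand-in for the earlier filter
penguins_clean <- penguins %>%
  filter(!is.na(body_mass_g))

# n() is evaluated per group, so every row gets its own group's size
penguins_counted <- penguins_clean %>%
  group_by(species, sex) %>%
  mutate(n_in_group = n()) %>%
  select(species, sex, n_in_group, everything())  # move the new column forward

penguins_counted
```

Unlike summarize(), mutate() keeps every row and leaves the grouping in place.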
This table may be a little confusing because there are 73 male and 73 female Adelie penguins in the sample, but trust me, it is counting each combination of species and sex separately, which you can see from the rows where sex is NA. (If you don’t want to trust me, add another pipe and the line print(n = 300) to the code. That’ll print 300 rows.)
Ungrouping
group_by() adds hidden structure to your data, which can occasionally be a problem if you forget it is there. Let’s say we saved the code from above as a new table and now wanted to add a column that just shows the total number of penguins, so this would be a column repeating the same number throughout. We might think the following code would work because we aren’t explicitly grouping before creating total, but run it and see:
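A sketch of the pitfall, assuming the palmerpenguins data and a saved grouped table that I am calling penguins_counted (a hypothetical name, as is n_in_group).

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

# Rebuild the saved table; it is still grouped by species and sex
penguins_counted <- penguins %>%
  filter(!is.na(body_mass_g)) %>%   # assumed stand-in for the earlier filter
  group_by(species, sex) %>%
  mutate(n_in_group = n())

# No group_by() here, but the saved grouping still applies to n()
still_grouped <- penguins_counted %>%
  mutate(total = n()) %>%
  select(species, sex, n_in_group, total, everything())

still_grouped
group_vars(still_grouped)  # the grouping columns are still attached
```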
You don’t get an overall total; rather, you’re still getting the totals for each group. This is because the grouping structure stays hidden in the table unless you explicitly remove it with the command ungroup(). You don’t need to specify what to ungroup by: just leave the () blank and everything goes back to a single, non-stratified dataset. Here’s the same code, but with ungroup() run before the mutate() command.
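The fix might look like this, under the same assumptions (palmerpenguins data, hypothetical names penguins_counted and n_in_group).

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

# Rebuild the saved table; it is still grouped by species and sex
penguins_counted <- penguins %>%
  filter(!is.na(body_mass_g)) %>%   # assumed stand-in for the earlier filter
  group_by(species, sex) %>%
  mutate(n_in_group = n())

# ungroup() removes the grouping, so n() now counts every row
with_total <- penguins_counted %>%
  ungroup() %>%
  mutate(total = n()) %>%
  select(species, sex, n_in_group, total, everything())

with_total
```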
Now we see the grand total for our penguin dataset (minus the two filtered observations).
Another way to do this if you’re using summarize() is to add the argument .groups = "drop" inside the summarize() call. The following code will group the penguins by species and then remove the grouping structure after finding the average body weight.
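A sketch of that call, assuming the palmerpenguins data. The second pipeline, which groups by two columns, is my addition to show where the argument changes the result; the table and column names are placeholders.

```r
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

# Assumed stand-in for the earlier filter
penguins_clean <- penguins %>%
  filter(!is.na(body_mass_g))

# Average body mass per species, returned with no grouping attached
avg_mass <- penguins_clean %>%
  group_by(species) %>%
  summarize(mean_mass_g = mean(body_mass_g), .groups = "drop")

# With two grouping columns, "drop" removes both levels at once
avg_mass_sex <- penguins_clean %>%
  group_by(species, sex) %>%
  summarize(mean_mass_g = mean(body_mass_g), .groups = "drop")

group_vars(avg_mass)      # no grouping columns left
group_vars(avg_mass_sex)  # no grouping columns left
```

When there is only one grouping column, summarize() already returns an ungrouped table by default, so .groups = "drop" mostly matters once you group by several columns.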