Wrangling into Summary Tables

Now we’ll add on two more wrangling functions that help us make summary tables: summarize() and group_by(). Summary tables are usually full of descriptive statistics—like mean, median, and standard deviation—that help us to make sense of our data.

summarize()

The summarize() function collapses your data into a new, smaller table that holds only the statistics you ask for. With no grouping, that table has a single row. A typical use is to pair it with basic statistical functions like mean() and sd().

Let’s find a summary table of mean and standard deviation for the penguin body_mass_g variable. Here’s the code:
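A minimal sketch of that summary, assuming the penguins data comes from the palmerpenguins package and that dplyr is loaded. The names mass_summary, mean_mass, and sd_mass are made up for illustration:

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the `penguins` data

# Collapse the whole table into one row of statistics.
# No missing-value handling yet, so both results come back as NA.
mass_summary <- penguins %>%
  summarize(
    mean_mass = mean(body_mass_g),
    sd_mass   = sd(body_mass_g)
  )

mass_summary
```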

You’ll notice that you get NA for your outcomes. This is because functions like mean() and sd() return NA whenever their input contains a missing value, and summarize() does not remove missing values for you. There are a couple of ways you can get around that. You can add the argument na.rm = TRUE inside the functions:
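One way this might look, again assuming dplyr and palmerpenguins (the output column names are illustrative):

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the `penguins` data

# na.rm = TRUE tells each statistic to drop missing values first.
mass_summary_narm <- penguins %>%
  summarize(
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    sd_mass   = sd(body_mass_g, na.rm = TRUE)
  )

mass_summary_narm
```

Note that na.rm = TRUE has to be repeated in every function that needs it.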

You could also use the filter() function together with is.na() to filter out rows that have missing values. is.na() returns TRUE for every value that is missing, so on its own it picks out the rows with missing values. If you want to remove missing values instead, add the ! operator in front, like this: !is.na(). That keeps only the rows whose values are not missing.
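As a rough sketch of the filtering step on its own, assuming dplyr and palmerpenguins (the name penguins_with_mass is hypothetical):

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the `penguins` data

# Keep only the rows where body_mass_g is NOT missing.
penguins_with_mass <- penguins %>%
  filter(!is.na(body_mass_g))

# Compare row counts before and after the filter.
nrow(penguins)
nrow(penguins_with_mass)
```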

Try adding to this code to remove the missing weights before piping to summarize.

Any function that calculates a basic statistic can be paired with summarize(). Here are a few of the most common ones.

-   `mean()`
-   `median()`
-   `min()`
-   `max()`
-   `IQR()`
-   `sd()`
-   `n()`
-   `n_distinct()`
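To illustrate, several of these can go into a single summarize() call. This is only a sketch, assuming dplyr and palmerpenguins, and the output column names are invented for the example:

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the `penguins` data

mass_stats <- penguins %>%
  summarize(
    median_mass = median(body_mass_g, na.rm = TRUE),
    min_mass    = min(body_mass_g, na.rm = TRUE),
    max_mass    = max(body_mass_g, na.rm = TRUE),
    iqr_mass    = IQR(body_mass_g, na.rm = TRUE),
    n_rows      = n(),                 # number of rows, missing values included
    n_species   = n_distinct(species)  # number of unique species
  )

mass_stats
```

n() takes no arguments and counts rows, while n_distinct() counts the unique values of the column you give it.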

Adding in a Grouping Factor

summarize() on its own isn’t much more useful than summary(), but we can combine it with group_by() to get a much more refined look at our data. Combined, they let us break the data down by the values in one or more columns, so we can compare statistics across groups.

On its own, group_by() doesn’t look like it does anything, but it adds hidden structure that tells summarize() which rows belong to each unique value of the grouping column. Just to demonstrate, we’ll group our data by species. Here’s our code with just group_by():
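A sketch of what that might look like, assuming dplyr and palmerpenguins and keeping the missing-value filter from earlier (the name grouped_penguins is hypothetical):

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the `penguins` data

# group_by() keeps every row and column; it only records the grouping.
grouped_penguins <- penguins %>%
  filter(!is.na(body_mass_g)) %>%
  group_by(species)

grouped_penguins              # the printout notes the grouping in its header
group_vars(grouped_penguins)  # lists the grouping column(s)
```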

Pretty pointless: it just shows all the data, minus the filter. But if we group our data before running summarize(), the statistics will be calculated separately for each group within our specified variable. So to find mass separated by species, we would do:
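A minimal sketch of the grouped summary, assuming dplyr and palmerpenguins (the output names are illustrative):

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the `penguins` data

# Group first, then summarize: one row of statistics per species.
mass_by_species <- penguins %>%
  filter(!is.na(body_mass_g)) %>%
  group_by(species) %>%
  summarize(
    mean_mass = mean(body_mass_g),
    sd_mass   = sd(body_mass_g)
  )

mass_by_species
```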

Now we see that each unique value within our specified variable counts as its own group and summary statistics are calculated on the group level.

Mutating by Group

You can also use group_by() before mutate() to create new column values for individual groups. Let’s say we wanted to add a column that shows the number of penguins within each group. We can first group by the factors that we want separated and then create the new column. (I’m adding select() on the end here to move the new column to the front, where it is easier to see.)
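One possible version, assuming dplyr and palmerpenguins. The column name n_in_group is made up for this sketch:

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the `penguins` data

# mutate() keeps every row; n() is evaluated once per species-by-sex group.
penguins_with_counts <- penguins %>%
  group_by(species, sex) %>%
  mutate(n_in_group = n()) %>%
  select(species, sex, n_in_group, everything())  # new column moved up front

penguins_with_counts
```

Unlike summarize(), this keeps the original number of rows, and every penguin in the same group gets the same count.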

This table may be a little confusing because there are 73 male and 73 female Adelie penguins in the sample, but trust me, it is counting each combination of species and sex separately, which you can see from the rows where sex is NA.

If you don’t want to trust me, add a filter() that shows only the rows for Chinstrap penguins.
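If you want something to check your attempt against, here is one way the check could be written, assuming dplyr and palmerpenguins (n_in_group and chinstrap_counts are hypothetical names):

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the `penguins` data

chinstrap_counts <- penguins %>%
  group_by(species, sex) %>%
  mutate(n_in_group = n()) %>%
  select(species, sex, n_in_group, everything()) %>%
  filter(species == "Chinstrap")  # keep only the Chinstrap rows

chinstrap_counts
```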