Summarize Data

Now we’ll add two more functions from the {dplyr} package to complete our set of basic wrangling functions: summarize() and group_by().

summarize()

It is often useful to wrangle your data into summary tables. You can get basic summary stats with the summary() function, but we can get more detailed or specific tables using summarize().

Are those functions annoyingly closely named? Yes. This is sort of the product of anyone being able to write libraries for R. summary() comes from base R and summarize() comes from {dplyr}, which is part of the {tidyverse}. Sometimes people use the exact same name for functions across different packages, and this can cause problems that have a simple fix (specifying the package name, as in dplyr::summarize()) but are often frustratingly hard to recognize.
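
As a small sketch of what specifying the package name looks like, assuming the penguins data comes from the {palmerpenguins} package (the column name avg_mass is just an illustrative choice):

```{webr-r}
library(palmerpenguins)  # assumed source of the penguins data

# base R: a quick overview of a single column
base::summary(penguins$body_mass_g)

# {dplyr}: a one-row summary table. The dplyr:: prefix makes it
# unambiguous which package's summarize() is being called.
# na.rm = TRUE skips missing values.
mass_tbl <- dplyr::summarize(penguins, avg_mass = mean(body_mass_g, na.rm = TRUE))
mass_tbl
```

The prefix is optional when only one loaded package defines the function, but it is a handy thing to try when a familiar function suddenly behaves strangely.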

The summarize() function will make a new data table based on how you want to quantify your data. A typical example is to use it with basic statistical functions like mean() and sd().

Let’s build a summary table of the mean and standard deviation of the penguins’ body_mass_g variable. Here’s the code:
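
A minimal sketch of that call, assuming {dplyr} is installed and the penguins data comes from the {palmerpenguins} package (the object name na_summary is only for illustration):

```{webr-r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

# One row, one column per statistic
na_summary <- penguins |>
  summarize(mean(body_mass_g), sd(body_mass_g))
na_summary
```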

You’ll notice that you get NA for your outcomes. This is because mean() and sd() return NA whenever the data contain missing values, and summarize() does not automatically remove NA values like summary() does. There are a couple of ways you can get around that. You can add the argument na.rm = TRUE inside the mean() and sd() functions. Here’s an example:
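
One way this could look, again assuming {dplyr} and {palmerpenguins} are available (na_rm_summary is an illustrative name):

```{webr-r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

# na.rm = TRUE tells each function to drop missing values
# before calculating; it has to be set in every function call
na_rm_summary <- penguins |>
  summarize(mean(body_mass_g, na.rm = TRUE),
            sd(body_mass_g, na.rm = TRUE))
na_rm_summary
```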

You could also use the filter() function that we learned in the last workshop, combined with !is.na(), to drop the rows with missing values. Try using that before summarizing:
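
If you get stuck, here is one possible version, sketched under the same assumptions ({dplyr} and {palmerpenguins} loaded; filtered_summary is an illustrative name):

```{webr-r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

filtered_summary <- penguins |>
  filter(!is.na(body_mass_g)) |>   # keep only rows with a recorded body mass
  summarize(mean(body_mass_g), sd(body_mass_g))
filtered_summary
```

The two approaches give the same numbers here. The difference is that filter() removes entire rows from everything that follows, while na.rm = TRUE only affects the single calculation it is written in.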

Note that the names in the new table are just the function calls themselves (for example, mean(body_mass_g)), but if you want, you can give them explicit names:
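
A short sketch of naming the columns, assuming {dplyr} and {palmerpenguins} are loaded. The names go on the left of an = inside summarize(); named_summary is an illustrative object name.

```{webr-r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

named_summary <- penguins |>
  filter(!is.na(body_mass_g)) |>
  summarize(average = mean(body_mass_g),   # column will be called "average"
            st_dev = sd(body_mass_g))      # column will be called "st_dev"
named_summary
```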

Functions useful with summarize()

Any of the functions that calculate basic statistics are often paired with summarize(). Here are a few of the most common ones.

  • mean()
  • median()
  • min()
  • max()
  • IQR()
  • sd()
  • n()
  • n_distinct()
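
As a quick sketch, several of these can be combined in a single summarize() call. This assumes {dplyr} and {palmerpenguins} are loaded; the column names on the left of each = are arbitrary choices for illustration.

```{webr-r}
library(dplyr)
library(palmerpenguins)  # assumed source of the penguins data

mass_stats <- penguins |>
  filter(!is.na(body_mass_g)) |>
  summarize(middle = median(body_mass_g),
            lightest = min(body_mass_g),
            heaviest = max(body_mass_g),
            spread = IQR(body_mass_g),
            n_species = n_distinct(species))  # number of different species
mass_stats
```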

Of these, all work how you'd expect, with the possible exception of `n()`. `n()` counts the number of rows in the current data (or in each group, once groups are involved), and it does not take an argument inside the `()`. So, if we wanted to add a value for the total number of penguins, the code would be as follows:

```{webr-r}
penguins |>
  filter(!is.na(body_mass_g)) |>         # drop rows with missing body mass
  summarize(average = mean(body_mass_g),
            st_dev = sd(body_mass_g),
            total = n())                 # n() counts the remaining rows
```

The reason n() takes no argument will become clearer once we learn about group_by().