Working with Column Variables

Now let’s learn a few easy functions and get used to column notation. We’ll put this together with what we covered earlier about loading libraries.

Built-in Data

R has a number of built-in datasets that you can use for example calculations. If you type and run data(), you can see all of them. Packages that you load also come with example datasets; we’ll use one of those from the package {palmerpenguins}.

Putting package names in {} is a style choice, but it’s what the pros do.
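If you want to poke at a built-in dataset before loading any package, here is a minimal sketch. It assumes you pick mtcars, one of the datasets that ships with base R; any other name listed by data() would work the same way.

```r
# mtcars comes with base R, so no library() call is needed.
head(mtcars)    # show the first six rows
names(mtcars)   # list the column names
nrow(mtcars)    # count the rows

# Store the row count so it can be reused later.
n_cars <- nrow(mtcars)
```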

We’ll load the package and then assign the dataset in it to a variable name.

library(palmerpenguins)

penguin_data <- penguins

*Note: I could just use the built-in object name penguins, but I wanted to show that you can assign objects to whatever names you want.

You can click on penguin_data in the environment pane to see what the data looks like. The researchers recorded size measurements for three species of penguins on three islands over a number of years. (This is real data.)

If I want to save a column as its own variable with a name I choose, I can use <- and $ together. Let’s say I want a variable called location that holds the island data. I would type the following:

location <- penguin_data$island

Note that this stores the data as an object called location (you can tell because it appears in the top right pane), but it doesn’t print out the island names. To see the values, you have to explicitly tell R you want them shown. You can do this with print() or just by running the name on its own line.

print(location)

or

location
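To see the store-then-print pattern without the full dataset, here is a small sketch. The data frame toy_penguins is hypothetical and built by hand for illustration; it is not part of {palmerpenguins}.

```r
# A hand-made stand-in for penguin_data (hypothetical, three rows only).
toy_penguins <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island  = c("Torgersen", "Biscoe", "Dream"),
  stringsAsFactors = FALSE
)

# Assignment stores the column but prints nothing.
location <- toy_penguins$island

# Running the name on its own line prints the stored values.
location

# length() tells you how many values were pulled out of the column.
n_locations <- length(location)
```

The same two steps apply to any column: pull it out with $, then run the new name to look at it.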

If a column is numeric (contains only numbers), we can do math on that column. Let’s say that our scale for weighing penguins was off by 10 grams. We can take the whole column of body_mass_g and subtract 10. That code would look like:

penguin_data$body_mass_g - 10
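One thing worth knowing is that running that line only prints the corrected numbers; the column itself is left alone. The sketch below shows this on a hypothetical three-row data frame with made-up weights. The column name corrected_mass_g is my own choice for illustration.

```r
# Hypothetical data with made-up weights in grams.
toy_penguins <- data.frame(body_mass_g = c(3750, 3800, 3250))

# Subtracting 10 applies to every value in the column at once.
toy_penguins$body_mass_g - 10

# The original column has not changed.
toy_penguins$body_mass_g

# To keep the corrected values, assign them to a new column (or a variable).
toy_penguins$corrected_mass_g <- toy_penguins$body_mass_g - 10
```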

Now let’s say we want to take the average weight of all penguins using the function mean(). We could do that two ways: we could make a variable for the corrected weight and take its mean:

correct_mass <- penguin_data$body_mass_g - 10
mean(correct_mass)

or we could perform our function directly:

mean(penguin_data$body_mass_g - 10)
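If you want to convince yourself that both approaches agree, here is a quick check on a hypothetical vector of made-up weights with no missing values.

```r
# Made-up weights in grams, used only for this check.
body_mass_g <- c(3750, 3800, 3250)

# Way 1: store the corrected values, then take the mean.
correct_mass <- body_mass_g - 10
mean_two_step <- mean(correct_mass)

# Way 2: do the subtraction inside mean().
mean_direct <- mean(body_mass_g - 10)

# Both give the same number.
mean_two_step == mean_direct
```

The two-step version is handy when you need the corrected values again later; the direct version saves a line when you don’t.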

You may notice that you get NA for the answers. This is because some of the data is missing. Let’s create a new dataset and remove the missing data with the function na.omit(), which drops every row that has a missing value in any column.

penguin_complete <- na.omit(penguin_data)

Now if you replace penguin_data with penguin_complete, the mean functions above should give numeric answers.
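Another option, not used in this chapter, is the na.rm argument of mean(), which skips missing values in just that one column. The sketch below compares the two approaches on a hypothetical data frame with made-up numbers. Note that they can give different answers, because na.omit() also drops rows where some other column is missing.

```r
# Hypothetical data: one missing weight and one missing bill length.
toy <- data.frame(
  body_mass_g    = c(3750, NA, 3250, 4000),
  bill_length_mm = c(39.1, 40.0, NA, 45.0)
)

# A single NA makes the mean NA.
mean_with_na <- mean(toy$body_mass_g)

# na.rm = TRUE ignores missing values in this column only (3 values used).
mean_na_rm <- mean(toy$body_mass_g, na.rm = TRUE)

# na.omit() removes rows 2 and 3, because each has an NA somewhere.
toy_complete <- na.omit(toy)
mean_complete <- mean(toy_complete$body_mass_g)   # 2 values used
```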

Practice

Try to answer these practice questions that incorporate what we’ve learned so far.

  1. What is the new code for directly finding the mean of the corrected penguin weight?
mean(penguin_complete$body_mass_g - 10)
  2. How would I get a variable called “species” that includes everything in the species column?
species <- penguin_data$species
  3. The function unique() will show you unique values only. How can I use that to see what the different species are?
unique(species)
  4. Use the function sum() to find out how much all the penguins weigh together.
sum(penguin_complete$body_mass_g)
  5. I want to find the ratio between bill length and bill depth. To do that I would need to divide one by the other and store it in a variable called ratio. How would I do that?
ratio <- penguin_complete$bill_length_mm / penguin_complete$bill_depth_mm