Joining Tables by Variable

Many times you are not combining tables that are set up exactly the same way. Often you are using values in one table to pull specific values from another. That’s what the following commands will help you do: combine data based on key values in each dataset.

left_join()

A “left join” is a method for adding to an existing table. The use of the terms “left” and “right” for joining tables comes from pretending that you have both tables physically laid out next to each other, so one is on the left and the other on the right.

A left join will keep every item in the left table and add to it items from the right table. You will use a variable from the left table to match values of that same variable in the right table.

Here’s a couple simple fruity tables for joining:

If we want a dataset of our cost information with the colors added to it we can perform a left join. All we have to do is specify what our key is for matching the tables by. Since fruit appears in both tables that’s what we’ll use to look up values.

The syntax for left_join() is the name of the left dataset, the name of the right dataset, then the variable to join by.

You’ll see that we add the color column to the cost dataset. If there were more columns in color it would have added those too. It will pull all columns where any row matches the by argument.

You should also note that there is an NA value in the color column for peach. This is because peach was not listed in the color dataset so there was no value to grab.

right_join()

There is a very similar function called right_join() that you have probably already guessed how it operates. It will do the same thing as left joining, except it will keep everything in the right hand table and pull values from the left table.

Important: You will always list the left table then the right table. It’s the command itself that will tell R how to join them.

Here’s a right join of the two fruit datasets:

We can write out this function another way that’s often useful. If you pass a dataset to either join function, it will use that in place of the first table. So this works the same way as above for performing a right join:

Joining by Multiple Variables

You can also use multiple variables to join your datasets together. Here are two tables of fake weather data.

When combining them we want to be sure that we’re matching both city and month to retain accurate data. Now we’ll expand the by argument with the command join_by() which can handle multiple variables. Here’s the code to perform a left join of weather to temp matching by city and month: >Note that with join_by() arguments you just list the column name but don’t use quotes.

You should see the added columns of precip and wind that match the corresponding city and month. Seattle’s February weather data contains NA values since there is no weather data for Seattle in February.

Joining Mismatched Names

One nice feature of the join functions is that your data doesn’t have to have exactly matching names for you to be able to use that column as a lookup key. Here’s the same weather table as above but this time city is called location:

We can still join these tables as before by adding additional information to the join_by() command. Now we state the name of the column in the left table and then use == to match it to the name of the column in the right table. Note that since we’re looking for character matches, we need to use quotations. > The quotations on or off thing is annoying to remember. Trial and error always works.

Your output should be the same as before when both columns were called city.