Cleaning & Wrangling

Think Ahead

When you’re doing an analysis you should start with goals in mind, including what kind of plots will be most informative.

What do we want to do with this data?

see completion rates across time and across colleges
see if there are differences between race and sex
see raw number of graduates for Reed to see how we’re doing over time
make plots to show interesting trends
show statistical differences
predict future graduation rates

Pseudocoding

Pseudocoding is writing out your plan in words before you code. It’s how you plan out your code and break tasks down into individual chunks. For example, “load the data” is pseudocode. Pseudocode is often written as comments throughout the code to help explain what each step does.
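
As a rough sketch of what this looks like in practice, the comments below are the pseudocode and the lines under each comment carry it out. The small colleges data frame and its column names are invented for illustration; they are not the workshop's real dataset.

```r
library(dplyr)

# Made-up stand-in data (hypothetical columns, not the real dataset)
colleges <- data.frame(
  college   = c("Reed", "Reed", "Other College"),
  year      = c(2019, 2020, 2020),
  enrolled  = c(100, 100, 200),
  graduated = c(80, 82, 150),
  notes     = c("a", "b", "c")
)

# keep only the columns we need
cleaned <- colleges |>
  select(college, year, enrolled, graduated)

# make a separate dataset just for Reed
reed <- cleaned |>
  filter(college == "Reed")
```

Writing the comments first, and only then filling in the code, keeps each step small enough to look up on its own.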

If you can write good pseudocode, then you can code. Thinking through the steps is the bulk of coding; the rest is just looking up a command that does what you want. Unless you’re in a CS class, coding is open book. You can always use Google (or AI) to look up functions or syntax. Don’t worry about memorizing everything.

If you’re going to vibe code, meaning use an AI to help you write code, all you really need to give it is the pseudocode. If you can break things down into simple enough steps, the AI can help you fill in the gaps.

A quick note on AI: it’s most useful when you have no idea what you’re doing (because it can walk you through the basics slowly) and when you really know what you’re doing. If you’re in between, it can be hard to tell whether it’s giving you something correct or just something that looks correct. It’s great for explaining what working code is doing (for example, code your professor gives you) and for explaining R’s terrible error messages. Use it thoughtfully.

What We’ll Do

First let’s clean up our dataset a little. Let’s just keep the data that will be pertinent to our analysis so that our spreadsheet is easier to look at. We’ll also create a separate dataset just for Reed to analyze it separately. Let’s write some pseudocode to help us get our dataset into good shape.

Operators & Functions We’re Using

For more on any of these functions see the Cleaning & Wrangling: Part 1 workshop.

Pipe (|> or %>%)

A pipe in R is an operator that passes the output of one function in as the first argument of the next function, so you don’t need to create intermediate variables to hold each step.

The |> should always be the last thing on the line, and each new function should start on the next line. Think of the pipe as saying “and then” in a conversation.

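You will see two pipe symbols in R code. The older %>% comes from the magrittr package and is made available by tidyverse packages such as dplyr; the newer |> is built into R itself as of version 4.1. Depending on whose code you’re looking at, you could see either one. We will use |>, and for everyday use the two behave the same way.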
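
To make the comparison concrete, here is a small sketch that does the same two steps three ways: with intermediate variables, with |>, and with %>%. The grads data frame is invented for illustration.

```r
library(dplyr)

# Invented example data
grads <- data.frame(
  year      = c(1998, 1999, 2000, 2001),
  graduates = c(250, 260, 270, 280)
)

# Without a pipe: save each step in its own variable
recent       <- filter(grads, year >= 2000)
without_pipe <- select(recent, graduates)

# With the native pipe: each step feeds straight into the next
with_pipe <- grads |>
  filter(year >= 2000) |>
  select(graduates)

# With the older magrittr pipe (re-exported by dplyr)
with_old_pipe <- grads %>%
  filter(year >= 2000) %>%
  select(graduates)
```

All three versions should produce the same one-column result; the piped versions just skip the extra variable.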

Selecting and Deleting Columns

To keep only certain columns, or to delete columns, use select(). Put the columns you want to keep inside select(). If you have more than one, separate them with commas or wrap them in c(), which combines values into a vector.

To delete a column, use select() with a minus sign (-) in front of the column(s) you want to remove from the output.
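
A minimal sketch of both uses follows. The grads data frame and its column names are made up for the example.

```r
library(dplyr)

# Invented example data
grads <- data.frame(
  college   = c("Reed", "Other College"),
  year      = c(2020, 2020),
  graduates = c(300, 450),
  notes     = c("x", "y")
)

# Keep two columns, listed with commas
kept <- grads |>
  select(college, graduates)

# Same result with the columns wrapped in c()
kept_with_c <- grads |>
  select(c(college, graduates))

# Drop columns by putting a minus in front of them
dropped <- grads |>
  select(-c(year, notes))
```

Keeping and dropping are two routes to the same place here: kept and dropped end up with the same two columns, so pick whichever list is shorter to type.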

Renaming Columns

The function rename() does what it sounds like: it changes the name of a specified column. The syntax is to pick a new column name and set it equal to the old column name.

If we want to rename two columns, we can do both inside the same function, just separated by a comma:

data |>
  rename(new_col1 = col1,
         new_col2 = col2)

A comma is different from a pipe: a comma separates arguments inside a single function, while a pipe strings together multiple functions.

Mutating (Changing or Adding) Columns

The mutate() function lets us add or change columns. We can add a new column simply by giving it a name and a definition.

data |>
  mutate(percent = col1 / col2)

Remember, if the columns are numeric, you can do math on them just by using their names.

You can change an existing column by overwriting it.

data |>
  mutate(col1 = col1 * 1000)

Filtering (Keeping/Removing Based on a Condition)

Filtering is a way to subset a dataset by rows and keep only the rows that meet a certain condition. To do this, we use the filter() function, which will remove rows that don’t satisfy the condition in the parentheses.

If you use ==, you keep the rows whose value matches what you specify. If you use !=, you keep the rows whose value doesn’t match. You can also use < and > when filtering numeric columns.

data |>
  filter(year < 2000)
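
The other comparison operators work the same way. Here is a short sketch on an invented grads data frame (the column names and values are assumptions for illustration only):

```r
library(dplyr)

# Invented example data
grads <- data.frame(
  college   = c("Reed", "Reed", "Other College"),
  year      = c(1995, 2005, 2005),
  graduates = c(250, 300, 450)
)

# == keeps rows where college is exactly "Reed"
reed_only <- grads |>
  filter(college == "Reed")

# != keeps rows where college is anything except "Reed"
not_reed <- grads |>
  filter(college != "Reed")

# > and < compare numeric columns
after_2000 <- grads |>
  filter(year > 2000)
```

Note that text values like "Reed" need quotes, while numbers like 2000 do not.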