Gustavo Silva

Reputation: 159

R code incredibly slow

Recently I have been working on some R scripts to do some reports. One of the tasks involved is to check if a value in a column matches any row of another dataframe. If this is true, then set a new column with logical TRUE/FALSE.

More specifically, I need help improving this code chunk:

for (i in 1:length(df1$Id)) {
  df1 <- within(df1, newCol <- df1$Id %in% df2$Id)
}
df1$newCol <- as.factor(df1$newCol)

The dataset has about 10k rows, so it does not make sense for this to take 6 minutes to execute (timed with proc.time()), which is what is currently happening. I also need to do other kinds of checks, so I really need to get this right.

What am I doing wrong here that makes it take so long?

Thank you for your help!

Upvotes: 1

Views: 107

Answers (1)

Gregor Thomas

Reputation: 145775

The line inside your loop is already vectorized, so there is no need for the for loop. In this case, you can tell because you don't even use i inside the loop. This means your loop is executing the exact same code for the exact same result 10k times. If you delete the for wrapper and keep only the line that does the work,

df1 <- within(df1, newCol <- df1$Id %in% df2$Id)

you should get ~10k times speed-up.
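
To see why the loop is redundant, here is a minimal sketch on made-up data (the toy df1 and df2 below are assumptions, not your real data frames). It checks that a single vectorized call gives the same vector as repeating that call inside a loop.

```r
# Toy data standing in for the real data frames (hypothetical values)
set.seed(1)
df1 <- data.frame(Id = sample(1:20000, 10000))
df2 <- data.frame(Id = sample(1:20000, 5000))

# One vectorized call: %in% tests every element of df1$Id against df2$Id at once
once <- df1$Id %in% df2$Id

# The loop version recomputes that same full vector on every iteration
looped <- NULL
for (i in seq_len(100)) {  # 100 iterations is enough to make the point
  looped <- df1$Id %in% df2$Id
}

# Same result either way; the loop only adds time
identical(once, looped)
```

If you want to measure the difference yourself, wrapping each version in system.time() should show the looped version taking roughly as many times longer as there are iterations.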

One other comment is that the point of within is to avoid re-typing a data frame's name inside. So you're missing the point by using df1$ inside within(), and your data frame name is so short that it is longer to type within() in this case. Your entire code could be simplified to one line:

df1$newCol = factor(df1$Id %in% df2$Id)
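
For comparison, if you did want to keep within(), a sketch of the intended usage would refer to the column by its bare name. The small data frames here are made up for illustration.

```r
# Hypothetical toy data
df1 <- data.frame(Id = c(1, 2, 3, 4))
df2 <- data.frame(Id = c(2, 4, 6))

# Inside within(), the columns of df1 are visible by name, so Id needs no df1$ prefix.
# df2 is not a column of df1, so it is looked up in the calling environment.
df1 <- within(df1, newCol <- Id %in% df2$Id)

df1$newCol  # FALSE TRUE FALSE TRUE
```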

My last comment I'm making from a state of ignorance about your application, so take it with a grain of salt, but a binary variable is almost always nicer to have as boolean (TRUE/FALSE) or integer (1/0) than as a factor. It does depend what you're doing with it, but I would leave the factor() off until necessary.
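
As a rough illustration of why a logical column tends to be more convenient, here is a small sketch on made-up data. It is only meant to show the kinds of operations that work directly on TRUE/FALSE values, not to say what your reports need.

```r
# Hypothetical toy data
df1 <- data.frame(Id = c(1, 2, 3, 4))
df2 <- data.frame(Id = c(2, 4, 6))

# Keep the column as logical
df1$newCol <- df1$Id %in% df2$Id

n_matches  <- sum(df1$newCol)          # TRUE counts as 1, so this is the number of matches
prop_match <- mean(df1$newCol)         # proportion of rows with a match
matched    <- df1[df1$newCol, ]        # a logical vector can index rows directly
as_int     <- as.integer(df1$newCol)   # 1/0 coding if a model or export needs it

# The factor version stores the labels "FALSE" and "TRUE" as levels,
# so the same tasks need an explicit comparison against a string
f <- factor(df1$newCol)
matched_f <- df1[f == "TRUE", ]
```

Converting to a factor later is a one-liner (factor(df1$newCol)), so little is lost by deferring it.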

Upvotes: 9
