Dataminer

Reputation: 1549

Reduce computation time

Most of the data sets that I have worked with have been of moderate size (mostly fewer than 100k rows), so my code's execution time has usually not been much of a problem for me.

But I was recently trying to write a function that takes two data frames as arguments (with, say, m and n rows) and returns a new data frame with m*n rows. I then have to perform some operations on the resulting data set. So, even with small values of m and n (say around 1000 each), the resulting data frame would have about a million rows.
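For reference, here is a minimal sketch of how such an m*n data frame can be built in base R. The data frames `a` and `b` and their column names are made up for illustration; they are not the actual data from the question.

```r
# Hypothetical inputs: two small data frames with m = 3 and n = 2 rows.
a <- data.frame(id1 = 1:3, var1 = c(5, 7, 9))
b <- data.frame(id2 = 1:2, var2 = c(7, 4))

# With by = NULL, merge() returns the Cartesian product of the rows,
# so every row of `a` is paired with every row of `b`.
cross <- merge(a, b, by = NULL)

nrow(cross)  # m * n rows
```

The same pattern with m and n around 1000 each gives the roughly one-million-row data frame described above.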

When I try even simple operations on this data set, the code takes an intolerably long time to run. Specifically, my resulting data frame has two columns with numeric values, and I need to add a new column that compares the values of these columns and categorizes each row as "Greater than", "Less than", or "Tied".

I am using the following code:

library(dplyr)

df %>% mutate(compare = ifelse(var1 == var2, "Tied",
                        ifelse(var1 > var2, "Greater than", "Less than")))

And, as I mentioned before, this takes forever to run. I did some research on this and found that operations on a data.table are apparently significantly faster than on a data frame, so maybe that's one option I can try.

But I have never used data.tables before. So before I plunge into that, I was quite curious to know if there are any other ways to speed up computation time for large data sets.
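For readers who want to see what the data.table route might look like, here is a rough sketch, assuming the data.table package is installed (a version that provides `fcase()`). The three-row table is made-up example data, and this is only one possible way to write the comparison.

```r
library(data.table)

# Hypothetical example data with the two numeric columns from the question.
dt <- data.table(var1 = c(3, 2, 1), var2 = c(1, 2, 3))

# := adds the column by reference (no copy of the table is made);
# fcase() checks the conditions in order and falls back to `default`.
dt[, compare := fcase(var1 == var2, "Tied",
                      var1 > var2,  "Greater than",
                      default = "Less than")]
```

An existing data frame can be converted with `as.data.table(df)` before applying the same expression.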

What other options do you think I can try?

Thanks!

Upvotes: 1

Views: 186

Answers (1)

doicomehereoften1

Reputation: 567

For large problems like this I like to parallelize. Since the outcome of the operation on a particular row is independent of every other row, this is an "embarrassingly parallel" situation.

library(doParallel)
library(foreach)

registerDoParallel() #You could specify the number of cores to use here. See the documentation.

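# Iterate over the two numeric columns from the question (var1 and var2) in lockstep.
df$compare <- foreach(m=df$var1, n=df$var2, .combine='c') %dopar% {
    #Borrowing from @nicola in the comments because it's a good solution.
    #sign(m-n) is -1, 0 or 1, so adding 2 gives an index of 1, 2 or 3.
    c('Less Than', 'Tied', 'Greater Than')[sign(m-n)+2]
}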
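The indexing trick credited to @nicola is itself vectorized, so it can also be applied to the whole columns at once in base R, without a loop. The sketch below uses a made-up three-row data frame; on a million rows this form is likely worth timing first, since it avoids the per-element overhead of dispatching tasks to workers.

```r
# Hypothetical example data with the two numeric columns from the question.
df <- data.frame(var1 = c(3, 2, 1), var2 = c(1, 2, 3))

# sign() returns -1, 0 or 1 for each row; adding 2 maps that to index 1, 2 or 3.
labels <- c("Less than", "Tied", "Greater than")
df$compare <- labels[sign(df$var1 - df$var2) + 2]

df
```

Rows with a missing value in either column get `NA` in `compare`, because `sign(NA)` is `NA`.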

Upvotes: 1
