Reputation: 1549
Most of the data sets that I have worked with has generally been of moderate size (mostly less than 100k rows) and hence my code's execution time has usually not been that big a problem for me.
But I was recently trying to write a function that takes 2 dataframes as arguments (with, say, m & n rows) and returns a new dataframe with m*n rows. I then have to perform some operations on the resulting data set. So, even with small values of m & n (say around 1000 each ) the resulting dataframe would have more than a million rows.
When I try even simple operations on this dataset, the code takes an intolerably long time to run. Specifically, my resulting dataframe has 2 columns with numeric values and I need to add a new column which will compare the values of these columns and categorize them as - "Greater than", "less than", "Tied"
I am using the following code:
df %>% mutate(compare=ifelse(var1==var2,"tied",
ifelse(var1>var2,"Greater than","lesser then")
And, as I mentioned before, this takes forever to run. I did some research on this, and I figured out that apparently operations on data.table is significantly faster than dataframe, so maybe that's one option I can try.
But I have never used data.tables before. So before I plunge into that, I was quite curious to know if there are any other ways to speed up computation time for large data sets.
What other options do you think I can try?
Thanks!
Upvotes: 1
Views: 186
Reputation: 567
For large problems like this I like to parallelize. Since operations on individual rows are atomic, meaning that the outcome of an operation on a particular row is independent of every other row, this is an "embarassingly parallel" situation.
library(doParallel)
library(foreach)
registerDoParallel() #You could specify the number of cores to use here. See the documentation.
df$compare <- foreach(m=df$m, n=df$n, .combine='c') %dopar% {
#Borrowing from @nicola in the comments because it's a good solution.
c('Less Than', 'Tied', 'Greater Than')[sign(m-n)+2]
}
Upvotes: 1