Reputation: 1106
I have a large data.frame (>4M rows) in which one column contains character strings. I want to perform several string operations/regular-expression matches on each text field (e.g. gsub).
I'm wondering how I can speed up these operations. Basically, I'm performing a bunch of calls like
gsub(patternvector, " [token] ", tweetDF$textcolumn)
gsub(patternvector, " [token] ", tweetDF$textcolumn)
....
I'm running R on an 8GB RAM Mac and tried moving the job to the cloud (an Amazon EC2 large instance with ~64GB RAM), but it's still not going very fast.
I've heard of several packages (bigmemory, ff) and found an overview of High Performance/Parallel Computing for R here.
Does anyone have a recommendation for a package best suited to speeding up string operations? Or know of a source explaining how to apply the standard R string functions (gsub, ...) to the 'objects' created by these 'High Performance Computing' packages?
Thanks for your help!
Upvotes: 1
Views: 591
Reputation: 13914
mclapply or any other function that allows for parallel processing should speed up the task significantly. Without parallel processing, you are using only 1 CPU core, no matter how many CPUs your computer has available.
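As a minimal sketch of that approach (assuming patternvector is a single regular expression, since gsub() only uses the first element of a longer pattern vector, and using illustrative names like clean_chunk): split the column into one chunk per core, clean each chunk in a parallel worker, and reassemble:

library(parallel)

n_cores <- detectCores()
# Split the text column into one chunk per available core.
chunks  <- split(tweetDF$textcolumn,
                 cut(seq_along(tweetDF$textcolumn), n_cores))

# Run the gsub() calls on one chunk of the text column.
clean_chunk <- function(x) {
  x <- gsub(patternvector, " [token] ", x)
  # ... further gsub()/regex operations on x ...
  x
}

# mclapply() forks one worker per core; unlist() reassembles the
# cleaned chunks in their original order.
tweetDF$textcolumn <- unlist(mclapply(chunks, clean_chunk, mc.cores = n_cores),
                             use.names = FALSE)

Note that mclapply() relies on forking, so it runs in parallel on macOS/Linux but falls back to a single process on Windows.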
Upvotes: 1