yumba

Reputation: 1106

R: String Operations on Large Data Set (How to speed up?)

I have a large data.frame (>4M rows) in which one column contains character strings. I want to perform several string operations and regular-expression matches on each text field (e.g. with gsub).

I'm wondering how I can speed these operations up. Basically, I'm performing a bunch of calls like

tweetDF$textcolumn <- gsub(patternvector," [token] ",tweetDF$textcolumn)
tweetDF$textcolumn <- gsub(patternvector," [token] ",tweetDF$textcolumn)
....

I'm running R on a Mac with 8 GB of RAM and have tried moving the job to the cloud (an Amazon EC2 large instance with ~64 GB of RAM), but it's not going much faster.

I've heard of several packages (bigmemory, ff) and found an overview of High Performance/Parallel Computing for R here.

Does anyone have recommendations for a package best suited to speeding up string operations? Or does anyone know of a source explaining how to apply the standard R string functions (gsub, ...) to the 'objects' created by these 'high performance computing' packages?

Thanks for your help!

Upvotes: 1

Views: 591

Answers (1)

Michael

Reputation: 13914

mclapply or any other function that allows for parallel processing should speed up the task significantly. If you are not using parallel processing, you are only using one CPU core, no matter how many your machine has available.
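For illustration, here is a minimal sketch of that idea, assuming the patterns live in patternvector as in the question. clean_chunk is a hypothetical wrapper around the poster's sequence of gsub calls; the column is split into one chunk per core and reassembled afterwards:

library(parallel)

# Hypothetical wrapper around the question's sequence of gsub() calls:
# loops over the patterns in patternvector and tokenizes each match.
clean_chunk <- function(txt) {
  for (p in patternvector) {
    txt <- gsub(p, " [token] ", txt)
  }
  txt
}

n_cores <- detectCores()

# Split the text column into one chunk per core; cut() on the row
# index keeps the chunks in their original order.
chunks <- split(tweetDF$textcolumn,
                cut(seq_along(tweetDF$textcolumn), n_cores, labels = FALSE))

# mclapply() runs one forked worker per chunk (fork-based, so it
# parallelizes on Unix/macOS only; on Windows use parLapply() instead).
cleaned <- mclapply(chunks, clean_chunk, mc.cores = n_cores)
tweetDF$textcolumn <- unlist(cleaned, use.names = FALSE)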

Upvotes: 1
