HeyJane

Reputation: 143

Efficient way to handle big data in R

I have a huge CSV file, 1.37 GB, and when I run my glm in R it crashes because I do not have enough memory allocated. You know, the usual error.

Is there no alternative to the ff and bigmemory packages? They do not seem to work well for me, because my columns are a mix of integers and characters, and with both packages I apparently have to specify the type of every column, either character or integer.

It is almost 2018 and we are about to put people on Mars; is there no simple "read.csv.xxl" function we can use?

Upvotes: 1

Views: 726

Answers (1)

Tim Biegeleisen

Reputation: 522787

I would first address your question by pointing out that just because your sample data takes 1.37 GB on disk does not mean that 1.37 GB of memory would be enough to do all your calculations with glm. Most likely, one of the intermediate calculations will spike to at least some multiple of 1.37 GB.

For the second part, a practical workaround here would be to take a reasonable subsample of your 1.37 GB data set. Do you really need to build your model using all the data points in the original data set? Or would, say, a 10% subsample also give you a statistically significant model? If you lower the size of the data set, you solve the memory problem in R.
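As a minimal sketch of the subsample-then-fit idea (the file name "big_data.csv", the formula y ~ x1 + x2, and the binomial family are all placeholders for your actual data and model):

    set.seed(42)  # make the random subsample reproducible

    # Read the full file once; character columns stay as characters
    dat <- read.csv("big_data.csv", stringsAsFactors = FALSE)

    # Keep a random 10% of the rows before fitting
    keep <- sample(nrow(dat), size = floor(0.10 * nrow(dat)))
    sub  <- dat[keep, ]

    # Fit the model on the smaller data set only
    fit <- glm(y ~ x1 + x2, data = sub, family = binomial)
    summary(fit)

If even reading the full file is too much, you can also read it in chunks (e.g. with the nrows and skip arguments of read.csv) and keep only the sampled rows from each chunk.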

Keep in mind here that R runs completely in-memory, meaning that once you have exceeded available memory, you may be out of luck.

Upvotes: 1
