Andre

Reputation: 125

Running out of memory with merge

I have panel data which looks like this:

(Only the part that is relevant to my question.)

Persno    122   122   122   333   333   333   333   333   444   444
Income   1500  1500  2000  2000  2100  2500  2500  1500  2000  2200
year     1990  1991  1992  1990  1991  1992  1993  1994  1992  1993

Now I would like to compute for every row (by Persno) the years of work experience at the beginning of the year. I use ddply:

hilf3 <- ddply(data, .(Persno), summarize, Bgwork = 1:(max(year) - min(year)))

To produce output looking like this:

Work experience: 1 2 3 1 2 3 4 5 1 2

Now I want to merge the ddply results into my original panel data:

data <- merge(data, hilf3, by.x = "Persno", by.y = "Persno")

The panel data set is very large. The code stops because of a memory size error.

Error message:

1: In make.unique(as.character(rows)) :
  Reached total allocation of 4000Mb: see help(memory.size)

What should I do?

Upvotes: 6

Views: 8068

Answers (4)

Connor Harris

Reputation: 431

Since this question was posted, the data.table package has provided a re-implementation of data frames and a merge function that I have found to be much more memory-efficient than R's default. Converting the default data frames to data tables with as.data.table may avoid memory issues.
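
A minimal sketch of that suggestion, using the object names from the question (allow.cartesian is needed here because Persno is duplicated on both sides of the join):

library(data.table)

DT <- as.data.table(data)     # convert the original panel data
H3 <- as.data.table(hilf3)    # convert the ddply result

# data.table's merge() is generally more memory-efficient than base merge()
merged <- merge(DT, H3, by = "Persno", allow.cartesian = TRUE)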

Upvotes: 1

otsaw

Reputation: 1094

Re-reading your question, I think you don't actually want to use merge here at all. Just sort your original data frame and attach the Bgwork column from hilf3 directly. Also, your ddply call could perhaps result in a 1:0 sequence, which is most likely not what you want. Try:

library(plyr)

# sort so that rows line up with ddply's output (grouped by Persno, then year)
data <- data[order(data$Persno, data$year), ]
hilf3 <- ddply(data, .(Persno), summarize, Bgwork = year - min(year) + 1)

# sanity checks before attaching the new column
stopifnot(nrow(data) == nrow(hilf3))
stopifnot(all(data$Persno == hilf3$Persno))
data$Bgwork <- hilf3$Bgwork

Upvotes: 5

otsaw

Reputation: 1094

If you need to merge large data frames in R, one good option is to do it in pieces of, say, 10,000 rows. If you're merging data frames x and y, loop over 10,000-row pieces of x, merge each piece with y (or rather use plyr::join), and immediately append the results to a single CSV file. After all pieces have been merged and written to the file, read that CSV file back in. This is very memory-efficient with proper use of logical index vectors and well-placed rm() and gc() calls. It's not fast, though.
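
For example, the loop could look roughly like this (x, y, the chunk size, and the file name are placeholders):

library(plyr)

chunk_size <- 10000
outfile    <- "merged.csv"
starts     <- seq(1, nrow(x), by = chunk_size)

for (i in seq_along(starts)) {
  rows  <- starts[i]:min(starts[i] + chunk_size - 1, nrow(x))
  piece <- join(x[rows, ], y, by = "Persno", type = "left")
  # append each merged piece to a single CSV file; write the header only once
  write.table(piece, outfile, sep = ",", row.names = FALSE,
              col.names = (i == 1), append = (i > 1))
  rm(piece); gc()   # free memory before the next chunk
}

merged <- read.csv(outfile)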

Upvotes: 2

richiemorrisroe

Reputation: 9513

Well, perhaps the surest way of fixing this is to get more memory. However, this isn't always an option. What you can do depends somewhat on your platform. On Windows, check the result of memory.size() and compare it to your available RAM; if the memory limit is lower than your RAM, you can increase it. This is not an option on Linux, where R can already use all of your memory by default.
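
For reference, the Windows-only calls look like this (the 8000 MB figure is just an example for an 8 GB machine):

memory.size()               # MB currently in use by R
memory.size(max = TRUE)     # maximum MB obtained from the OS so far
memory.limit()              # current allocation limit in MB
memory.limit(size = 8000)   # raise the limit, if the machine has the RAM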

Another issue that can complicate matters is whether you are running a 32-bit or 64-bit system: 32-bit Windows can only address a limited amount of RAM (2-4 GB, depending on settings). This is not an issue on 64-bit Windows 7, which can address far more memory.

A more practical solution is to eliminate all unnecessary objects from your workspace before performing the merge. Run gc() to see how much memory you have and are using, and to release memory held by objects that are no longer referenced. Personally, I would run your ddply() from a script, save the resulting data frame as a CSV file, close your workspace, reopen it, and then perform the merge again.
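
A sketch of that workflow, using the question's objects (the file name is illustrative):

write.csv(hilf3, "hilf3.csv", row.names = FALSE)   # save the ddply result
rm(hilf3)   # drop objects that are no longer needed
gc()        # report memory use and release what is no longer referenced

# ...close and reopen the workspace, reload the original panel data, then:
hilf3 <- read.csv("hilf3.csv")
data  <- merge(data, hilf3, by = "Persno")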

Finally, the worst option (but one that requires a whole lot less memory) is to create a new data frame and use R's subsetting commands to copy the columns you want over, one by one. I really don't recommend this, as it is tiresome and error-prone, but I have had to do it once when there was no other way to complete my analysis (I ended up investing in a new computer with more RAM shortly afterwards).
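
A hedged sketch of that column-by-column approach, for the simpler case where y has exactly one row per key (x, y, and the column names are placeholders):

idx <- match(x$Persno, y$Persno)    # position of each x row's key in y
out <- data.frame(Persno = x$Persno)
out$Income <- x$Income              # copy columns from x one at a time
out$year   <- x$year
out$Bgwork <- y$Bgwork[idx]         # pull the matching values from y
rm(idx); gc()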

Hope this helps.

Upvotes: 5
