Reputation: 3035
In R, I'm looking for a memory-efficient way to create a summary of tabular data. Take for example the data.frame foo, which I've summarized with table() and then as.data.frame() to obtain the frequency counts.
foo <- data.frame(x= c('a', 'a', 'a', 'b', 'b', 'b'), y=c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- as.data.frame(table(foo), stringsAsFactors=F)
This results in the following frequency count for bar
x y Freq
1 a ab 1
2 b ab 0
3 a ac 1
4 b ac 0
5 a ad 1
6 b ad 0
7 a ae 0
8 b ae 1
9 a fx 0
10 b fx 1
11 a fy 0
12 b fy 1
The problem I'm running into is that when x and y have many levels, table() materializes a cell for every combination of levels, and this starts using significant amounts of memory (>64 GB). I was wondering if there is an alternative way of doing this kind of frequency count. As a first step I set stringsAsFactors=F, but this doesn't completely solve the problem.
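For comparison, one base-R way to count only the observed (x, y) pairs, rather than the full grid of level combinations that table() builds, is aggregate(); this is a sketch of that idea on the example data:

```r
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'),
                  stringsAsFactors = FALSE)

# Sum a column of 1s within each observed (x, y) group; combinations
# that never occur in the data are simply absent from the result,
# so no memory is spent on zero cells.
bar <- aggregate(list(Freq = rep(1L, nrow(foo))), foo, FUN = sum)
```

Here bar has one row per observed pair (six rows for this example), instead of the 12 rows (including zeros) that as.data.frame(table(foo)) produces.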
Upvotes: 4
Views: 5681
Reputation: 103898
I have this method for fast (sparse) cross-tabulation. I think there are possibilities for further optimisation, but it's been good enough for me for large data sets. The key is the use of ninteraction from the plyr package to quickly generate a numeric id for each row.
tab <- function(df, drop = TRUE) {
  id <- plyr::ninteraction(df)   # one numeric id per unique row
  ord <- order(id)
  df <- df[ord, , drop = FALSE]  # sort rows so identical rows are adjacent
  id <- id[ord]
  freq <- rle(id)$lengths        # run lengths = count for each unique row
  labels <- plyr::unrowname(df[cumsum(freq), , drop = FALSE])
  data.frame(labels, freq)
}
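A self-contained sketch of the same technique on the question's data; note that in current plyr releases the unique-row-id helper is exported as plyr::id() (an assumption here — ninteraction was the name used in earlier versions):

```r
library(plyr)

tab <- function(df, drop = TRUE) {
  id <- plyr::id(df)             # one numeric id per unique row
  ord <- order(id)
  df <- df[ord, , drop = FALSE]  # sort so identical rows are adjacent
  id <- id[ord]
  freq <- rle(id)$lengths        # run lengths = count for each unique row
  labels <- plyr::unrowname(df[cumsum(freq), , drop = FALSE])
  data.frame(labels, freq)
}

foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
tab(foo)   # six rows, one per observed pair; no zero cells
```

Because only observed rows are kept, memory scales with the number of distinct rows in the data, not with the product of the level counts.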
Upvotes: 4
Reputation: 25337
Look at the xtabs method in the Matrix package, which does sparse cross-tabulation.
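A sketch of this on the question's data: xtabs() accepts sparse = TRUE (which requires the Matrix package) and returns a sparse matrix, so zero cells take no dense storage; summary() on the result gives the non-zero cells in triplet form, which can be turned back into a frequency data.frame:

```r
library(Matrix)

foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))

# Sparse contingency table of x vs y
m <- xtabs(~ x + y, data = foo, sparse = TRUE)

# Triplet form of the non-zero cells: row index i, column index j, count x
s <- summary(m)
bar <- data.frame(x = rownames(m)[s$i],
                  y = colnames(m)[s$j],
                  Freq = s$x)
```

Unlike as.data.frame(table(foo)), bar here only contains rows for pairs that actually occur.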
Upvotes: 1