Reputation: 3035
In R, I'm looking for a memory-efficient way to create a summary of tabular data. Take for example the data.frame foo, which I've summarized with table() and then as.data.frame() to obtain the frequency counts.
foo <- data.frame(x= c('a', 'a', 'a', 'b', 'b', 'b'), y=c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
bar <- as.data.frame(table(foo), stringsAsFactors=F)
This results in the following frequency count for bar
x y Freq
1 a ab 1
2 b ab 0
3 a ac 1
4 b ac 0
5 a ad 1
6 b ad 0
7 a ae 0
8 b ae 1
9 a fx 0
10 b fx 1
11 a fy 0
12 b fy 1
The problem I'm running into is that when x and y have many levels, table() materializes a cell for every combination of levels, and this starts using significant amounts of memory (>64 GB). I was wondering if there is an alternative way of doing this kind of frequency count. As a first step I set stringsAsFactors=F, but this doesn't completely solve the problem.
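For comparison, one base-R way to count only the observed (x, y) pairs, rather than the full grid of level combinations that table() builds, is aggregate(); this is a sketch of that idea on the example data:

```r
foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'),
                  stringsAsFactors = FALSE)

# Sum a column of 1s within each observed (x, y) group; combinations
# that never occur in the data are simply absent from the result,
# so no memory is spent on zero cells.
bar <- aggregate(list(Freq = rep(1L, nrow(foo))), foo, FUN = sum)
```

Here bar has one row per observed pair (six rows for this example), instead of the 12 rows (including zeros) that as.data.frame(table(foo)) produces.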
Upvotes: 4
Views: 5681
Reputation: 103898
I have this method for fast (sparse) cross-tabulation. I think there are possibilities for further optimisation, but it's been good enough for me for large data sets. The key is the use of ninteraction from the plyr package to quickly generate a numeric id for each row.
tab <- function(df, drop = TRUE) {
  id <- plyr::ninteraction(df)   # one numeric id per unique row
  ord <- order(id)
  df <- df[ord, , drop = FALSE]  # sort rows so identical rows are adjacent
  id <- id[ord]
  freq <- rle(id)$lengths        # run lengths = count for each unique row
  labels <- plyr::unrowname(df[cumsum(freq), , drop = FALSE])
  data.frame(labels, freq)
}
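A self-contained sketch of the same technique on the question's data; note that in current plyr releases the unique-row-id helper is exported as plyr::id() (an assumption here — ninteraction was the name used in earlier versions):

```r
library(plyr)

tab <- function(df, drop = TRUE) {
  id <- plyr::id(df)             # one numeric id per unique row
  ord <- order(id)
  df <- df[ord, , drop = FALSE]  # sort so identical rows are adjacent
  id <- id[ord]
  freq <- rle(id)$lengths        # run lengths = count for each unique row
  labels <- plyr::unrowname(df[cumsum(freq), , drop = FALSE])
  data.frame(labels, freq)
}

foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))
tab(foo)   # six rows, one per observed pair; no zero cells
```

Because only observed rows are kept, memory scales with the number of distinct rows in the data, not with the product of the level counts.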
Upvotes: 4
Reputation: 25337
Look at the xtabs method in the Matrix package, which does sparse cross-tabulation.
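A sketch of this on the question's data: xtabs() accepts sparse = TRUE (which requires the Matrix package) and returns a sparse matrix, so zero cells take no dense storage; summary() on the result gives the non-zero cells in triplet form, which can be turned back into a frequency data.frame:

```r
library(Matrix)

foo <- data.frame(x = c('a', 'a', 'a', 'b', 'b', 'b'),
                  y = c('ab', 'ac', 'ad', 'ae', 'fx', 'fy'))

# Sparse contingency table of x vs y
m <- xtabs(~ x + y, data = foo, sparse = TRUE)

# Triplet form of the non-zero cells: row index i, column index j, count x
s <- summary(m)
bar <- data.frame(x = rownames(m)[s$i],
                  y = colnames(m)[s$j],
                  Freq = s$x)
```

Unlike as.data.frame(table(foo)), bar here only contains rows for pairs that actually occur.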
Upvotes: 1