Zach
Zach

Reputation: 30311

Pivot a large data.table

I have a large data table in R:

library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
  ID=sample(1:200000, n, replace=TRUE), 
  Month=sample(1:12, n, replace=TRUE),
  Category=sample(1:1000, n, replace=TRUE),
  Qty=runif(n)*500,
  key=c('ID', 'Month')
)
dim(DT)

I'd like to pivot this data.table, such that Category becomes a column. Unfortunately, since the number of categories isn't constant within groups, I can't use this answer.

Any ideas how I might do this?

/edit: Based on joran's comments and flodel's answer, we're really reshaping the following data.table:

agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]

This reshape can be accomplished a number of ways (I've gotten some good answers so far), but what I'm really looking for is something that will scale well to a data.table with millions of rows and hundreds to thousands of categories.

Upvotes: 9

Views: 10486

Answers (4)

Arun
Arun

Reputation: 118799

data.table implements faster versions of melt/dcast data.table specific methods (in C). It also adds additional features for melting and casting multiple columns. Please see the Efficient reshaping using data.tables vignette.

Note that we don't need to load reshape2 package.

library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
  ID=sample(1:200000, n, replace=TRUE), 
  Month=sample(1:12, n, replace=TRUE),
  Category=sample(1:800, n, replace=TRUE), ## to get to <= 2 billion limit
  Qty=runif(n),
  key=c('ID', 'Month')
)
dim(DT)

> system.time(ans <- dcast(DT, ID + Month ~ Category, fun=sum))
#   user  system elapsed
# 65.924  20.577  86.987
> dim(ans)
# [1] 2399401     802

Upvotes: 11

Christoph_J
Christoph_J

Reputation: 6884

EDIT

I found this SO post, which includes a better way to insert the missing rows into a data.table. Function fun_DT adjusted accordingly. Code is cleaner now; I don't see any speed improvements though.

See my update at the other post. Arun's solution works as well, but you have to manually insert the missing combinations. Since you have more identifier columns here (ID, Month), I only came up with a dirty solution here (creating an ID2 first, then creating all ID2-Category combination, then filling up the data.table, then doing the reshaping).

I'm pretty sure this isn't the best solution, but if this FR is built in, those steps might be done automatically.

The solutions are roughly the same speed wise, although it would be interesting to see how that scales (my machine is too slow, so I don't want to increase the n any further...computer crashed to often already ;-)

library(data.table)
library(rbenchmark)

fun_reshape <- function(n) {

  DT <- data.table(
    ID=sample(1:100, n, replace=TRUE), 
    Month=sample(1:12, n, replace=TRUE),
    Category=sample(1:10, n, replace=TRUE),
    Qty=runif(n)*500,
    key=c('ID', 'Month')
  )
  agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
  reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
          timevar = "Category", direction = "wide")
}

#UPDATED!
fun_DT <- function(n) {

  DT <- data.table(
    ID=sample(1:100, n, replace=TRUE), 
    Month=sample(1:12, n, replace=TRUE),
    Category=sample(1:10, n, replace=TRUE),
    Qty=runif(n)*500,
    key=c('ID', 'Month')
  ) 

  agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
  agg[, ID2 := paste(ID, Month, sep="_")]

  setkey(agg, ID2, Category)
  agg <- agg[CJ(unique(ID2), unique(Category))]

  agg[, as.list(setattr(Qty, 'names', Category)), by=list(ID2)]

}

library(rbenchmark)

n <- 1e+07
benchmark(replications=10,
          fun_reshape(n),
          fun_DT(n))
            test replications elapsed relative user.self sys.self user.child sys.child
2      fun_DT(n)           10  45.868        1    43.154    2.524          0         0
1 fun_reshape(n)           10  45.874        1    42.783    2.896          0         0

Upvotes: 2

mnel
mnel

Reputation: 115392

There is no data.table specific wide reshaping method.

Here is an approach that will work, but it is rather convaluted.

There is a feature request #2619 Scoping for LHS in :=to help with making this more straightforward.

Here is a simple example

# a data.table
DD <- data.table(a= letters[4:6], b= rep(letters[1:2],c(4,2)), cc = as.double(1:6))
# with not all categories represented
DDD <- DD[1:5]
# trying to make `a` columns containing `cc`. retaining `b` as a column
# the unique values of `a` (you may want to sort this...)
nn <- unique(DDD[,a])
# create the correct wide data.table
# with NA of the correct class in each created column
rows <- max(DDD[, .N,  by = list(a,b)][,N])
DDw <- DDD[, setattr(replicate(length(nn), {
                     # safe version of correct NA  
                     z <- cc[1]
                      is.na(z) <-1
                     # using rows value calculated previously
                     # to ensure correct size
                       rep(z,rows)}, 
                    simplify = FALSE), 'names', nn),
           keyby = list(b)]
# set key for binary search
setkey(DDD, b, a)
# The possible values of the b column
ub <- unique(DDw[,b])
# nested loop doing things by reference, so should be 
# quick (the feature request would make this possible to 
# speed up using binary search joins.
for(ii in ub){
  for(jj in nn){
    DDw[list(ii), {jj} := DDD[list(ii,jj)][['cc']]]
  }
}

DDw
#    b  d e  f
# 1: a  1 2  3
# 2: a  4 2  3
# 3: b NA 5 NA
# 4: b NA 5 NA

Upvotes: 3

flodel
flodel

Reputation: 89057

Like that?

agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]

reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
        timevar = "Category", direction = "wide")

Upvotes: 3

Related Questions