Reputation: 30311
I have a large data table in R:
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:1000, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
dim(DT)
I'd like to pivot this data.table, such that Category becomes a column. Unfortunately, since the number of categories isn't constant within groups, I can't use this answer.
Any ideas how I might do this?
/edit: Based on joran's comments and flodel's answer, we're really reshaping the following data.table
:
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
This reshape can be accomplished a number of ways (I've gotten some good answers so far), but what I'm really looking for is something that will scale well to a data.table
with millions of rows and hundreds to thousands of categories.
Upvotes: 9
Views: 10486
Reputation: 118799
data.table
implements faster versions of melt/dcast
data.table specific methods (in C). It also adds additional features for melting and casting multiple columns. Please see the Efficient reshaping using data.tables vignette.
Note that we don't need to load reshape2
package.
library(data.table)
set.seed(1234)
n <- 1e+07*2
DT <- data.table(
ID=sample(1:200000, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:800, n, replace=TRUE), ## to get to <= 2 billion limit
Qty=runif(n),
key=c('ID', 'Month')
)
dim(DT)
> system.time(ans <- dcast(DT, ID + Month ~ Category, fun=sum))
# user system elapsed
# 65.924 20.577 86.987
> dim(ans)
# [1] 2399401 802
Upvotes: 11
Reputation: 6884
EDIT
I found this SO post, which includes a better way to insert the missing rows into a data.table. Function
fun_DT
adjusted accordingly. Code is cleaner now; I don't see any speed improvements though.
See my update at the other post. Arun's solution works as well, but you have to manually insert the missing combinations. Since you have more identifier columns here (ID, Month), I only came up with a dirty solution here (creating an ID2 first, then creating all ID2-Category combination, then filling up the data.table, then doing the reshaping).
I'm pretty sure this isn't the best solution, but if this FR is built in, those steps might be done automatically.
The solutions are roughly the same speed wise, although it would be interesting to see how that scales (my machine is too slow, so I don't want to increase the n any further...computer crashed to often already ;-)
library(data.table)
library(rbenchmark)
fun_reshape <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")
}
#UPDATED!
fun_DT <- function(n) {
DT <- data.table(
ID=sample(1:100, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category=sample(1:10, n, replace=TRUE),
Qty=runif(n)*500,
key=c('ID', 'Month')
)
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
agg[, ID2 := paste(ID, Month, sep="_")]
setkey(agg, ID2, Category)
agg <- agg[CJ(unique(ID2), unique(Category))]
agg[, as.list(setattr(Qty, 'names', Category)), by=list(ID2)]
}
library(rbenchmark)
n <- 1e+07
benchmark(replications=10,
fun_reshape(n),
fun_DT(n))
test replications elapsed relative user.self sys.self user.child sys.child
2 fun_DT(n) 10 45.868 1 43.154 2.524 0 0
1 fun_reshape(n) 10 45.874 1 42.783 2.896 0 0
Upvotes: 2
Reputation: 115392
There is no data.table
specific wide reshaping method.
Here is an approach that will work, but it is rather convaluted.
There is a feature request #2619 Scoping for LHS in :=
to help with making this more straightforward.
Here is a simple example
# a data.table
DD <- data.table(a= letters[4:6], b= rep(letters[1:2],c(4,2)), cc = as.double(1:6))
# with not all categories represented
DDD <- DD[1:5]
# trying to make `a` columns containing `cc`. retaining `b` as a column
# the unique values of `a` (you may want to sort this...)
nn <- unique(DDD[,a])
# create the correct wide data.table
# with NA of the correct class in each created column
rows <- max(DDD[, .N, by = list(a,b)][,N])
DDw <- DDD[, setattr(replicate(length(nn), {
# safe version of correct NA
z <- cc[1]
is.na(z) <-1
# using rows value calculated previously
# to ensure correct size
rep(z,rows)},
simplify = FALSE), 'names', nn),
keyby = list(b)]
# set key for binary search
setkey(DDD, b, a)
# The possible values of the b column
ub <- unique(DDw[,b])
# nested loop doing things by reference, so should be
# quick (the feature request would make this possible to
# speed up using binary search joins.
for(ii in ub){
for(jj in nn){
DDw[list(ii), {jj} := DDD[list(ii,jj)][['cc']]]
}
}
DDw
# b d e f
# 1: a 1 2 3
# 2: a 4 2 3
# 3: b NA 5 NA
# 4: b NA 5 NA
Upvotes: 3
Reputation: 89057
Like that?
agg <- DT[, list(Qty = sum(Qty)), by = c("ID", "Month", "Category")]
reshape(agg, v.names = "Qty", idvar = c("ID", "Month"),
timevar = "Category", direction = "wide")
Upvotes: 3