Reputation: 13548

R reshaping melted data.table with list column

I have a large (millions of rows) melted data.table with the usual melt-style unrolling in the variable and value columns. I need to cast the table in wide form (rolling the variables up). The problem is that the data table also has a list column called data, which I need to preserve. This makes it impossible to use reshape2 because dcast cannot deal with non-atomic columns. Therefore, I need to do the rolling up myself.

The answer from a previous question about working with melted data tables does not apply here because of the list column.

I am not satisfied with the solution I've come up with. I'm looking for suggestions for a simpler/faster implementation.

x <- LETTERS[1:3]
dt <- data.table(
  x=rep(x, each=2),
  y='d',
  data=list(list(), list(), list(), list(), list(), list()),
  variable=rep(c('var.1', 'var.2'), 3),
  value=seq(1,6)
  )

# Column template set up
list_template <- Reduce(
  function(l, col) { l[[col]] <- col; l }, 
  unique(dt$variable),
  list())

# Expression set up
q <- substitute({
  l <- lapply(
    list_template, 
    function(col) .SD[variable==as.character(col)]$value)
  l$data = .SD[1,]$data
  l
}, list(list_template=list_template))

# Roll up
dt[, eval(q), by=list(x, y)]

   x y var.1 var.2   data
1: A d     1     2 <list>
2: B d     3     4 <list>
3: C d     5     6 <list>

Upvotes: 7

Answers (2)

Uwe

Reputation: 42592

This old question piqued my curiosity as data.table has been improved sigificantly since 2013.

However, even with data.table version 1.11.4

dcast(dt, x + y + data ~ variable)

still returns an error

Columns specified in formula can not be of type list

The workaround follows the general outline of jonsedar's answer :

Reshape the non-list columns from long to wide format
Aggregate the list column data grouped by x and y
Join the two partial results on x and y

but uses the features of the actual data.table syntax, e.g., the on parameter:

dcast(dt, x + y ~ variable)[
  dt[, .(data = .(first(data))), by = .(x, y)], on = .(x, y)]

   x y var.1 var.2   data
1: A d     1     2 <list>
2: B d     3     4 <list>
3: C d     5     6 <list>

The list column data is aggregated by taking the first element. This is in line with OP's code line

l$data = .SD[1,]$data

which also picks the first element.

Upvotes: 2

jonsedar

Reputation: 314

I have somewhat cheating method that might do the trick - importantly, I assume that each x,y,list combination is unique! If not, please disregard.

I'm going to create two separate datatables, the first which is dcasted without the data list objects, and the second which has only the unique data list objects and a key. Then just merge them together to get the desired result.

require(data.table)
require(stringr)
require(reshape2)

x <- LETTERS[1:3]
dt <- data.table(
  x=rep(x, each=2),
  y='d',
  data=list(list("a","b"), list("c","d")),
  variable=rep(c('var.1', 'var.2'), 3),
  value=seq(1,6)
  )


# First create the dcasted datatable without the pesky list objects:
dt_nolist <- dt[,list(x,y,variable,value)]
dt_dcast <- data.table(dcast(dt_nolist,x+y~variable,value.var="value")
                       ,key=c("x","y"))


# Second: create a datatable with only unique "groups" of x,y, list
dt_list <- dt[,list(x,y,data)]

# Rows are duplicated so I'd like to use unique() to get rid of them, but
# unique() doesn't work when there's list objects in the data.table.
# Instead so I cheat by applying a value to each row within an x,y "group" 
# that is unique within EACH group, but present within EVERY group.
# Then just simply subselect based on that unique value.
# I've chosen rank(), but no doubt there's other options

dt_list <- dt_list[,rank:=rank(str_c(x,y),ties.method="first"),by=str_c(x,y)]

# now keep only one row per x,y "group"
dt_list <- dt_list[rank==1]
setkeyv(dt_list,c("x","y"))

# drop the rank since we no longer need it
dt_list[,rank:=NULL]

# Finally just merge back together
dt_final <- merge(dt_dcast,dt_list)

Upvotes: 1

R reshaping melted data.table with list column

Answers (2)

Related Questions