Kerry
Kerry

Reputation: 411

melt data table and split values

I have a column in a data table which is a list of comma separated values

dt = data.table( a = c('a','b','c'), b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt'))
> dt
   a           b
1: a    xx,yy,zz
2: b       mm,nn
3: c qq,rr,ss,tt

I would like to transform it into a long format

   a  b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt

This question has been answered for a data frame here. I'm wondering if there is an elegant data table solution.

Upvotes: 0

Views: 832

Answers (3)

lmo
lmo

Reputation: 38510

The following will work for your example:

dt[, c(b=strsplit(b, ",")), by=a]
   a b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt

This method fails if the "by" variable is repeated as in

dt = data.table(a = c('a','b','c', 'a'),
                b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt', 'zz,gg,tt'))

One robust solution in this situation can be had by using paste to collapse all observations with the same grouping variable (a) and feeding the result to the code above.

dt[, .(b=paste(b, collapse=",")), by=a][, c(b=strsplit(b, ",")), by=a]

This returns

    a  b
 1: a xx
 2: a yy
 3: a zz
 4: a zz
 5: a gg
 6: a tt
 7: b mm
 8: b nn
 9: c qq
10: c rr
11: c ss
12: c tt

Upvotes: 4

Rilgar17
Rilgar17

Reputation: 181

There is another method, but this method involves another package : splitstackshape.

library(splitstackshape)
cSplit(dt, "b", sep = ",", direction = "long")
   a  b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt

This function uses data.table to work. And this work even if we have multiple same value for the column "a".

Upvotes: 1

akrun
akrun

Reputation: 887691

We can split the column 'b' by the delimiter ',' (using strsplit), grouped by 'a' and set the name of the new column i.e. 'V1' to 'b' with setnames

setnames(dt[, strsplit(b, ','), by = a], "V1", "b")[]
#   a  b
#1: a xx
#2: a yy
#3: a zz
#4: b mm
#5: b nn
#6: c qq
#7: c rr
#8: c ss
#9: c tt

If there are repeating elements in 'a' as in the below example

dt <- data.table(a = c('a','b','c', 'a'),
            b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt', 'zz,gg,tt'))

we can group by the sequence of rows, do the strsplit on 'b', concatenate with the 'a' column and assign (:=) the 'grp' to NULL

dt[, c(a=a, b=strsplit(b, ",")), .(grp = 1:nrow(dt))][, grp := NULL][]
#    a  b
# 1: a xx
# 2: a yy
# 3: a zz
# 4: b mm
# 5: b nn
# 6: c qq
# 7: c rr
# 8: c ss
# 9: c tt
#10: a zz
#11: a gg
#12: a tt

NOTE: Both the methods are data.table methods

Upvotes: -2

Related Questions