Matt Chambers
Matt Chambers

Reputation: 2346

Data.table left join and aggregate/concatenate/group_concat

I have tables like:

x = data.table(Id=c(1,1,2,3,3,4), Name=c("A", "A", "B", "C", "C", "D"), TxId=c(10, 11, 20, 30, 31, 40))
#Id Name TxId
#1:  1    A   10
#2:  1    A   11
#3:  2    B   20
#4:  3    C   30
#5:  3    C   31
#6:  4    D   40

y = data.table(Name=c("A", "B", "B", "C"), Family=c("A-alpha", "B-beta", "B-gamma", "C-delta"))
#   Name  Family
#1:    A A-alpha
#2:    B  B-beta
#3:    B B-gamma
#4:    C C-delta

I can do the left-join and concatenation, but I only want one output row for each row in X.

# Left join X to Y on Name column
xy = y[x, on="Name"]
#   Name  Family Id TxId
#1:    A A-alpha  1   10
#2:    A A-alpha  1   11
#3:    B  B-beta  2   20
#4:    B B-gamma  2   20
#5:    C C-delta  3   30
#6:    C C-delta  3   31
#7:    D      NA  4   40

# Concatenate Family column
xy[, Family:=paste0(Family, collapse=", "), by=c("Name", "TxId")]
#   Name          Family Id TxId
#1:    A         A-alpha  1   10
#2:    A         A-alpha  1   11
#3:    B B-beta, B-gamma  2   20
#4:    B B-beta, B-gamma  2   20
#5:    C         C-delta  3   30
#6:    C         C-delta  3   31
#7:    D              NA  4   40

How do I get rid of the extra row for B? I want it to be unique on Id/TxId. i.e.

#   Name          Family Id TxId
#1:    A         A-alpha  1   10
#2:    A         A-alpha  1   11
#3:    B B-beta, B-gamma  2   20
#5:    C         C-delta  3   30
#6:    C         C-delta  3   31
#7:    D              NA  4   40

If I do as eddi comments:

xy[, .(Family=paste0(Family, collapse=", "), by=c("Name", "TxId")])

I get the correct result. But if I try to add other columns, it doesn't work (I get the same result as if I had done the := version):

xy[, .(Id, Family=paste0(Family, collapse=", ")), by=c("Name", "TxId")]

Upvotes: 0

Views: 274

Answers (1)

Uwe
Uwe

Reputation: 42592

Please, try

xy[, .(Family = paste0(Family, collapse = ", "), by = c("Id", "Name", "TxId")]

I'll try to explain:
If Id is part of the group-by it will only appear once for each unique value of Id (to be exact, for each unique combination of Id, Name, TxId). If Id is included in the j-expression, i.e., .(Id, Family = paste0(Family, collapse = ", ") then every row of Id will be included in the result set despite Family is being aggregated.

Upvotes: 1

Related Questions