beginneR
beginneR

Reputation: 3291

merge-like scenario with two data.tables

I have two dataframes (actually data.tables).

set.seed(123)
dt1 <- data.table(P=rep(letters[1:3],c(4,2,3)),X=sample(9))
dt1
   P X
1: a 3
2: a 7
3: a 9
4: a 6
5: b 5
6: b 1
7: c 2
8: c 8
9: c 4

and:

dt2 <- data.table(P=rep(letters[1:5],length=10),D=c("X","Y","Z","G","F"))
dt2
    P D
 1: a X
 2: b Y
 3: c Z
 4: d G
 5: e F
 6: a X
 7: b Y
 8: c Z
 9: d G
10: e F

Now I want to add a new column to dt1, with column "D" of dt2 where P has the same value in dt1 and dt2. It should look like this:

dt_new
   P X D
1: a 3 X
2: a 7 X
3: a 9 X
4: a 6 X
5: b 5 Y
6: b 1 Y
7: c 2 Z
8: c 8 Z
9: c 4 Z

Upvotes: 3

Views: 152

Answers (2)

Matt Dowle
Matt Dowle

Reputation: 59602

+1 to Arun's answer. To show the update-by-reference way to do this ...

Example data from question again:

set.seed(123)
dt1 = data.table(P=rep(letters[1:3],c(4,2,3)),X=sample(9))
dt2 = data.table(P=rep(letters[1:5],length=10),D=c("X","Y","Z","G","F"))

Removed dups in example data using unique() as Arun did :

dt2 = unique(dt2)  
dt2
   P D
1: a X
2: b Y
3: c Z
4: d G
5: e F

Now add D by reference to dt1 with data from dt2. Like a foreign key in SQL. Admittedly this syntax isn't obvious or particularly elegant but it does avoid the copy of dt1. So it can be significantly faster if dt1 is say 10GB in size.

setkey(dt2, P)
dt1[,D:={ .P=P           # allows us to refer to the P from dt1 on next line
          dt2[.P,D]$D}]  # since P is type character, no need to J() or .()
dt1
   P X D
1: a 3 X
2: a 7 X
3: a 9 X
4: a 6 X
5: b 5 Y
6: b 1 Y
7: c 2 Z
8: c 8 Z
9: c 4 Z

Or, keeping the duplicates in dt2:

set.seed(123)
dt1 = data.table(P=rep(letters[1:3],c(4,2,3)),X=sample(9))
dt2 = data.table(P=rep(letters[1:5],length=10),D=c("X","Y","Z","G","F"))
setkey(dt2,P)
dt2
    P D
 1: a X
 2: a X
 3: b Y
 4: b Y
 5: c Z
 6: c Z
 7: d G
 8: d G
 9: e F
10: e F
dt1[,D:={ .P=P
          dt2[.P,D,mult="first"]}]
dt1
   P X D
1: a 3 X
2: a 7 X
3: a 9 X
4: a 6 X
5: b 5 Y
6: b 1 Y
7: c 2 Z
8: c 8 Z
9: c 4 Z

Upvotes: 6

Arun
Arun

Reputation: 118779

I'd do a data.table join in this manner:

setkey(dt1, P)
dt1[unique(dt2),nomatch=0]

   P X D
1: a 3 X
2: a 7 X
3: a 9 X
4: a 6 X
5: b 5 Y
6: b 1 Y
7: c 2 Z
8: c 8 Z
9: c 4 Z

Upvotes: 7

Related Questions