Reputation: 2675
I'm attempting to update a data.table based on values in another table. I thought I had an approach (although not particularly efficient given my actual dt2 has 3.5M records) but it turns out something is going wrong with my code
In DT1 I have a number of variables (numeric) that I am wanting to set to 1 or 0 depending on what their classification value is.
Example of data:
dt1 <- data.table(urn = 1:10, V1=0,V2=0,V3=0)
## urn V1 V2 V3
## 1: 1 0 0 0
## 2: 2 0 0 0
## 3: 3 0 0 0
## 4: 4 0 0 0
## 5: 5 0 0 0
## 6: 6 0 0 0
## 7: 7 0 0 0
## 8: 8 0 0 0
## 9: 9 0 0 0
##10: 10 0 0 0
dt2 <- data.table(urn=rep(1:10,2),classification=0)
dt2$classification <- 1:7 #does give a warning message
## urn classification
## 1: 1 1
## 2: 2 2
## 3: 3 3
## 4: 4 4
## 5: 5 5
## 6: 6 6
## 7: 7 7
## 8: 8 1
## 9: 9 2
##10: 10 3
##11: 1 4
##12: 2 5
##13: 3 6
##14: 4 7
##15: 5 1
##16: 6 2
##17: 7 3
##18: 8 4
##19: 9 5
##20: 10 6
What I want to happen is where the urn matches, and the urn has a record with classification of 1, then set the V1 flag to 1; where the urn has a classification of 2, set V2 = 1; you get the idea.
My current approach has been:
setkey(dt1, urn)
setkey(dt2, urn)
dt1[dt2, V1:= ifelse(i.classification == 1 , 1, dt1$V1)]
dt1[dt2, V2:= ifelse(i.classification == 2, 1, dt1$V2)]
dt1[dt2, V3:= ifelse(i.classification == 6, 1, dt1$V3)]
dt1
## urn V1 V2 V3
## 1: 1 1 0 0
## 2: 2 1 0 0
## 3: 3 1 0 1
## 4: 4 1 0 0
## 5: 5 1 0 0
## 6: 6 1 1 0
## 7: 7 1 0 0
## 8: 8 1 0 0
## 9: 9 1 0 0
##10: 10 1 0 1
What I actually expect the output to look like is:
## urn V1 V2 V3
## 1: 1 1 0 0
## 2: 2 0 1 0
## 3: 3 0 0 1
## 4: 4 0 0 0
## 5: 5 1 0 0
## 6: 6 0 1 1
## 7: 7 0 0 0
## 8: 8 1 0 0
## 9: 9 0 1 0
##10: 10 0 0 1
Any help would be greatly appreciated. Particularly if I am able to simplify the command to manage all of these in a single scan rather than 3 joins of data. As I said 3.5M records adds a bit of overhead (particularly when I'll actually be setting flags for about 10 columns).
Thanks in advance.
Upvotes: 2
Views: 867
Reputation: 22333
You can use dcast.data.table
for this.
dcast.data.table(data = dt1[dt2,],
formula = urn ~ classification,
fun.aggregate = function(x) as.numeric(any(!is.na(x))),
value.var = 'classification')[, list(urn, V1=`1`, V2=`2`, V3=`6`)]
## urn V1 V2 V3
## 1: 1 1 0 0
## 2: 2 0 1 0
## 3: 3 0 0 1
## 4: 4 0 0 0
## 5: 5 1 0 0
## 6: 6 0 1 1
## 7: 7 0 0 0
## 8: 8 1 0 0
## 9: 9 0 1 0
## 10: 10 0 0 1
EDIT: You can wrap the last statement in tryCatch
functions if necessary:
dcast.data.table(data = dt1[dt2,],
formula = urn ~ classification,
fun.aggregate = function(x) as.numeric(any(!is.na(x))),
value.var = 'classification')[, list(urn,
V1=tryCatch(`1`, error = ## function(e) NA),
V2=tryCatch(`2`, error = function(e) NA),
V3=tryCatch(`8`, error = function(e) NA))]
## urn V1 V2 V3
## 1: 1 1 0 NA
## 2: 2 0 1 NA
## 3: 3 0 0 NA
## 4: 4 0 0 NA
## 5: 5 1 0 NA
## 6: 6 0 1 NA
## 7: 7 0 0 NA
## 8: 8 1 0 NA
## 9: 9 0 1 NA
## 10: 10 0 0 NA
Upvotes: 2