Reputation: 193
I have a dataset of about 3 million rows. I created a small example shown below:
ex <- data.table(eoc = c(1,1,1,1,1,2,2,2,3,3), proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236), trigger_cpt = c(63020,63020,63020,63020,63020,63020,63020,63020,61112,61112))
I have another dataset of 42 rows, but have produced a smaller example:
add_on <- data.table(primary = c(63020,61112), secondary=c(63035,63445))
I need to relabel certain rows on the "trigger_cpt" column (grouped by eoc) if the trigger_cpt value happens to be one of the values in the primary column of the dataset and if there is a proc1 value that is the secondary value in the add_on dataset. If it meets the criteria, then trigger_cpt should be relabeled to the secondary code.
I was first typing everything manually,
ex[,trigger_new := if(any(trigger_cpt == '63020' & proc1 == '63035')) 63035 else trigger_cpt, eoc]
and then decided to do a for loop
for(i in 1:nrow(add_on)){
ex[,trigger_new2 := if(any(trigger_cpt == add_on[i,1] & proc1 == add_on[i,2])) add_on[i,2] else trigger_cpt, eoc]
}
however, now that I'm trying this code on my 3 million row dataset, it's been taking a long time to run. I'm not sure if there's a better approach or if there's any modifications I could make to my current code?
any help would be greatly appreciated!
expected output:
ex_final <- data.table(eoc = c(1,1,1,1,1,2,2,2,3,3), proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236), trigger_cpt = c(63035,63035,63035,63035,63035,63020,63020,63020,61112,61112))
Upvotes: 0
Views: 1107
Reputation: 886968
Based on the expected output
ex[, trigger_new := first(proc1), eoc]
ex
eoc proc1 trigger_cpt trigger_new
1: 1 63035 63020 63035
2: 1 63020 63020 63035
3: 1 92344 63020 63035
4: 1 63035 63020 63035
5: 1 27567 63020 63035
6: 2 63020 63020 63020
7: 2 1234 63020 63020
8: 2 55678 63020 63020
9: 3 61112 61112 61112
10: 3 1236 61112 61112
Upvotes: 0
Reputation: 5429
Here is one way which produces a data.table that sets all of trigger_cpt to the secondary value if a match is found in the grouped set:
ex2 <- add_on[ex, , on=.(primary=trigger_cpt)][ , trigger_new := fifelse( secondary %in% proc1, secondary, NA_real_ ), by=eoc ]
ex.final <- ex2[ , trigger_cpt := fcoalesce( trigger_new, primary ) ][, .(eoc,proc1,trigger_cpt) ]
Output:
> ex.final
eoc proc1 trigger_cpt
1: 1 63035 63035
2: 1 63020 63035
3: 1 92344 63035
4: 1 63035 63035
5: 1 27567 63035
6: 2 63020 63020
7: 2 1234 63020
8: 2 55678 63020
9: 3 61112 61112
10: 3 1236 61112
Furthermore, if feasable (it comes at a cost), I consider using setkey
unless it causes more harm than good. (initial processing might make it not worth it). It speeds up downstream operations and it might make join code a lot cleaner . data.table code can be hard enough as it is. Thus:
setkey(ex, trigger_cpt )
setkey(add_on, primary )
## can now do this:
add_on[ex]
## instead of this:
add_on[ex, , on=.(primary=trigger_cpt)]
## .. in the code above.
... Furthermore ...
If you are reworking the steps of above, you will notice that add_on[ex]
(which is the somewhat backwards way to do left joins in data.table), leaves you with the key column names of add_on
, and not of ex
. This is not important, as long as you are aware and rename columns apropriately in the end, but an alternative way of joining the data might be this:
ex2 <- merge( ex, add_on, by.x="trigger_cpt", by.y="primary" )
## and then work your way till the end with what this gives you
Upvotes: 2