vizidea
vizidea

Reputation: 193

R loop using data.table

I have a dataset of about 3 million rows. I created a small example shown below:

ex <- data.table(eoc = c(1,1,1,1,1,2,2,2,3,3), proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236), trigger_cpt = c(63020,63020,63020,63020,63020,63020,63020,63020,61112,61112))

I have another dataset of 42 rows, but have produced a smaller example:

add_on <- data.table(primary = c(63020,61112), secondary=c(63035,63445))

I need to relabel certain rows on the "trigger_cpt" column (grouped by eoc) if the trigger_cpt value happens to be one of the values in the primary column of the dataset and if there is a proc1 value that is the secondary value in the add_on dataset. If it meets the criteria, then trigger_cpt should be relabeled to the secondary code.

I was first typing everything manually, ex[,trigger_new := if(any(trigger_cpt == '63020' & proc1 == '63035')) 63035 else trigger_cpt, eoc]

and then decided to do a for loop

for(i in 1:nrow(add_on)){
  ex[,trigger_new2 := if(any(trigger_cpt == add_on[i,1] & proc1 == add_on[i,2])) add_on[i,2] else trigger_cpt, eoc]
}

however, now that I'm trying this code on my 3 million row dataset, it's been taking a long time to run. I'm not sure if there's a better approach or if there's any modifications I could make to my current code?

any help would be greatly appreciated!

expected output:

ex_final <- data.table(eoc = c(1,1,1,1,1,2,2,2,3,3), proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236), trigger_cpt = c(63035,63035,63035,63035,63035,63020,63020,63020,61112,61112))

Upvotes: 0

Views: 1107

Answers (2)

akrun
akrun

Reputation: 886968

Based on the expected output

ex[, trigger_new := first(proc1), eoc]




ex
    eoc proc1 trigger_cpt trigger_new
 1:   1 63035       63020       63035
 2:   1 63020       63020       63035
 3:   1 92344       63020       63035
 4:   1 63035       63020       63035
 5:   1 27567       63020       63035
 6:   2 63020       63020       63020
 7:   2  1234       63020       63020
 8:   2 55678       63020       63020
 9:   3 61112       61112       61112
10:   3  1236       61112       61112

Upvotes: 0

Sirius
Sirius

Reputation: 5429

Here is one way which produces a data.table that sets all of trigger_cpt to the secondary value if a match is found in the grouped set:


ex2 <- add_on[ex, , on=.(primary=trigger_cpt)][ , trigger_new := fifelse( secondary %in% proc1, secondary, NA_real_ ), by=eoc ]
ex.final  <- ex2[ , trigger_cpt := fcoalesce( trigger_new, primary ) ][, .(eoc,proc1,trigger_cpt) ]

Output:


> ex.final
    eoc proc1 trigger_cpt
 1:   1 63035       63035
 2:   1 63020       63035
 3:   1 92344       63035
 4:   1 63035       63035
 5:   1 27567       63035
 6:   2 63020       63020
 7:   2  1234       63020
 8:   2 55678       63020
 9:   3 61112       61112
10:   3  1236       61112

Furthermore, if feasable (it comes at a cost), I consider using setkey unless it causes more harm than good. (initial processing might make it not worth it). It speeds up downstream operations and it might make join code a lot cleaner . data.table code can be hard enough as it is. Thus:


setkey(ex, trigger_cpt )
setkey(add_on, primary )

## can now do this:
add_on[ex]

## instead of this:
add_on[ex, , on=.(primary=trigger_cpt)]

## .. in the code above.

... Furthermore ...

If you are reworking the steps of above, you will notice that add_on[ex] (which is the somewhat backwards way to do left joins in data.table), leaves you with the key column names of add_on, and not of ex. This is not important, as long as you are aware and rename columns apropriately in the end, but an alternative way of joining the data might be this:


ex2 <- merge( ex, add_on, by.x="trigger_cpt", by.y="primary" )
## and then work your way till the end with what this gives you

Upvotes: 2

Related Questions