lolo
lolo

Reputation: 646

R Data Frame + logic operation

I have the following consultation. Suppose I have the following base with two columns, one with the ID and one with the value.

What I need is to create an additional column (value_ok) with the following logic: For each ID that has the same letter, no ID with a higher value should have an amount greater than a lower value ID. If this happens, it should be replaced by the same value.

db<-data.frame(id=c("A_1","A_2","A_3","A_4","B_1","B_2","B_3","B_4","C_1","C_2","C_3","C_4","D_1","D_2","D_3","D_4","E_1","E_4"),
            value=c(10,9,8,7,7,8,9,5,15,30,14,20,10,10,10,20,30,40),
         value_ok=c(10,9,8,7,9,9,9,5,30,30,20,20,20,20,20,20,40,40))

Here is an example of the values and the final desired output.

    id value value_ok
1  A_1    10       10
2  A_2     9        9
3  A_3     8        8
4  A_4     7        7
5  B_1     7        9
6  B_2     8        9
7  B_3     9        9
8  B_4     5        5
9  C_1    15       30
10 C_2    30       30
11 C_3    14       20
12 C_4    20       20
13 D_1    10       20
14 D_2    10       20
15 D_3    10       20
16 D_4    20       20
17 E_1    30       40
18 E_4    40       40

Can someone helps me with this task?

Thanks!

Upvotes: 1

Views: 26

Answers (1)

Frank
Frank

Reputation: 66819

You can do...

library(data.table)
setDT(db)

db[.N:1, v := cummax(value), by=sub("^(.+)_(.+)$", "\\1", id)]

     id value value_ok  v
 1: A_1    10       10 10
 2: A_2     9        9  9
 3: A_3     8        8  8
 4: A_4     7        7  7
 5: B_1     7        9  9
 6: B_2     8        9  9
 7: B_3     9        9  9
 8: B_4     5        5  5
 9: C_1    15       30 30
10: C_2    30       30 30
11: C_3    14       20 20
12: C_4    20       20 20
13: D_1    10       20 20
14: D_2    10       20 20
15: D_3    10       20 20
16: D_4    20       20 20
17: E_1    30       40 40
18: E_4    40       40 40

The .N:1 sorts the table from last to first temporarily. by= groups rows. v := cummax(value) makes a new column with the cumulative max within each group.

The very ugly expression after by= is due to embedding of important information (the letter) inside a string. I would recommend never doing that. Here's a reference if you want to convert to something better: Split comma-separated strings in a column into separate rows

Upvotes: 3

Related Questions