RSK
RSK

Reputation: 755

how to match two columns by group in R

I have a dataset like d1.

d1<-structure(list(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L), cname = structure(c(1L, 
1L, 1L, 2L, 3L, 2L, 2L, 3L, 1L), .Label = c("AA", "BB", "CC"), class = "factor"), 
value = c(1L, 2L, 1L, 2L, 2L, 1L, 2L, 3L, 1L), recentcname = c(NA, 
NA, NA, NA, NA, NA, NA, NA, NA)), .Names = c("id", "cname", 
"value", "recentcname"), class = "data.frame", row.names = c(NA, 
-9L))

Here my key variables are "id" and "value". For every individual id, I have to find out maximum value record in "value" column and take that corresponding "cname" string into "recentcname" column for that id. If we have two maximum values for one id, we have to take second highest record's "cname" string into "recentcname" column.

Finally my output will be like d2.

d2<-structure(list(id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L), cname = structure(c(1L, 
1L, 1L, 2L, 3L, 2L, 2L, 3L, 1L), .Label = c("AA", "BB", "CC"), class = "factor"), 
value = c(1L, 2L, 1L, 2L, 2L, 1L, 2L, 3L, 1L), recentcname = structure(c(1L, 
1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("AA", "CC"), class = "factor")), .Names = c("id", 
"cname", "value", "recentcname"), class = "data.frame", row.names = c(NA, 
-9L))

I can do it by splitting dataset by ib. But it is very time consumable. is there any other alternatives for this task. please help...

Upvotes: 2

Views: 116

Answers (2)

rnso
rnso

Reputation: 24623

Try:

dd = do.call(rbind, lapply(split(d1, d1$id), function(x)tail(x,1)))
names(dd)[2]= 'recentcname'
merge(d1[1:3], dd[1:2])
  id cname value recentcname
1  1    AA     1          AA
2  1    AA     2          AA
3  1    AA     1          AA
4  2    BB     2          CC
5  2    CC     2          CC
6  3    BB     1          CC
7  3    BB     2          CC
8  3    CC     3          CC
9  4    AA     1          AA

Upvotes: 1

MrFlick
MrFlick

Reputation: 206536

How about this

d1$recentcname <- unsplit(lapply(split(d1[,c("value","cname")], d1$id), function(x) {
    rep(tail(x$cname[x$value==max(x$value)],1), nrow(x))
}), d1$id)

we basically split the data by ID, then look for the last max value in each subset and repeat that value for each row in the subset. Then we use unsplit() to but the values back in the correct order corresponding to d1.

Upvotes: 2

Related Questions