Delete duplicated rows, selecting specific values to retain

Question

I am working on a raw dataset that looks something like this:

df <- data.frame("ID" = c("Alpha", "Alpha", "Alpha", "Alpha", 
                                "Beta","Beta", "Beta","Beta" ), 
                 "Year" = c(1970, 1970, 1980, 1990, 1970, 1980, 
                            1980,1990), 
                 "Val" = c(1,0,1,1,0,0,0,1),
                 "Val2" = c(0,0,0,1,0,1,0,1),
                 "Val3" = c(0,2.34,1.3,0,0,2.34,3.2,1.3))

The data is a bit dirty as I have multiple observations for each ID and Year identifier - e.g. I have 2 different rows for Alpha in 1970. The same holds for Beta in 1980.

The issue is that the variable of interest Val Val2 Val3 have different scores in the duplicated rows (in terms of id/year).

I would like to find a concise way to produce the following final dataframe:

final<- data.frame("ID" = c("Alpha", "Alpha", "Alpha", 
                           "Beta", "Beta","Beta" ), 
                  "Year" = c( 1970, 1980, 1990, 1970, 
                             1980,1990), 
                  "Val" = c(1,1,1,0,0,1),
                  "Val2" = c(0,0,1,0,1,1),
                  "Val3" = c(2.34,1,0,0,3.2,1.3))

final

The logic is the following:

1) I want to have only one obs for every ID/year

2) For each variable of interest(Val Val2 Val3), I want only to retain the observation with the higher value in the single ID/year output

To illustrate. In df there is a duplicated observation for Alpha/1970. I want to reduce this to a single row. Val takes the value of 0 and 1, and in the final dataframe only 1 is retained. Val2 takes the value of 0 and 0, so I simply keep 0 in final. Val2 takes the value of 2.34 and 1, hence I want to retain in the final dataframe.

I guess I could do this one column at the time using this data.table command

library(dplyr)
val1<- df %>% select(ID:Val)
library(data.table)
setDT(val1)[order(-Val)][,.SD[1,], by = .(ID, Year)]
val2<- df %>% select(ID,Year,Val2)
library(data.table)
setDT(val2)[order(-Val2)][,.SD[1,], by = .(ID, Year)]

and then merge the values together but there are several columns for which I have to run the operation and I was looking for a more concise way to do it.

I hope it is sufficiently clear

Thanks a lot for your help in advance

akrun · Accepted Answer

Here is one option

library(dplyr)
df %>% 
  group_by(ID, Year) %>% 
  summarise_all(max)
  # for specific set of columns, use `summarise_at`
  # summarise_at(vars(starts_with("Val")), max)

or with data.table

library(data.table)
nm1 <- grep('^Val\d*$", names(df))
setDT(df)[, (nm1) := lapply(.SD, max), .(ID, Year), .SDcols = nm1]

Delete duplicated rows, selecting specific values to retain

Answers (2)

Related Questions