Reputation: 1304
I am working on a raw dataset that looks something like this:
df <- data.frame("ID" = c("Alpha", "Alpha", "Alpha", "Alpha",
"Beta","Beta", "Beta","Beta" ),
"Year" = c(1970, 1970, 1980, 1990, 1970, 1980,
1980,1990),
"Val" = c(1,0,1,1,0,0,0,1),
"Val2" = c(0,0,0,1,0,1,0,1),
"Val3" = c(0,2.34,1.3,0,0,2.34,3.2,1.3))
The data is a bit dirty as I have multiple observations for each ID and Year identifier - e.g. I have 2 different rows for Alpha in 1970. The same holds for Beta in 1980.
The issue is that the variable of interest Val
Val2
Val3
have different scores in the duplicated rows (in terms of id/year).
I would like to find a concise way to produce the following final dataframe:
final<- data.frame("ID" = c("Alpha", "Alpha", "Alpha",
"Beta", "Beta","Beta" ),
"Year" = c( 1970, 1980, 1990, 1970,
1980,1990),
"Val" = c(1,1,1,0,0,1),
"Val2" = c(0,0,1,0,1,1),
"Val3" = c(2.34,1,0,0,3.2,1.3))
final
The logic is the following:
1) I want to have only one obs for every ID/year
2) For each variable of interest(Val
Val2
Val3
), I want only to retain the observation with the higher value in the single ID/year output
To illustrate. In df
there is a duplicated observation for Alpha/1970. I want to reduce this to a single row. Val
takes the value of 0 and 1, and in the final
dataframe only 1 is retained. Val2
takes the value of 0 and 0, so I simply keep 0 in final
. Val2
takes the value of 2.34 and 1, hence I want to retain in the final
dataframe.
I guess I could do this one column at the time using this data.table command
library(dplyr)
val1<- df %>% select(ID:Val)
library(data.table)
setDT(val1)[order(-Val)][,.SD[1,], by = .(ID, Year)]
val2<- df %>% select(ID,Year,Val2)
library(data.table)
setDT(val2)[order(-Val2)][,.SD[1,], by = .(ID, Year)]
and then merge the values together but there are several columns for which I have to run the operation and I was looking for a more concise way to do it.
I hope it is sufficiently clear
Thanks a lot for your help in advance
Upvotes: 2
Views: 51
Reputation: 887213
Here is one option
library(dplyr)
df %>%
group_by(ID, Year) %>%
summarise_all(max)
# for specific set of columns, use `summarise_at`
# summarise_at(vars(starts_with("Val")), max)
or with data.table
library(data.table)
nm1 <- grep('^Val\\d*$", names(df))
setDT(df)[, (nm1) := lapply(.SD, max), .(ID, Year), .SDcols = nm1]
Upvotes: 1
Reputation: 39858
You can do:
aggregate(. ~ ID + Year, FUN = max, data = df)
ID Year Val Val2 Val3
1 Alpha 1970 1 0 2.34
2 Beta 1970 0 0 0.00
3 Alpha 1980 1 0 1.30
4 Beta 1980 0 1 3.20
5 Alpha 1990 1 1 0.00
6 Beta 1990 1 1 1.30
Upvotes: 1