Reputation: 1304
I have a basic question, and I hope it is not a duplicate. I had a look around but was not able to find anyone who directly addressed it. I have a dataset that looks like this:
df <- data.frame("Reporter" = c("USA", "USA", "USA", "USA",
                                "EU", "EU", "EU", "EU"),
                 "Year" = c(1970, 1970, 1980, 1990, 1970, 1980, 1980, 1990),
                 "Val" = c(1, 0, 1, 1, 0, 0, 0, 1),
                 "Val2" = c(0, 0, 0, 1, 0, 1, 0, 1))
The issue is that I have some duplicated country-year observations, and this creates problems for my analysis (Val differs for USA 1970, and Val2 for EU 1980).
I would like to delete the duplicated rows, always keeping the one with the highest value of the non-matching scores (1). Since I have this issue for hundreds of rows, I was wondering whether there is an automatic approach to find out which rows are duplicated and delete the one with the smaller value.
I have found a way to identify the duplicated rows:
library(dplyr)
df1 <- df %>% select(Reporter, Year)
df1$duplicat <- duplicated(df1)
df1 <- df1 %>% filter(duplicat == TRUE)
but I am still not sure how to delete them according to the specified criterion. Normally I would use anti_join, but it does not work here because each row of df1 matches two rows in df.
Thank you in advance for your help.
Upvotes: 1
Views: 1694
Reputation: 6496
You can use data.table:
library(data.table)
setDT(df)[order(-Val, -Val2)][, .SD[1, ], by = .(Reporter, Year)]
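For readers less familiar with data.table chaining, the same idea can be spelled out step by step (a sketch, not the answerer's exact code; ties on Val are broken by Val2):
library(data.table)

setDT(df)                                  # convert df to a data.table by reference
ordered <- df[order(-Val, -Val2)]          # highest scores first; Val2 breaks ties on Val
result  <- ordered[, .SD[1], by = .(Reporter, Year)]  # keep the top row per Reporter/Year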
Upvotes: 3
Reputation: 388982
A dplyr option would be to arrange the dataframe by Val and Val2 and select the last row for each Reporter and Year.
library(dplyr)
df %>%
  arrange(Val, Val2) %>%
  group_by(Reporter, Year) %>%
  slice(n())
# Reporter Year Val Val2
# <fct> <dbl> <dbl> <dbl>
#1 EU 1970 0 0
#2 EU 1980 0 1
#3 EU 1990 1 1
#4 USA 1970 1 0
#5 USA 1980 1 0
#6 USA 1990 1 1
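A closely related dplyr variant (a sketch of the same idea, not part of the original answer) sorts in descending order first and then keeps the first row per Reporter/Year pair with distinct():
library(dplyr)

df %>%
  arrange(desc(Val), desc(Val2)) %>%           # highest scores first
  distinct(Reporter, Year, .keep_all = TRUE)   # keep the first (highest) row per pair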
Upvotes: 2
Reputation: 139
Using data.table::setorder()
I gave your problem a shot. You could use a pipe and put this into one line (see the sketch after the output below). It gives the same result as heds1's answer, but with a large data set it would be worth comparing the two with a benchmark.
Code:
library(data.table)

dt1 <- data.table("Reporter" = c("USA", "USA", "USA", "USA",
                                 "EU", "EU", "EU", "EU"),
                  "Year" = c(1970, 1970, 1980, 1990, 1970, 1980, 1980, 1990),
                  "Val1" = c(1, 0, 1, 1, 0, 0, 0, 1),
                  "Val2" = c(0, 0, 0, 1, 0, 1, 0, 1))
dt2 <- setorder(dt1, -`Val1`, -`Val2`)
dt3 <- unique(dt2, by = c("Reporter", "Year"))
Output:
Reporter Year Val1 Val2
1: USA 1990 1 1
2: EU 1990 1 1
3: USA 1970 1 0
4: USA 1980 1 0
5: EU 1980 0 1
6: EU 1970 0 0
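For reference, a sketch of the one-line version mentioned above (same package, same dt1):
library(data.table)

# One expression: order descending by the score columns (by reference),
# then drop duplicated Reporter/Year combinations.
dt3 <- unique(setorder(dt1, -Val1, -Val2), by = c("Reporter", "Year"))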
Upvotes: 1
Reputation: 3438
One way would be to order df by Val and Val2, then remove duplicates where both the Reporter and Year columns match:
df <- df[order(df[,'Val'],df[,'Val2'], decreasing=TRUE),]
df <- df[!duplicated(df[c("Reporter","Year")]),]
Reporter Year Val Val2
4 USA 1990 1 1
8 EU 1990 1 1
1 USA 1970 1 0
3 USA 1980 1 0
6 EU 1980 0 1
5 EU 1970 0 0
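If the original row order matters afterwards, one optional follow-up step (not part of the answer above) is to re-sort by the retained row names:
# The subset keeps the original row names ("4", "8", "1", ...), so they
# can be used to restore the original ordering of the kept rows.
df <- df[order(as.numeric(rownames(df))), ]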
Upvotes: 2