Alex

Reputation: 1304

Removing duplicate rows in panel data by criteria

I have a basic question, and I hope it is not a duplicate. I had a look around but was not able to find anyone who directly addressed it. I have a dataset that looks like this:

   df <- data.frame("Reporter" = c("USA", "USA", "USA", "USA",
                                   "EU", "EU", "EU", "EU"),
                    "Year" = c(1970, 1970, 1980, 1990, 1970, 1980, 1980, 1990),
                    "Val" = c(1, 0, 1, 1, 0, 0, 0, 1),
                    "Val2" = c(0, 0, 0, 1, 0, 1, 0, 1))

The issue is that I have some duplicated country-year observations, and this creates problems for my analysis (the duplicates differ in Val for USA 1970 and in Val2 for EU 1980).

I would like to delete the duplicated rows, always keeping the one with the highest value in the non-matching columns. Since I have this issue for hundreds of rows, I was wondering whether there is an automatic way to find which observations are duplicated and delete the one with the smaller value.

I have found a way to identify the duplicated rows:

library(dplyr)
df1 <- df %>% select(Reporter, Year)
df1$duplicat <- duplicated(df1)          # TRUE for the second occurrence of a Reporter-Year pair
df1 <- df1 %>% filter(duplicat == TRUE)

but I am still not sure how to delete them according to the criterion above. Normally I would use anti_join, but it does not work here because each Reporter-Year pair in df1 matches two rows of df, as shown below.
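
For illustration (building on df1 above), this is roughly what that attempt looks like; because both copies of a duplicated pair share the same key, it drops both of them instead of keeping the better one:

anti_join(df, df1, by = c("Reporter", "Year"))   # removes both USA-1970 rows and both EU-1980 rows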

Thank you in advance for your help.

Upvotes: 1

Views: 1694

Answers (4)

PavoDive

Reputation: 6496

You can use data.table:

library(data.table)
setDT(df)[order(-Val, -Val2)][, .SD[1], by = .(Reporter, Year)]   # keep the first row per Reporter-Year after descending sort

Upvotes: 3

Ronak Shah

Reputation: 388982

A dplyr option would be to arrange the dataframe by Val and Val2 and select the last row for each Reporter and Year.

library(dplyr)

df %>%
  arrange(Val, Val2) %>%
  group_by(Reporter, Year) %>%
  slice(n())

#  Reporter  Year   Val  Val2
#  <fct>    <dbl> <dbl> <dbl>
#1 EU        1970     0     0
#2 EU        1980     0     1
#3 EU        1990     1     1
#4 USA       1970     1     0
#5 USA       1980     1     0
#6 USA       1990     1     1
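
Equivalently, one could sort in descending order with desc() and take the first row of each group; this is just a small variation on the same idea:

df %>%
  arrange(desc(Val), desc(Val2)) %>%
  group_by(Reporter, Year) %>%
  slice(1)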

Upvotes: 2

A. Korinda

Reputation: 139

Using data.table::setorder(), I gave your problem a shot. You could use a pipe and put this into one line (see the sketch after the code below). It gives the same result as heds1's answer, but if you have a large data set it would be worth comparing the two with a benchmark.

Code:

library(data.table)

dt1 <- data.table("Reporter" = c("USA", "USA", "USA", "USA",
                                 "EU", "EU", "EU", "EU"),
                  "Year" = c(1970, 1970, 1980, 1990, 1970, 1980, 1980, 1990),
                  "Val1" = c(1, 0, 1, 1, 0, 0, 0, 1),
                  "Val2" = c(0, 0, 0, 1, 0, 1, 0, 1))

dt2 <- setorder(dt1, -Val1, -Val2)              # sort descending by both value columns (by reference)
dt3 <- unique(dt2, by = c("Reporter", "Year"))  # keep the first (highest-valued) row per Reporter-Year
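
As a sketch of the one-liner mentioned above (assuming magrittr's %>% is attached, e.g. via dplyr), the two steps can be piped together and produce the same dt3:

dt3 <- setorder(dt1, -Val1, -Val2) %>% unique(by = c("Reporter", "Year"))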

Output:

   Reporter Year Val1 Val2
1:      USA 1990    1    1
2:       EU 1990    1    1
3:      USA 1970    1    0
4:      USA 1980    1    0
5:       EU 1980    0    1
6:       EU 1970    0    0
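
A minimal benchmarking sketch for the comparison mentioned above, assuming the microbenchmark package is installed; copy() keeps setorder()'s by-reference sort from touching the original table, and the base-R expression is heds1's approach on the question's df (whose column is named Val rather than Val1):

library(microbenchmark)

microbenchmark(
  data_table = unique(setorder(copy(dt1), -Val1, -Val2), by = c("Reporter", "Year")),
  base_r = {
    tmp <- df[order(df$Val, df$Val2, decreasing = TRUE), ]
    tmp[!duplicated(tmp[c("Reporter", "Year")]), ]
  },
  times = 100
)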

Upvotes: 1

heds1

Reputation: 3438

One way would be to order the df by Val and Val2, then remove duplicates where both Reporter and Year columns match:

df <- df[order(df[, 'Val'], df[, 'Val2'], decreasing = TRUE), ]   # sort so the highest-valued duplicate comes first
df <- df[!duplicated(df[c("Reporter", "Year")]), ]                # duplicated() keeps the first occurrence per Reporter-Year


  Reporter Year Val Val2
4      USA 1990   1    1
8       EU 1990   1    1
1      USA 1970   1    0
3      USA 1980   1    0
6       EU 1980   0    1
5       EU 1970   0    0

Upvotes: 2
