Reputation: 1644
I have a dataframe in r as follows:
df<-data.frame(id=rep(1:3,each=2),date=rep(c(NA,"date1","date2"),each=2),value=rep(c(4.1,9.4,5.6,6),c(2,2,1,1)))
# id date value
# 1 1 <NA> 4.1
# 2 1 <NA> 4.1
# 3 2 date1 9.4
# 4 2 date1 9.4
# 5 3 date2 5.6
# 6 3 date2 6.0
I would like to remove duplicate rows from the dataframe, such that if there is more than 1 row that has the same id, same date and same value, then I would only take the first row and remove others.
My first instinct is to do the following:
library(dplyr)
df %>% group_by(id,date,value) %>% slice(1)
However, what I want is
# id date value
# 1 1 <NA> 4.1
# 2 1 <NA> 4.1
# 3 2 date1 9.4
# 4 3 date2 5.6
# 5 3 date2 6.0
which is to disregard the slice(1)
for rows with empty "date".
How can I do this?
Upvotes: 1
Views: 127
Reputation: 887088
We can use filter
library(dplyr)
df %>%
filter(is.na(date)|!duplicated(cbind(id, date, value)))
# id date value
#1 1 <NA> 4.1
#2 1 <NA> 4.1
#3 2 date1 9.4
#4 3 date2 5.6
#5 3 date2 6.0
Or with slice
df %>%
group_by(id,date,value) %>%
slice(unique(c(1, which(is.na(date)))))
# A tibble: 5 x 3
# Groups: id, date, value [4]
# id date value
# <int> <fctr> <dbl>
#1 1 NA 4.1
#2 1 NA 4.1
#3 2 date1 9.4
#4 3 date2 5.6
#5 3 date2 6.0
With base R
, we can do this more compactly
df[!duplicated(df) |is.na(df$date),]
# id date value
#1 1 <NA> 4.1
#2 1 <NA> 4.1
#3 2 date1 9.4
#5 3 date2 5.6
#6 3 date2 6.0
Upvotes: 1