ahoosh
ahoosh

Reputation: 1360

Erasing duplicates with NA values

I have a data frame like this:

names <- c('Mike','Mike','Mike','John','John','John','David','David','David','David')
dates <- c('04-26','04-26','04-27','04-28','04-27','04-26','04-01','04-02','04-02','04-03')
values <- c(NA,1,2,4,5,6,1,2,NA,NA)
test <- data.frame(names,dates,values)

Which is:

  names dates values
1  Mike 04-26     NA
2  Mike 04-26      1
3  Mike 04-27      2
4  John 04-28      4
5  John 04-27      5
6  John 04-26      6
7 David 04-01      1
8 David 04-02      2
9 David 04-02     NA
10 David 04-03    NA

I'd like to get rid of duplicates with NA values. So, in this case, I have a valid observation from Mike on 04-26 and also have a valid observation from David on 04-02, so rows 1 and 9 should be erased and I will end up with:

      names dates values
1  Mike 04-26      1
2  Mike 04-27      2
3  John 04-28      4
4  John 04-27      5
5  John 04-26      6
6 David 04-01      1
7 David 04-02      2
8 David 04-03      NA

I tried to use duplicated function, something like this:

test[!duplicated(test[,c('names','dates')]),]

But that does not work since some NA values come before the valid value. Do you have any suggestions without trying things like merge or making another data frame?

Update: I'd like to keep rows with NA that are not duplicates.

Upvotes: 1

Views: 83

Answers (3)

lmo
lmo

Reputation: 38510

Here is an attempt in data.table:

# set up
libary(data.table)
setDT(test)

# construct condition
test[, dupes := max(duplicated(.SD)), .SDcols=c("names", "dates"), by=c("names", "dates")]

# print out result
test[dupes == 0 | !is.na(values),]

Here is a similar method using base R, except that the dupes variable is kept separately from the data.frame:

dupes <- duplicated(test[c("names", "dates")])
# this generates warnings, but works nonetheless
dupes <- ave(dupes, test$names, test$dates, FUN=max)
# print out result
test[dupes == 0 | !is.na(test$values),]

If there are duplicated rows where the values variable is NA, and these duplicates add nothing to the data, then you can drop them prior to running the code above:

testNoNADupes <- test[!(duplicated(test) & is.na(test$values)),]

Upvotes: 1

G. Cocca
G. Cocca

Reputation: 2541

This should work based on your sample.

test <- test[order(test$values),]
test <- test[!(duplicated(test$names) & duplicated(test$dates) & is.na(test$values)),] 

Upvotes: 0

DatamineR
DatamineR

Reputation: 9618

What about this way?

library(dplyr)
test %>% group_by(names, dates) %>% filter((n()>=2 & !is.na(values)) | n()==1)
Source: local data frame [8 x 3]
Groups: names, dates [8]

   names  dates values
  (fctr) (fctr)  (dbl)
1   Mike  04-26      1
2   Mike  04-27      2
3   John  04-28      4
4   John  04-27      5
5   John  04-26      6
6  David  04-01      1
7  David  04-02      2
8  David  04-03     NA

Upvotes: 3

Related Questions