Reputation: 877

Eliminate all duplicates except NA values in R

I would like to eliminate all duplicates except NA values.

I have this file:

Name   weight
John   10
John   12
NA     12
NA     12
NA     13
Peter  15
Andy   16
Clark  17

And I need this:

Name   weight
NA     12
NA     12
NA     13
Peter  15
Andy   16
Clark  17

I tried this code:

New.dt=dt[!(duplicated(dt$Name) | duplicated(dt$Name, fromLast = TRUE)), ]

But I get this:

Name   weight
Peter  15
Andy   16
Clark  17

And I want to keep the NA values.

Upvotes: 2

Answers (2)

r2evans

Reputation: 160407

The double-tap of duplicated in the other answer is faster (I had expected duplicated to be slightly less efficient with larger data), so I suggest you go with that answer.

My answer is kept for the record.

One problem with using duplicated is that a single call will never remove all copies of a value: once it has removed all but one of them, the remaining one is no longer duplicated.
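To see why a single call is not enough, here is a small sketch on a vector that mirrors the Name column from the question (nm, fwd, bwd and in_dup_group are names made up for illustration):

```r
# Hypothetical vector mirroring the Name column from the question
nm <- c("John", "John", NA, NA, NA, "Peter", "Andy", "Clark")

# Forward pass: flags every repeat after the first occurrence
fwd <- duplicated(nm)
# Backward pass: flags every repeat before the last occurrence
bwd <- duplicated(nm, fromLast = TRUE)

# TRUE for every value that occurs more than once;
# note that duplicated() treats NA as equal to NA
in_dup_group <- fwd | bwd

print(fwd)
print(in_dup_group)
```

The first John and the first NA are not flagged by the forward pass alone, which is why the question's code combines both directions, and also why that code drops the NA rows.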

A one-liner:

x[ !x$Name %in% names(Filter(c, table(x$Name, useNA = "no") - 1)), ]
#    Name weight
# 3  <NA>     12
# 4  <NA>     12
# 5  <NA>     13
# 6 Peter     15
# 7  Andy     16
# 8 Clark     17

Explanation:

table(x$Name, ...) will give you a named vector with the count of each element within the Name column;
and though it is the default, I'm adding table(..., useNA="no") to be explicit, this means that NA values are not included in the returned vector of counts (thereby meeting your "except NA values" constraint);
Filter(c, ...) filters the named vector based on a truthy-value of the contents, where "0" is considered non-truthy (and therefore removed) ... but since table will always return 1 or more (because it has to find one to include it in the list), ...
I do table(...) - 1 to reduce all singles (count of 1) to 0, so that the Filter(c,...) part can work;
names(...) returns the Name values that have an effective count of 2 or more; and
!x$Name %in% ... does the actual removal.
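The steps above can be sketched one at a time. This is only an unrolled version of the one-liner; the intermediate names (counts, extra, repeated, keep, result) are invented here, and the data frame is rebuilt inline so the snippet runs on its own:

```r
# Same data as in the question, built inline so the sketch is self-contained
x <- data.frame(
  Name   = c("John", "John", NA, NA, NA, "Peter", "Andy", "Clark"),
  weight = c(10, 12, 12, 12, 13, 15, 16, 17),
  stringsAsFactors = FALSE
)

counts   <- table(x$Name, useNA = "no")  # counts per non-NA name
extra    <- counts - 1                   # singles drop to 0, repeats stay positive
repeated <- names(Filter(c, extra))      # names whose count was 2 or more
keep     <- !x$Name %in% repeated        # NA is never in `repeated`, so NA rows stay
result   <- x[keep, ]

print(repeated)
print(result)
```

With this data, repeated holds only "John", so the two John rows are removed and everything else, including the three NA rows, is kept in its original order.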

Data

x <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Name weight
John    10
John    12
NA      12
NA      12
NA      13
Peter   15
Andy    16
Clark   17")

Upvotes: 1

mzakaria

Reputation: 649

Quick and dirty

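# Keep rows whose Name occurs exactly once (this also drops repeated NAs)
New.dt = dt[!(duplicated(dt$Name) | duplicated(dt$Name, fromLast = TRUE)), ]
# Collect every row whose Name is NA
dt2 = dt[is.na(dt$Name), ]
# Append the NA rows after the unique non-NA rows
rbind(New.dt, dt2)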
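A possible single-expression variant of the same idea, offered as a sketch rather than as the answer's own method: combining the is.na test with the double duplicated call keeps the rows in their original order, and a lone NA row would not be appended a second time by rbind. The data frame dt is rebuilt here from the question's data, and dup is an illustrative name:

```r
# Data from the question, rebuilt so the sketch runs on its own
dt <- data.frame(
  Name   = c("John", "John", NA, NA, NA, "Peter", "Andy", "Clark"),
  weight = c(10, 12, 12, 12, 13, 15, 16, 17),
  stringsAsFactors = FALSE
)

# TRUE for every row whose Name occurs more than once (NAs included)
dup <- duplicated(dt$Name) | duplicated(dt$Name, fromLast = TRUE)

# Keep a row if its Name is NA, or if its Name is not repeated
New.dt <- dt[is.na(dt$Name) | !dup, ]

print(New.dt)
```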

Upvotes: 4
