Davy Kavanagh

Reputation: 4939

duplicates in multiple columns

I have a data frame like so

> df
  a  b c    d
1 1  2 A 1001
2 2  4 B 1002
3 3  6 B 1002
4 4  8 C 1003
5 5 10 D 1004
6 6 12 D 1004
7 7 13 E 1005
8 8 14 E 1006

I want to remove the rows where there are repeated values in column c AND column d. So in this example rows 2, 3, 5, and 6 would be removed.

I have used this, which works:

df[!(df$c %in% df$c[duplicated(df$c)] & df$d %in% df$d[duplicated(df$d)]),]
> df
  a  b c    d
1 1  2 A 1001
4 4  8 C 1003
7 7 13 E 1005
8 8 14 E 1006

but it seems clunky and I can't help but think there is a better way. Any suggestions?

In case anyone wants to re-create the data frame, here is the dput:

df <- data.frame(
  a = seq(1, 8, by = 1),
  b = c(2, 4, 6, 8, 10, 12, 13, 14),
  c = factor(c("A", "B", "B", "C", "D", "D", "E", "E")),
  d = c(1001, 1002, 1002, 1003, 1004, 1004, 1005, 1006)
)
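As an aside, the `%in%`-based filter tests columns c and d independently rather than as a (c, d) pair, so it can remove a row whose combination is unique. A minimal sketch, with a hypothetical extra row added to the question's data, showing where the two readings diverge:

```r
# Rebuild the question's data frame, plus one hypothetical row (B, 1004):
# "B" repeats in c and 1004 repeats in d, but the pair (B, 1004) is unique.
df <- data.frame(
  a = seq(1, 8, by = 1),
  b = c(2, 4, 6, 8, 10, 12, 13, 14),
  c = factor(c("A", "B", "B", "C", "D", "D", "E", "E")),
  d = c(1001, 1002, 1002, 1003, 1004, 1004, 1005, 1006)
)
df2 <- rbind(df, data.frame(a = 9, b = 15, c = "B", d = 1004))

# Column-wise test: drops the new row, since its c value and its d value
# are each duplicated somewhere, just not in the same other row
by_column <- df2[!(df2$c %in% df2$c[duplicated(df2$c)] &
                   df2$d %in% df2$d[duplicated(df2$d)]), ]

# Pair-wise test: treats the two columns as one key, so the new row survives
pair_key <- df2[c("c", "d")]
by_pair <- df2[!(duplicated(pair_key) | duplicated(pair_key, fromLast = TRUE)), ]

by_column$a  # 1 4 7 8    -- row 9 is gone
by_pair$a    # 1 4 7 8 9  -- row 9 survives
```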

Upvotes: 36

Views: 63822

Answers (2)

Tunn

Reputation: 1536

Make a new object with the two columns:

df_dups <- df[c("c", "d")]

Now apply it to your main df:

df[!duplicated(df_dups),]

This looks neater and makes it easy to see or change the columns you are using.
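A self-contained sketch of this approach, rebuilding the data frame from the question. Note that a single `duplicated()` call keeps the first row of each repeated (c, d) pair:

```r
df <- data.frame(
  a = seq(1, 8, by = 1),
  b = c(2, 4, 6, 8, 10, 12, 13, 14),
  c = factor(c("A", "B", "B", "C", "D", "D", "E", "E")),
  d = c(1001, 1002, 1002, 1003, 1004, 1004, 1005, 1006)
)

# Subset of the columns that define a "duplicate"
df_dups <- df[c("c", "d")]

# duplicated() flags only the second and later occurrences, so the first
# row of each (c, d) pair survives: rows with a = 1, 2, 4, 5, 7, 8
kept <- df[!duplicated(df_dups), ]
kept
```

To drop every copy of a repeated pair, as the question asks, combine this with a second `fromLast = TRUE` pass as in the other answer.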

Upvotes: 28

Sven Hohenstein

Reputation: 81683

It works if you use duplicated twice:

df[!(duplicated(df[c("c","d")]) | duplicated(df[c("c","d")], fromLast = TRUE)), ]

  a  b c    d
1 1  2 A 1001
4 4  8 C 1003
7 7 13 E 1005
8 8 14 E 1006
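A small stylistic variant (my phrasing, not from the answer) that subsets the key columns once and names the flag, avoiding the repeated `df[c("c", "d")]`:

```r
df <- data.frame(
  a = seq(1, 8, by = 1),
  b = c(2, 4, 6, 8, 10, 12, 13, 14),
  c = factor(c("A", "B", "B", "C", "D", "D", "E", "E")),
  d = c(1001, 1002, 1002, 1003, 1004, 1004, 1005, 1006)
)

key <- df[c("c", "d")]

# TRUE for every row whose (c, d) pair occurs more than once:
# duplicated() marks the later copies, fromLast = TRUE marks the earlier ones
dup_any <- duplicated(key) | duplicated(key, fromLast = TRUE)

df[!dup_any, ]  # rows with a = 1, 4, 7, 8
```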

Upvotes: 36
