Reputation: 353

Delete rows in R data.frame based on duplicate values in one column only

I would like to know how I can delete duplicate rows based on the identifier number in the first column of a data frame. Functions like duplicated() and unique(), when applied to a whole data frame, compare every value in a row to identify duplicates. I, on the other hand, want to identify duplicates on the basis of a single column only.

Here's an example:

ID  Test   Date Taken
1   POS    1/1/15
1   POS    2/8/14
2   NEG    7/9/13
2   NEG    4/10/12
2   NEG    2/5/08

and the desired result:

ID  Test   Date Taken
1   POS    1/1/15
2   NEG    7/9/13

Upvotes: 0

Answers (4)

MilesMcBain

Reputation: 1205

I think you actually want a filter() operation for this, in combination with arrange().

For example:

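library(dplyr)

df %>%
  arrange(desc(`Date Taken`)) %>%
  group_by(ID) %>%
  # row_number() with no argument is the row's position within its group
  filter(row_number() == 1)

would get you the most recent observation for each ID, provided Date Taken is stored as a Date rather than as text.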

You could also use summarise() if you don't care about Date Taken making it into the result:

df %>%
  arrange(desc(`Date Taken`)) %>%
  group_by(ID) %>%
  # ID is the grouping column and is kept automatically
  summarise(Test = first(Test))
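One caveat with sorting on Date Taken: in the question the dates look like text ("1/1/15"), and text sorts alphabetically, not chronologically. Below is a minimal sketch that parses the dates first and then keeps the latest row per ID. It assumes the format is month/day/two-digit year, which the question does not confirm; the `parsed` helper column and the `latest` name are made up for illustration.

```r
library(dplyr)

# The question's example data; check.names = FALSE keeps the space in "Date Taken"
df <- data.frame(
  ID = c(1, 1, 2, 2, 2),
  Test = c("POS", "POS", "NEG", "NEG", "NEG"),
  `Date Taken` = c("1/1/15", "2/8/14", "7/9/13", "4/10/12", "2/5/08"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

latest <- df %>%
  # Assumption: dates are month/day/two-digit year
  mutate(parsed = as.Date(`Date Taken`, format = "%m/%d/%y")) %>%
  group_by(ID) %>%
  # Keep exactly one row per ID: the one with the latest parsed date
  slice_max(parsed, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-parsed)

print(latest)
```

If the dates are actually day/month/year, only the `format` string should need to change.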

Upvotes: 1

Sotos

Reputation: 51582

You can use

df[!duplicated(df$ID),]
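
To see why this works, here is a small sketch on the question's data (rebuilt by hand here, so the construction of `df` is an assumption about how the data is stored). duplicated() applied to a single vector flags every occurrence after the first, and negating it keeps the first row for each ID.

```r
# The question's example data; check.names = FALSE keeps the space in "Date Taken"
df <- data.frame(
  ID = c(1, 1, 2, 2, 2),
  Test = c("POS", "POS", "NEG", "NEG", "NEG"),
  `Date Taken` = c("1/1/15", "2/8/14", "7/9/13", "4/10/12", "2/5/08"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# TRUE marks the second and later occurrences of each ID
dup <- duplicated(df$ID)
print(dup)

# Keep only the first row seen for each ID
deduped <- df[!dup, ]
print(deduped)

# fromLast = TRUE scans from the bottom, so the last row per ID is kept instead
last_rows <- df[!duplicated(df$ID, fromLast = TRUE), ]
print(last_rows)
```

Note that "first" here means first in the current row order, so sort the data frame beforehand if a particular row should win.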

Upvotes: 3

Alexi Coard

Reputation: 7742

You can use the duplicated() function. If df is your data frame:

df[!duplicated(df$ID), ]

returns the first row for each ID (duplicates are identified by ID only):

ID  Test   Date Taken
1   POS    1/1/15
2   NEG    7/9/13

Upvotes: 1

akrun

Reputation: 886938

We can use the data.table method for unique(), which takes a by argument (df1 is the data frame here):

library(data.table)
unique(setDT(df1), by = "ID")
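
A self-contained sketch of the same idea, with the question's data rebuilt by hand as `df1`. It uses as.data.table() rather than setDT() so that the original data frame is copied instead of converted in place; that substitution and the `first_per_id` name are choices made for this illustration.

```r
library(data.table)

# The question's example data; check.names = FALSE keeps the space in "Date Taken"
df1 <- data.frame(
  ID = c(1, 1, 2, 2, 2),
  Test = c("POS", "POS", "NEG", "NEG", "NEG"),
  `Date Taken` = c("1/1/15", "2/8/14", "7/9/13", "4/10/12", "2/5/08"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# as.data.table() copies; setDT(df1) would convert df1 itself by reference
dt <- as.data.table(df1)

# by = "ID" restricts the uniqueness check to the ID column;
# the first row for each ID is kept
first_per_id <- unique(dt, by = "ID")
print(first_per_id)
```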

Upvotes: 1
