Reputation: 353
I would like to know how I can delete duplicate rows based on the identifier number in the first column of a data frame. Most functions, like duplicated() and unique(), check every single value in a row in order to identify duplicate rows. I, on the other hand, am interested in identifying duplicates on the basis of a single column only.
Here's an example:
ID Test Date Taken
1 POS 1/1/15
1 POS 2/8/14
2 NEG 7/9/13
2 NEG 4/10/12
2 NEG 2/5/08
and the desired result:
ID Test Date Taken
1 POS 1/1/15
2 NEG 7/9/13
Upvotes: 0
Views: 5027
Reputation: 1205
I think you actually want to use a filter() operation for this, in combination with arrange(). For example:
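library(dplyr)
df %>%
  arrange(desc(`Date Taken`)) %>%
  group_by(ID) %>%
  # row_number() with no argument is the row's position within its group
  filter(row_number() == 1)
would get you the most recent observation for each ID, provided `Date Taken` is stored as a Date. If it is a character column, it sorts alphabetically rather than chronologically, so convert it first.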
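As a self-contained sketch of the above, using the sample data from the question. The date format (month/day/two-digit year) is an assumption on my part, and the helper column `parsed` is a made-up name used only for sorting:

```r
library(dplyr)

# Sample data from the question; check.names = FALSE keeps the space in `Date Taken`.
df <- data.frame(
  ID = c(1, 1, 2, 2, 2),
  Test = c("POS", "POS", "NEG", "NEG", "NEG"),
  `Date Taken` = c("1/1/15", "2/8/14", "7/9/13", "4/10/12", "2/5/08"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

latest <- df %>%
  # Assumption: dates are month/day/two-digit year; adjust the format if not.
  mutate(parsed = as.Date(`Date Taken`, format = "%m/%d/%y")) %>%
  arrange(desc(parsed)) %>%
  group_by(ID) %>%
  # Keep the first row in each group, i.e. the latest parsed date.
  filter(row_number() == 1) %>%
  ungroup() %>%
  select(-parsed)
```

With this sample data, `latest` should hold one row per ID, matching the desired result in the question.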
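You could also use summarise():
df %>%
  arrange(desc(`Date Taken`)) %>%
  group_by(ID) %>%
  summarise(Test = first(Test))
if you didn't care about Date Taken making it into the result. (ID is the grouping variable, so it is kept automatically and cannot be reassigned inside summarise().)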
Upvotes: 1
Reputation: 7742
You can use the duplicated() function. If df is your data frame:
df[!duplicated(df$ID), ]
will return the first row for each ID (duplicates are identified by ID only here):
ID Test Date Taken
1 POS 1/1/15
2 NEG 7/9/13
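A minimal base R sketch on the question's sample data, in case it helps to see it run end to end. The variable names `first_per_id` and `last_per_id` are just illustrative. Note that duplicated() flags every occurrence after the first, so negating it keeps the first row seen for each ID, and the `fromLast = TRUE` argument keeps the last one instead:

```r
# Sample data from the question; check.names = FALSE keeps the space in `Date Taken`.
df <- data.frame(
  ID = c(1, 1, 2, 2, 2),
  Test = c("POS", "POS", "NEG", "NEG", "NEG"),
  `Date Taken` = c("1/1/15", "2/8/14", "7/9/13", "4/10/12", "2/5/08"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Keep the first row encountered for each ID.
first_per_id <- df[!duplicated(df$ID), ]

# Keep the last row encountered for each ID instead.
last_per_id <- df[!duplicated(df$ID, fromLast = TRUE), ]
```

Which row survives depends only on row order, so sort the data frame first if you need a particular one.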
Upvotes: 1
Reputation: 886938
We can use unique() with the by argument, which keeps the first row for each ID:
library(data.table)
unique(setDT(df1), by = "ID")
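For a runnable version, here is a sketch that builds `df1` from the question's sample data (the construction of `df1` and the result name `dt` are my assumptions, since the original post does not show them):

```r
library(data.table)

# Sample data from the question; check.names = FALSE keeps the space in `Date Taken`.
df1 <- data.frame(
  ID = c(1, 1, 2, 2, 2),
  Test = c("POS", "POS", "NEG", "NEG", "NEG"),
  `Date Taken` = c("1/1/15", "2/8/14", "7/9/13", "4/10/12", "2/5/08"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# setDT() converts df1 to a data.table by reference;
# unique() with by = "ID" then keeps the first row for each ID.
dt <- unique(setDT(df1), by = "ID")
```

Since the first row per ID is kept, the rows need to be in the order you want before calling unique().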
Upvotes: 1