Reputation: 399
I have a data.frame that contains duplicate observations. How do I delete all of the duplicated rows based on the first column? That is, if two or more rows share the same value in the first column, delete all of those rows entirely.
> a=c(1,4,5,5,6,6)
> b=c(2,5,7,4,4,2)
> c=c("a","b","c","a","b","c")
> test=data.frame(a,b,c)
> test
a b c
1 1 2 a
2 4 5 b
3 5 7 c
4 5 4 a
5 6 4 b
6 6 2 c
I don't want to keep any of the duplicate rows so that my final output will be
a b c
1 1 2 a
2 4 5 b
I've tried the unique
and duplicated
functions, but they both keep the first row of each duplicate set (i.e., if there are 5 duplicate records, only 4 of them are deleted), like
a b c
1 1 2 a
2 4 5 b
3 5 7 c
4 6 4 b
What should I do? Thanks!
Upvotes: 1
Views: 2275
Reputation: 1179
Easy one-step removal of duplicates:
my_df <- my_df[!duplicated(my_df), ]
(Prefer !duplicated() over -which(duplicated()): when there are no duplicates, which() returns integer(0) and my_df[-integer(0), ] drops every row. Also note this keeps the first occurrence of each duplicate rather than removing all of them, which is what the question asks for.)
Upvotes: -1
Reputation: 21507
Using dplyr
library(dplyr)
test <- test %>% group_by(a) %>% filter(n()==1)
test
a b c
1 1 2 a
2 4 5 b
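Since dplyr 1.1.0 the grouping can also be supplied per operation with the .by argument, which avoids leaving the result grouped afterwards (a variant sketch of the same approach, assuming dplyr >= 1.1.0 is installed):

```r
library(dplyr)

a <- c(1, 4, 5, 5, 6, 6)
b <- c(2, 5, 7, 4, 4, 2)
c <- c("a", "b", "c", "a", "b", "c")
test <- data.frame(a, b, c)

# Group by `a` only for this filter(): keep rows whose group has size 1
result <- test %>% filter(n() == 1, .by = a)
result
#   a b c
# 1 1 2 a
# 2 4 5 b
```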
Upvotes: 2
Reputation: 12411
You first look up the first-column values of the duplicated rows:
val <- test[duplicated(test[, 1]), 1]
val
[1] 5 6
Then you search for the rows in which these values can be found
rows <- test[, 1] %in% val
rows
[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
Then you select all rows except these:
test[!rows, ]
a b c
1 1 2 a
2 4 5 b
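The three steps above collapse into a single subset expression (the same base-R logic, just written in one line):

```r
a <- c(1, 4, 5, 5, 6, 6)
b <- c(2, 5, 7, 4, 4, 2)
c <- c("a", "b", "c", "a", "b", "c")
test <- data.frame(a, b, c)

# test[duplicated(test[, 1]), 1] are the first-column values that repeat;
# drop every row whose first-column value is among them
result <- test[!(test[, 1] %in% test[duplicated(test[, 1]), 1]), ]
result
#   a b c
# 1 1 2 a
# 2 4 5 b
```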
Upvotes: 1
Reputation: 179558
You can use table()
to get a frequency table of your column, then use the result to subset:
singletons <- names(which(table(test$a) == 1))
test[test$a %in% singletons, ]
a b c
1 1 2 a
2 4 5 b
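An equivalent base-R one-liner attaches each group's size directly with ave(), which counts occurrences of test$a per row (the same frequency idea as table(), without the names()/which() step):

```r
a <- c(1, 4, 5, 5, 6, 6)
b <- c(2, 5, 7, 4, 4, 2)
c <- c("a", "b", "c", "a", "b", "c")
test <- data.frame(a, b, c)

# ave() returns, for each row, the size of that row's group in column a;
# keep only rows belonging to groups of size 1
result <- test[ave(seq_along(test$a), test$a, FUN = length) == 1, ]
result
#   a b c
# 1 1 2 a
# 2 4 5 b
```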
Upvotes: 3
Reputation: 3525
Strange request, but if you want to remove every row that repeats an earlier value in any one of the columns, with each column considered independently:
test[!duplicated(test$a) & !duplicated(test$b) & !duplicated(test$c), ]
a b c
1 1 2 a
2 4 5 b
3 5 7 c
But I don't see how '5 7 c' is a duplicate in your example.
Upvotes: 0