Reputation: 405
I have a data frame and I want to remove rows that are duplicated in all columns except one column and choose to keep the ones that are not certain values.
In above example, 3rd row and 4th row are duplicated for all columns except for col3, so I want to keep one row only. The complicated step is I want to keep 4th row instead of 3rd because 3rd row in col3 is "excluded". In general, I want to only keep the rows(that were duplicated) that do not have "excluded".
My real data frame have lots of duplicated rows and among those 2 rows that are duplicated, one of them is "excluded" for sure.
Below is re-producible ex:
a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d <- data.frame(a,b,c)
Thank you so much!
Update: Or, when removing duplicate, only keep the second observation (4th row).
Upvotes: 4
Views: 2002
Reputation: 1414
Regarding your edit at the end of the question:
Update: Or, when removing duplicate, only keep the second observation (4th row).
note that, in case the ordering of the rows by col3
determines that the row to keep is always the last one among the duplicate records, you can simply set fromLast=TRUE
in the duplicated()
function to request that rows should be flagged as duplicates starting the duplicate count from the last one found for each duplicate group.
Using a slightly modified version of your data (where I added more duplicate groups to better show that the process works in a more general case):
a <- c(1,1,2,3,3,3,7)
b <- c(4,4,5,6,6,6,8)
c <- c("excluded", "red","green","excluded", "excluded","orange","excluded")
d <- data.frame(a,b,c)
a b c
1 1 4 excluded
2 1 4 red
3 2 5 green
4 3 6 excluded
5 3 6 excluded
6 3 6 orange
7 7 8 excluded
using:
ind2remove = duplicated(d[,c("a", "b")], fromLast=TRUE)
(d_noduplicates = d[!ind2remove,])
we get:
a b c
2 1 4 red
3 2 5 green
6 3 6 orange
7 7 8 excluded
Note that this doesn't require the rows in each duplicate group to be all together in the original data. The only important thing is that you want to keep the record showing up last in the data from each duplicate group.
Upvotes: 0
Reputation: 109
dplyr with some base R should work for this:
library(dplyr)
a <- c(1,2,3,3,3,7)
b <- c(4,5,6,6,6,8)
c <- c("red","green","brown","excluded","orange","excluded")
d <- data.frame(a,b,c)
d <- filter(d, !duplicated(d[,1:2]) | c!="excluded")
Result:
a b c
1 1 4 red
2 2 5 green
3 3 6 brown
4 3 6 orange
5 7 8 excluded
The filter will get rid of anything that should be excluded and not duplicated. I added an example of a none unique exclude to your example('brown') to test as well.
Upvotes: 1
Reputation: 1382
Shortening the approach by @Klone a bit, another dplyr solution:
d %>% mutate(c = factor(c, ordered = TRUE,
levels = c("red", "green", "orange", "excluded"))) %>% # Order the factor variable
arrange(c) %>% # Sort the data frame so that excluded comes first
group_by(a, b) %>% # Group by the two columns that determine duplicates
mutate(id = 1:n()) %>% # Assign IDs in each group
filter(id == 1) # Only keep one row in each group
Result:
# A tibble: 4 x 4
# Groups: a, b [4]
a b c id
<dbl> <dbl> <ord> <int>
1 1 4 red 1
2 2 5 green 1
3 3 6 orange 1
4 7 8 excluded 1
Upvotes: 0
Reputation: 127
Here is a possibility ... I hope it can help :)
nquit <- (d %>%
mutate(code= 1:nrow(d)) %>%
group_by(a, b) %>%
mutate(nDuplicate= n()) %>%
filter(nDuplicate > 1) %>%
filter(c == "excluded"))$code
e <- d[-nquit]
Upvotes: 0
Reputation: 1116
Here is an example with a loop:
a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d<- data.frame(a,b,c)
# Give row indices of duplicated rows (only the second and more occurence are given)
duplicated_rows=which(duplicated(d[c("a","b")]))
to_remove=c()
# Loop over different duplicated rows
for(i in duplicated_rows){
# Find simmilar rows
selection=which(d$a==d$a[i] & d$b==d$b[i])
# Sotre indices of raw in the set of duplicated row whihc are "excluded"
to_remove=c(to_remove,selection[which(d$c[selection]=="excluded")])
}
# Remove rows
d=d[-to_remove,]
print(d)
> a b c
> 1 4 red
> 2 2 5 green
> 4 3 6 orange
> 5 7 8 excluded
Upvotes: 0