vicky
vicky

Reputation: 405

Remove duplicate rows with certain value in specific column

I have a data frame and I want to remove rows that are duplicated in all columns except one column and choose to keep the ones that are not certain values.

enter image description here

In above example, 3rd row and 4th row are duplicated for all columns except for col3, so I want to keep one row only. The complicated step is I want to keep 4th row instead of 3rd because 3rd row in col3 is "excluded". In general, I want to only keep the rows(that were duplicated) that do not have "excluded".

My real data frame have lots of duplicated rows and among those 2 rows that are duplicated, one of them is "excluded" for sure.

Below is re-producible ex:

a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d <- data.frame(a,b,c)

Thank you so much!

Update: Or, when removing duplicate, only keep the second observation (4th row).

Upvotes: 4

Views: 2002

Answers (5)

mastropi
mastropi

Reputation: 1414

Regarding your edit at the end of the question:

Update: Or, when removing duplicate, only keep the second observation (4th row).

note that, in case the ordering of the rows by col3 determines that the row to keep is always the last one among the duplicate records, you can simply set fromLast=TRUE in the duplicated() function to request that rows should be flagged as duplicates starting the duplicate count from the last one found for each duplicate group.

Using a slightly modified version of your data (where I added more duplicate groups to better show that the process works in a more general case):

a <- c(1,1,2,3,3,3,7)
b <- c(4,4,5,6,6,6,8)
c <- c("excluded", "red","green","excluded", "excluded","orange","excluded")
d <- data.frame(a,b,c)

  a b        c
1 1 4 excluded
2 1 4      red
3 2 5    green
4 3 6 excluded
5 3 6 excluded
6 3 6   orange
7 7 8 excluded

using:

ind2remove = duplicated(d[,c("a", "b")], fromLast=TRUE)
(d_noduplicates = d[!ind2remove,])

we get:

  a b        c
2 1 4      red
3 2 5    green
6 3 6   orange
7 7 8 excluded

Note that this doesn't require the rows in each duplicate group to be all together in the original data. The only important thing is that you want to keep the record showing up last in the data from each duplicate group.

Upvotes: 0

SKyJim
SKyJim

Reputation: 109

dplyr with some base R should work for this:

 library(dplyr) 
 a <- c(1,2,3,3,3,7)
 b <- c(4,5,6,6,6,8)
 c <- c("red","green","brown","excluded","orange","excluded")
 d <- data.frame(a,b,c)

 d <- filter(d, !duplicated(d[,1:2]) | c!="excluded")

Result: 
  a b        c
1 1 4      red
2 2 5    green
3 3 6    brown
4 3 6   orange
5 7 8 excluded

The filter will get rid of anything that should be excluded and not duplicated. I added an example of a none unique exclude to your example('brown') to test as well.

Upvotes: 1

broti
broti

Reputation: 1382

Shortening the approach by @Klone a bit, another dplyr solution:

d %>% mutate(c = factor(c, ordered = TRUE, 
                        levels = c("red", "green", "orange", "excluded"))) %>% # Order the factor variable
  arrange(c) %>% # Sort the data frame so that excluded comes first
  group_by(a, b) %>% # Group by the two columns that determine duplicates
  mutate(id = 1:n()) %>% # Assign IDs in each group
  filter(id == 1) # Only keep one row in each group

Result:

# A tibble: 4 x 4
# Groups:   a, b [4]
      a     b c           id
  <dbl> <dbl> <ord>    <int>
1     1     4 red          1
2     2     5 green        1
3     3     6 orange       1
4     7     8 excluded     1

Upvotes: 0

Klone
Klone

Reputation: 127

Here is a possibility ... I hope it can help :)

nquit <- (d %>%
  mutate(code= 1:nrow(d)) %>%
  group_by(a, b) %>%
  mutate(nDuplicate= n()) %>%
  filter(nDuplicate > 1) %>%
  filter(c == "excluded"))$code

e <- d[-nquit]

Upvotes: 0

Chelmy88
Chelmy88

Reputation: 1116

Here is an example with a loop:

a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d<- data.frame(a,b,c)

# Give row indices of duplicated rows (only the second and more occurence are given)
duplicated_rows=which(duplicated(d[c("a","b")]))

to_remove=c()
# Loop over different duplicated rows
for(i in duplicated_rows){
  # Find simmilar rows
  selection=which(d$a==d$a[i] & d$b==d$b[i])
  # Sotre indices of raw in the set of duplicated row whihc are "excluded"
  to_remove=c(to_remove,selection[which(d$c[selection]=="excluded")])
}

# Remove rows
d=d[-to_remove,]

print(d)

> a b       c
> 1 4      red
> 2 2 5   green
> 4 3 6   orange
> 5 7 8  excluded

Upvotes: 0

Related Questions