Rolandp

Reputation: 11

How to mark/delete specific duplicates based on other variable(s)

I would like to know how I can delete specific rows based on the value in one column, where the deletion depends on the other values in the same group (rows sharing an id). I would like to delete "aja" rows when the same id also contains "ase". If a group contains only "ase" or only "aja", the script should leave it alone. The subgroup column below marks which rows the script should delete.

   id  somedata  subgroup
1  1   "aja"     okay
2  1   "aja"     okay
3  2   "ase"     okay
4  2   "aja"     delete
5  3   "aja"     delete
6  3   "ase"     okay
7  4   "aja"     okay
8  4   "aja"     okay
9  5   "ase"     okay
10 5   "ase"     okay
11 6   "aja"     delete
12 6   "ase"     okay




Code to generate the data

    id = c(1,1,2,2,3,3,4,4,5,5,6,6)
    somedata = c("aja","aja","ase","aja","aja","ase","aja","aja","ase","ase","aja","ase")
    subgroup = c("okay","okay","okay","DELETE","DELETE","okay","okay","okay","okay","okay","DELETE","okay")
    proov = data.frame(cbind(id,somedata,subgroup))

Upvotes: 0

Views: 50

Answers (3)

Marco

Reputation: 246

Without the use of any additional packages, you can use this command:

proov = proov[!(proov$id %in% unique(proov[which(proov$somedata == "ase"), "id"]) & proov$somedata == "aja"),]
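The one-liner is easier to check when split into named steps. This is a minimal sketch of the same logic, assuming the data is rebuilt with plain columns instead of `cbind()`; the names `ids_with_ase`, `drop_row` and `result` are made up for illustration:

```r
# Rebuild the example data. Assumption: columns are passed directly to
# data.frame(), whereas the question uses cbind(), which turns every column into text.
proov <- data.frame(
  id = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6),
  somedata = c("aja", "aja", "ase", "aja", "aja", "ase",
               "aja", "aja", "ase", "ase", "aja", "ase"),
  stringsAsFactors = FALSE
)

# Step 1: ids of the groups that contain at least one "ase"
ids_with_ase <- unique(proov$id[proov$somedata == "ase"])

# Step 2: a row is dropped when it is "aja" AND its id is one of those groups
drop_row <- proov$somedata == "aja" & proov$id %in% ids_with_ase

# Step 3: keep everything else
result <- proov[!drop_row, ]
```

With the example data this should keep 9 of the 12 rows and drop rows 4, 5 and 11.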

Upvotes: 0

Ronak Shah

Reputation: 388862

We can group by `id` and remove rows where `somedata == "aja"` and the group contains at least one "ase":

library(dplyr)

proov %>% group_by(id) %>% filter(!(somedata == "aja" & any(somedata == "ase")))

#  id    somedata subgroup
# <fct> <fct>    <fct>   
#1 1     aja      okay    
#2 1     aja      okay    
#3 2     ase      okay    
#4 3     ase      okay    
#5 4     aja      okay    
#6 4     aja      okay    
#7 5     ase      okay    
#8 5     ase      okay    
#9 6     ase      okay    

which in base R can be written as

subset(proov, !as.logical(ave(as.character(somedata), 
               id, FUN = function(x) x == "aja" & any(x == "ase"))))
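The `as.logical()` wrapper is needed because `ave()` writes its result back into a vector of the same type as its input, so with character input the logical flags come back as text. A small sketch on a four-row toy vector (the toy values and the names `flags` and `keep` are assumptions, not part of the question's data):

```r
# Toy input: group 2 mixes "ase" and "aja", group 4 has only "aja"
somedata <- c("ase", "aja", "aja", "aja")
id <- c(2, 2, 4, 4)

# Per group: flag "aja" rows, but only if the group also contains "ase"
flags <- ave(somedata, id, FUN = function(x) x == "aja" & any(x == "ase"))

# flags is a character vector here, so convert it before negating
keep <- !as.logical(flags)
```

Only the second row (the "aja" that shares group 2 with an "ase") ends up with `keep` set to `FALSE`.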

Upvotes: 2

Sotos

Reputation: 51582

You can do a simple filtering, i.e.

library(dplyr)

proov %>% 
 group_by(id) %>% 
 filter(!(n_distinct(somedata) > 1 & somedata == 'aja'))

which gives,

# A tibble: 9 x 3
# Groups:   id [6]
  id    somedata subgroup
  <fct> <fct>    <fct>   
1 1     aja      okay    
2 1     aja      okay    
3 2     ase      okay    
4 3     ase      okay    
5 4     aja      okay    
6 4     aja      okay    
7 5     ase      okay    
8 5     ase      okay    
9 6     ase      okay    
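One caveat worth checking: `n_distinct(somedata) > 1` matches "the group contains ase" only while `somedata` takes just these two values. A minimal base R sketch shows how the two rules diverge once a third value appears. The value "muu", group 7, and all variable names are made up for illustration, and `ave()` stands in for the dplyr grouping:

```r
# Hypothetical data: group 6 is from the question, group 7 pairs "aja"
# with a third value "muu" that does not occur in the original data.
extra <- data.frame(
  id = c(6, 6, 7, 7),
  somedata = c("aja", "ase", "aja", "muu"),
  stringsAsFactors = FALSE
)

# Rule A: group has more than one distinct value (stand-in for n_distinct() > 1)
n_values <- ave(seq_along(extra$id), extra$id,
                FUN = function(i) length(unique(extra$somedata[i])))
drop_a <- n_values > 1 & extra$somedata == "aja"

# Rule B: group contains at least one "ase"
has_ase <- ave(extra$somedata == "ase", extra$id, FUN = any)
drop_b <- has_ase & extra$somedata == "aja"
```

Rule A also drops the "aja" in group 7, while rule B drops only the one in group 6. If the real data can hold more than two values, the `any(somedata == "ase")` form is the safer match for the question as stated.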

Upvotes: 2
