Reputation: 45

Conditional subset data frame based on value in multiple columns

I have the following dataset (example)

idnumber=c(12,12,13,14,14,15,16,17,18,18)
reg = c('FR','FR','DE','US','US','TZ','MK','GR','ES','ES')
code1=c('F56','G76','G56','T78','G78','G76','G64','T65','G79','G56')
code2=c('G56','I89','J83','S46','D78','G56','H89','G56','W34','T89')
df = data.frame(idnumber,reg,code1,code2)

which gives:


  idnumber reg code1 code2
1   12     FR   F56   G56
2   12     FR   G76   I89
3   13     DE   G56   J83
4   14     US   T78   S46
5   14     US   G78   D78
6   15     TZ   G76   G56
7   16     MK   G64   H89
8   17     GR   T65   G56
9   18     ES   G79   W34
10  18     ES   G56   T89

I would like to subset df keeping only the raws where the value G56 appears in column code1 or code 2, though keeping the raw idnumber if the id value is the same id value matching with the value G56 such as:


  idnumber reg code1 code2
1   12     FR   F56   G56
2   12     FR   G76   I89
3   13     DE   G56   J83
6   15     TZ   G76   G56
8   17     GR   T65   G56
9   18     ES   G79   W34
10  18     ES   G56   T89

I have millions of observations and around 30 code columns. Hope the question is clear enough, any suggestion will be welcomed!

Cheers

Upvotes: 1

Answers (4)

Edward

Reputation: 19359

library(dplyr)
df %>%
  semi_join(df %>%
  filter(code1=="G56" | code2=="G56"),by="idnumber")

  idnumber reg code1 code2
1       12  FR   F56   G56
2       12  FR   G76   I89
3       13  DE   G56   J83
4       15  TZ   G76   G56
5       17  GR   T65   G56
6       18  ES   G79   W34
7       18  ES   G56   T89

Edit: This may be simpler with 30 code columns:

df %>%
  semi_join(df %>%
              pivot_longer(cols=-c(idnumber, reg)) %>%
              filter(value=="G56") %>%
              pivot_wider(id_cols=c(idnumber, reg)),
            by="idnumber")

Third choice:

df %>%
  semi_join(df %>%
              filter_at(vars(starts_with("code")), any_vars(. == "G56")),
            by="idnumber")

Edit: OP now wants to filter records if "G56" appears at least twice in the "code" columns (see comment below)

df %>%
  semi_join(df %>%
  mutate(n=rowSums(.[grep("code", names(.))] =="G56")) %>%
  group_by(idnumber) %>%
  filter(sum(n)>1),
  by="idnumber")

  idnumber reg code1 code2 code3
1       12  FR   F56   G56   M56
2       12  FR   G76   I89   G56
3       18  ES   G79   W34   W33
4       18  ES   G56   G56   T89

Data:

idnumber=c(12,12,13,14,14,15,16,17,18,18)
reg = c('FR','FR','DE','US','US','TZ','MK','GR','ES','ES')
code1=c('F56','G76','G56','T78','G78','G76','G64','T65','G79','G56')
code2=c('G56','I89','J83','S46','D78','G56','H89','G56','W34','G56')
code3=c('M56','G56','J83','S46','D78','G46','H89','J56','W33','T89')
df = data.frame(idnumber,reg,code1,code2,code3)

Upvotes: 0

ThomasIsCoding

Reputation: 102609

Another base R solution

subset(df,`class<-`(ave(cbind(as.character(code1),as.character(code2)),
                      idnumber,
                      FUN = function(v) ifelse("G56"%in%v,TRUE,FALSE)),"logical")[,1])

such that

   idnumber reg code1 code2
1        12  FR   F56   G56
2        12  FR   G76   I89
3        13  DE   G56   J83
6        15  TZ   G76   G56
8        17  GR   T65   G56
9        18  ES   G79   W34
10       18  ES   G56   T89

Upvotes: 1

Darren Tsai

Reputation: 35604

1. base

subset(df, idnumber %in% idnumber[code1=="G56" | code2=="G56"])

2. dplyr

library(dplyr)

df %>% filter(idnumber %in% idnumber[code1=="G56" | code2=="G56"])

Output

#   idnumber reg code1 code2
# 1       12  FR   F56   G56
# 2       12  FR   G76   I89
# 3       13  DE   G56   J83
# 4       15  TZ   G76   G56
# 5       17  GR   T65   G56
# 6       18  ES   G79   W34
# 7       18  ES   G56   T89

Upvotes: 1

YOLO

Reputation: 21749

Here's a way to do:

library(data.table)
setDT(df)
df[,.SD[any(code1 == 'G56' | code2 == 'G56')] ,.(idnumber)]

   idnumber reg code1 code2
1:       12  FR   F56   G56
2:       12  FR   G76   I89
3:       13  DE   G56   J83
4:       15  TZ   G76   G56
5:       17  GR   T65   G56
6:       18  ES   G79   W34
7:       18  ES   G56   T89

Upvotes: 2

Conditional subset data frame based on value in multiple columns

Answers (4)

Data:

Related Questions