Reputation: 45
I have the following dataset (example)
idnumber=c(12,12,13,14,14,15,16,17,18,18)
reg = c('FR','FR','DE','US','US','TZ','MK','GR','ES','ES')
code1=c('F56','G76','G56','T78','G78','G76','G64','T65','G79','G56')
code2=c('G56','I89','J83','S46','D78','G56','H89','G56','W34','T89')
df = data.frame(idnumber,reg,code1,code2)
which gives:
idnumber reg code1 code2
1 12 FR F56 G56
2 12 FR G76 I89
3 13 DE G56 J83
4 14 US T78 S46
5 14 US G78 D78
6 15 TZ G76 G56
7 16 MK G64 H89
8 17 GR T65 G56
9 18 ES G79 W34
10 18 ES G56 T89
I would like to subset df
keeping only the raws where the value G56
appears in column code1
or code 2
, though keeping the raw idnumber
if the id value is the same id value matching with the value G56
such as:
idnumber reg code1 code2
1 12 FR F56 G56
2 12 FR G76 I89
3 13 DE G56 J83
6 15 TZ G76 G56
8 17 GR T65 G56
9 18 ES G79 W34
10 18 ES G56 T89
I have millions of observations and around 30 code
columns.
Hope the question is clear enough, any suggestion will be welcomed!
Cheers
Upvotes: 1
Views: 1304
Reputation: 19359
library(dplyr)
df %>%
semi_join(df %>%
filter(code1=="G56" | code2=="G56"),by="idnumber")
idnumber reg code1 code2
1 12 FR F56 G56
2 12 FR G76 I89
3 13 DE G56 J83
4 15 TZ G76 G56
5 17 GR T65 G56
6 18 ES G79 W34
7 18 ES G56 T89
Edit: This may be simpler with 30 code columns:
df %>%
semi_join(df %>%
pivot_longer(cols=-c(idnumber, reg)) %>%
filter(value=="G56") %>%
pivot_wider(id_cols=c(idnumber, reg)),
by="idnumber")
Third choice:
df %>%
semi_join(df %>%
filter_at(vars(starts_with("code")), any_vars(. == "G56")),
by="idnumber")
Edit: OP now wants to filter records if "G56" appears at least twice in the "code" columns (see comment below)
df %>%
semi_join(df %>%
mutate(n=rowSums(.[grep("code", names(.))] =="G56")) %>%
group_by(idnumber) %>%
filter(sum(n)>1),
by="idnumber")
idnumber reg code1 code2 code3
1 12 FR F56 G56 M56
2 12 FR G76 I89 G56
3 18 ES G79 W34 W33
4 18 ES G56 G56 T89
idnumber=c(12,12,13,14,14,15,16,17,18,18)
reg = c('FR','FR','DE','US','US','TZ','MK','GR','ES','ES')
code1=c('F56','G76','G56','T78','G78','G76','G64','T65','G79','G56')
code2=c('G56','I89','J83','S46','D78','G56','H89','G56','W34','G56')
code3=c('M56','G56','J83','S46','D78','G46','H89','J56','W33','T89')
df = data.frame(idnumber,reg,code1,code2,code3)
Upvotes: 0
Reputation: 102609
Another base R solution
subset(df,`class<-`(ave(cbind(as.character(code1),as.character(code2)),
idnumber,
FUN = function(v) ifelse("G56"%in%v,TRUE,FALSE)),"logical")[,1])
such that
idnumber reg code1 code2
1 12 FR F56 G56
2 12 FR G76 I89
3 13 DE G56 J83
6 15 TZ G76 G56
8 17 GR T65 G56
9 18 ES G79 W34
10 18 ES G56 T89
Upvotes: 1
Reputation: 35604
1. base
subset(df, idnumber %in% idnumber[code1=="G56" | code2=="G56"])
2. dplyr
library(dplyr)
df %>% filter(idnumber %in% idnumber[code1=="G56" | code2=="G56"])
Output
# idnumber reg code1 code2
# 1 12 FR F56 G56
# 2 12 FR G76 I89
# 3 13 DE G56 J83
# 4 15 TZ G76 G56
# 5 17 GR T65 G56
# 6 18 ES G79 W34
# 7 18 ES G56 T89
Upvotes: 1
Reputation: 21749
Here's a way to do:
library(data.table)
setDT(df)
df[,.SD[any(code1 == 'G56' | code2 == 'G56')] ,.(idnumber)]
idnumber reg code1 code2
1: 12 FR F56 G56
2: 12 FR G76 I89
3: 13 DE G56 J83
4: 15 TZ G76 G56
5: 17 GR T65 G56
6: 18 ES G79 W34
7: 18 ES G56 T89
Upvotes: 2