Python Newbie

Reputation: 277

R: Subsetting data more efficiently

I have a dataset df:

df=data.frame(rbind(c("A",1,1,"abc"),
                    c("B",0,0,"def"),
                    c("C",0,1,"hep"),
                    c("A",1,1,"hit"),
                    c("B",0,1,"occ"),
                    c("C",1,1,"tem"),
                    c("A",1,1,"twi"),
                    c("B",1,1,"twa"),
                    c("C",1,1,"mit"),
                    c("A",1,1,"mot"),
                    c("C",1,1,"mot"),
                    c("B",1,1,"mjak")))
names(df)=c("id","v1","v2","check")

I want to subset df so that it keeps only the ids whose values in the "check" column are all included in the "ch.vars" vector.

ch.vars=c("abc","hit","mot","twi","mjak")

If an id contains any values other than those given in "ch.vars", it is to be excluded from the dataset. For example, ids B and C contain other values in the check column, so they are excluded from the subset.

Here is what I have tried so far:

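library(dplyr)  # arrange() and filter() come from dplyr

# Flag rows whose check value is in ch.vars
df$check.var=ifelse(df$check %in% ch.vars,1,0)
df=arrange(df,id)

# Collect ids that have at least one value outside ch.vars
st1=filter(df,check.var==0)
st1=as.character(unique(st1$id))

# Drop those ids
df2=df[!df$id %in% st1,]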

> df2
  id v1 v2 check check.var
1  A  1  1   abc         1
2  A  1  1   hit         1
3  A  1  1   twi         1
4  A  1  1   mot         1

This works, but I was wondering if there is a more efficient way to do this, i.e., to achieve the result in fewer steps. Thank you!

Upvotes: 1

Views: 67

Answers (2)

cryo111

Reputation: 4474

And a data.table solution:

library(data.table)
data.table(df)[,.SD[all(check%in%ch.vars)],by="id"]
#   id v1 v2 check
#1:  A  1  1   abc
#2:  A  1  1   hit
#3:  A  1  1   twi
#4:  A  1  1   mot

You can also call setkey on id first, which may make the grouping faster on larger data.
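A minimal sketch of the setkey variant, assuming the example data from the question. The data is rebuilt here so the block runs on its own, and v1 and v2 are numeric here rather than character. The names dt and res are only illustrative, and whether keying helps will depend on the size of the data.

```r
library(data.table)

# Example data from the question, rebuilt column by column
# (assumption: v1/v2 stored as numbers instead of strings)
df <- data.frame(
  id    = c("A","B","C","A","B","C","A","B","C","A","C","B"),
  v1    = c(1,0,0,1,0,1,1,1,1,1,1,1),
  v2    = c(1,0,1,1,1,1,1,1,1,1,1,1),
  check = c("abc","def","hep","hit","occ","tem","twi","twa","mit","mot","mot","mjak"),
  stringsAsFactors = FALSE
)
ch.vars <- c("abc","hit","mot","twi","mjak")

dt <- as.data.table(df)
setkey(dt, id)  # sorts dt by id in place and marks id as the key

# Keep a group only if every check value in it is in ch.vars
res <- dt[, .SD[all(check %in% ch.vars)], by = id]
```

Note that setkey reorders the rows of dt by reference, so the original row order is not preserved.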

Upvotes: 3

David Robinson

Reputation: 78610

You can do this with group_by and filter in the dplyr package:

library(dplyr)
df2 = df %>%
  group_by(id) %>%
  filter(all(check %in% ch.vars))
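If loading a package is not an option, the same "all values per id" test can be sketched in base R with ave. This is an alternative technique, not the dplyr approach above. The example data is rebuilt so the block runs on its own, and the name keep is only illustrative.

```r
# Example data from the question, rebuilt column by column
df <- data.frame(
  id    = c("A","B","C","A","B","C","A","B","C","A","C","B"),
  v1    = c(1,0,0,1,0,1,1,1,1,1,1,1),
  v2    = c(1,0,1,1,1,1,1,1,1,1,1,1),
  check = c("abc","def","hep","hit","occ","tem","twi","twa","mit","mot","mot","mjak"),
  stringsAsFactors = FALSE
)
ch.vars <- c("abc","hit","mot","twi","mjak")

# For each row, TRUE only if every check value of that row's id is in ch.vars
keep <- as.logical(ave(df$check %in% ch.vars, df$id, FUN = all))

# Keep the rows belonging to ids that pass the test
df2 <- df[keep, ]
```

Unlike the grouped versions, this keeps the original row order and row names of df.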

Upvotes: 3
