user2165857

Reputation: 2690

R: subset dataframe based on column entry in multiple rows

I have a dataframe with information on several genes in a format similar to:

chr    start    end    Gene    Region
1    100    110    Bat     Exon
1    120    130    Bat     Intron
1    500    550    Ball    Upstream, Downstream
1    590    600    Ball    Intron, Upstream
1    900    980    Mit     Promoter, Upstream

I would like to subset the data to remove every row belonging to a gene that has "Exon" or "Promoter" in the Region column in any of its rows. I had been using:

Regions <- subset(Table, Region == "Intron" | Region== "DownStream" | Region =="Upstream" | Region=="DownStream,Upstream")

However this gives me:

chr    start    end    Gene    Region
1    120    130    Bat     Intron
1    500    550    Ball    Upstream, Downstream
1    590    600    Ball    Intron, Upstream

What I want is:

chr    start    end    Gene    Region
1    500    550    Ball    Upstream, Downstream
1    590    600    Ball    Intron, Upstream

Upvotes: 0

Views: 234

Answers (1)

talat

Reputation: 70256

Try this using grepl:

df[!grepl("Exon|Promoter", df$Region),]
#  chr start end Gene               Region
#2   1   120 130  Bat               Intron
#3   1   500 550 Ball Upstream, Downstream
#4   1   590 600 Ball     Intron, Upstream

It's not clear to me why you want row 2 (with "Intron") removed as well. Please explain that.

Edit:

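I think I understand now: you want to drop every row of any gene that has "Exon" or "Promoter" in at least one of its rows. Collect those genes first, then exclude them: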

temp <- df$Gene[grepl("Exon|Promoter", df$Region)]
df[!df$Gene %in% temp,]
#  chr start end Gene               Region
#3   1   500 550 Ball Upstream, Downstream
#4   1   590 600 Ball     Intron, Upstream
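
For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the example data from the question and does the gene-level filter in one pass with ave() instead of a temporary vector of gene names. The names hit, gene_hit and result are just illustrative, and I'm assuming Region is stored as a plain character column.

```r
# Rebuild the example data from the question.
df <- data.frame(
  chr    = c(1, 1, 1, 1, 1),
  start  = c(100, 120, 500, 590, 900),
  end    = c(110, 130, 550, 600, 980),
  Gene   = c("Bat", "Bat", "Ball", "Ball", "Mit"),
  Region = c("Exon", "Intron", "Upstream, Downstream",
             "Intron, Upstream", "Promoter, Upstream"),
  stringsAsFactors = FALSE
)

# Row-level flag: does this row's Region mention Exon or Promoter?
hit <- grepl("Exon|Promoter", df$Region)

# Gene-level flag: TRUE for every row of a gene with at least one flagged row.
gene_hit <- as.logical(ave(hit, df$Gene, FUN = any))

# Keep only rows of genes that were never flagged.
result <- df[!gene_hit, ]
print(result)
```

This should give the same two Ball rows as the %in% version above. Note that grepl does substring matching, so a Region value that merely contains "Exon" as part of a longer word would also be flagged.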

Upvotes: 2
