melatonin15
melatonin15

Reputation: 2269

Creating subset of rows depending on multiple values in one column

I have a data file like this, where the first two columns are corresponding gene names in human and rat(basically same name with different capitalization)

Human   Rat     RNAtype    Exp  Organ
BBS1    Bbs1    reg         7   Lung
ASAP2   Asap2   reg         5   Heart
ASAP2   Asap2   CANONICAL   5   Heart
ASAP2   Asap2   reg         6   Heart
ASAP2   Asap2   reg         8   Lung
ASAP3   Asap3   SCRAMBLED   5   Heart
ASAP3   Asap3   scram       8   Heart
ASAP3   Asap3   CANONICAL   5   Heart
ASAP3   Asap3   reg         5   Heart

Now I want to make subset of it in the following way. If a gene is present in both Lung and heart then I will include this in my subset. for example, here the gene in the first line, BBS1 is present in only Lung but not in Heart, so it won't be in the subset. But ASAP2 is present in both heart and lung. So, it will in the subset. ASAP3 won't be in the subset since it's not present in Lung.

Now the actual data file is pretty big. And there are lots of genes and the order of gene name coming in the first/second column is not ordered like the way here ie it can be ASAP2,ASAP3,ASAP3,BBS1,ASAP2, ASAP2.

Upvotes: 1

Views: 678

Answers (4)

David Arenburg
David Arenburg

Reputation: 92282

A data.table solution

library(data.table)
setDT(df)[, indx := length(unique(Organ)) >= 2, by = Rat][indx == TRUE]

##    Human   Rat   RNAtype Exp Organ indx
## 1: ASAP2 Asap2       reg   5 Heart TRUE
## 2: ASAP2 Asap2 CANONICAL   5 Heart TRUE
## 3: ASAP2 Asap2       reg   6 Heart TRUE
## 4: ASAP2 Asap2       reg   8  Lung TRUE

With the courtesy of @Arun, here are couple even cooler data.table solutions

setDT(df)[, .SD[length(unique(Organ)) >= 2L], by=Rat]

Or

setDT(df)[df[, .I[length(unique(Organ))>=2L], by=Rat]$V1]

Upvotes: 2

talat
talat

Reputation: 70256

If your Organ column only contains those two values Lung and Heart, here's another approach:

library(dplyr)

dfsub <- df %>%
  group_by(Human) %>%
  filter(length(unique(Organ)) >= 2)

#Source: local data frame [4 x 5]
#Groups: Human
#
#  Human   Rat   RNAtype Exp Organ
#1 ASAP2 Asap2       reg   5 Heart
#2 ASAP2 Asap2 CANONICAL   5 Heart
#3 ASAP2 Asap2       reg   6 Heart
#4 ASAP2 Asap2       reg   8  Lung

Upvotes: 2

Thomas
Thomas

Reputation: 44525

Here's another strategy:

> u <- with(df, by(Human, Organ, unique))
> df[df$Human %in% intersect(u[[1]], u[[2]]), ]
  Human   Rat   RNAtype Exp Organ
2 ASAP2 Asap2       reg   5 Heart
3 ASAP2 Asap2 CANONICAL   5 Heart
4 ASAP2 Asap2       reg   6 Heart
5 ASAP2 Asap2       reg   8  Lung

Upvotes: 3

Jota
Jota

Reputation: 17611

Try this:

# read in sample data
df<-read.table(text="
Human   Rat     RNAtype    Exp  Organ
BBS1    Bbs1    reg         7   Lung
ASAP2   Asap2   reg         5   Heart
ASAP2   Asap2   CANONICAL   5   Heart
ASAP2   Asap2   reg         6   Heart
ASAP2   Asap2   reg         8   Lung
ASAP3   Asap3   SCRAMBLED   5   Heart
ASAP3   Asap3   scram       8   Heart
ASAP3   Asap3   CANONICAL   5   Heart
ASAP3   Asap3   reg         5   Heart", header=TRUE)

merge(df[df$Organ=="Heart",], df[df$Organ=="Lung",], by=c("Human","Rat"))

  Human   Rat RNAtype.x Exp.x Organ.x RNAtype.y Exp.y Organ.y
1 ASAP2 Asap2       reg     5   Heart       reg     8    Lung
2 ASAP2 Asap2 CANONICAL     5   Heart       reg     8    Lung
3 ASAP2 Asap2       reg     6   Heart       reg     8    Lung

Alternatively:

genes2keep <- intersect(df[df$Organ=="Heart",]$Human, df[df$Organ=="Lung",]$Human)
df[df$Human %in% genes2keep, ]

  Human   Rat   RNAtype Exp Organ
2 ASAP2 Asap2       reg   5 Heart
3 ASAP2 Asap2 CANONICAL   5 Heart
4 ASAP2 Asap2       reg   6 Heart
5 ASAP2 Asap2       reg   8  Lung

Upvotes: 2

Related Questions