Reputation: 27
I have a large dataset with one column of gene names and four columns of detection methods (here called X1, X2, X3 and X4). I would like to select the rows for genes that were detected by at least two detection methods. An example of the table is:
Table:
Row Gene X1 X2 X3 X4
1 A 1 0 0 0
2 A 0 0 1 0
3 A 0 1 0 0
4 B 0 0 1 0
5 B 0 0 1 0
6 C 0 0 0 1
7 D 0 0 1 0
8 D 0 1 0 0
9 D 0 1 0 0
10 E 0 0 1 0
11 E 0 0 1 0
In summary, I want to select rows 1, 2 and 3 (methods X1, X2 and X3 detected gene A) and rows 7, 8 and 9 (methods X2 and X3 detected gene D).
Thanks for your help.
Upvotes: 2
Views: 596
Reputation: 2617
To show which genes were detected by two or more methods, this will work. If zz is your data.frame, then:
yy <- by(zz, zz$Gene, function(dat) {sum(apply(dat[,-c(1,2)], 2, any)) >= 2} )
zz[zz$Gene %in% names(which(yy)),]
# load the data:
zz <- read.table(header = TRUE, text = "
Row Gene X1 X2 X3 X4
1 A 1 0 0 0
2 A 0 0 1 0
3 A 0 1 0 0
4 B 0 0 1 0
5 B 0 0 1 0
6 C 0 0 0 1
7 D 0 0 1 0
8 D 0 1 0 0
9 D 0 1 0 0
10 E 0 0 1 0
11 E 0 0 1 0")
# now check, gene by gene, whether at least two columns have at least one 1.
# note that the repeated any() statements can be replaced by a loop or
# apply(), but for only four columns this works, is easy enough to type,
# and much easier to understand
yy <- by(zz, zz$Gene, function(dat) {(any(dat$X1) +
any(dat$X2) +
any(dat$X3) +
any(dat$X4) ) >= 2} )
# or, the apply way, in case there are a lot of columns.
# "-c(1,2)" as a column index means "every column except the first two",
# so if the data has 3, 4, or 30 methods, this code stays the same.
yy <- by(zz, zz$Gene, function(dat) {sum(apply(dat[,-c(1,2)], 2, any)) >= 2} )
yy
zz$Gene: A
[1] TRUE
---------------------------------------------------------------------------
zz$Gene: B
[1] FALSE
---------------------------------------------------------------------------
zz$Gene: C
[1] FALSE
---------------------------------------------------------------------------
zz$Gene: D
[1] TRUE
---------------------------------------------------------------------------
zz$Gene: E
[1] FALSE
Now find the rows that match the genes that got TRUE results. Take the gene names (A, B, C, ...) that correspond to yy values of TRUE, and index the data.frame zz based on that:
which(yy) # equivalent to which(yy == TRUE)
gives
A D
1 4
and
names(which(yy))
gives
[1] "A" "D"
so...
zz[zz$Gene %in% names(which(yy)),]
gives
Row Gene X1 X2 X3 X4
1 1 A 1 0 0 0
2 2 A 0 0 1 0
3 3 A 0 1 0 0
7 7 D 0 0 1 0
8 8 D 0 1 0 0
9 9 D 0 1 0 0
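The steps above can also be collected into one small helper. This is only a sketch in base R: the function name rows_by_min_methods and its arguments gene_col, method_cols and min_methods are made up for illustration and are not part of the answer above. It assumes the method columns hold only 0/1 values.

```r
# Rebuild the example data so the snippet runs on its own.
zz <- read.table(header = TRUE, text = "
Row Gene X1 X2 X3 X4
1 A 1 0 0 0
2 A 0 0 1 0
3 A 0 1 0 0
4 B 0 0 1 0
5 B 0 0 1 0
6 C 0 0 0 1
7 D 0 0 1 0
8 D 0 1 0 0
9 D 0 1 0 0
10 E 0 0 1 0
11 E 0 0 1 0")

# Hypothetical helper: keep the rows whose gene is detected by at least
# `min_methods` distinct method columns (assumes 0/1 entries).
rows_by_min_methods <- function(df, gene_col = "Gene",
                                method_cols = c("X1", "X2", "X3", "X4"),
                                min_methods = 2) {
  # For each gene, count how many method columns contain at least one 1.
  n_methods <- tapply(seq_len(nrow(df)), df[[gene_col]], function(idx) {
    sum(colSums(df[idx, method_cols, drop = FALSE]) > 0)
  })
  # Gene labels that reach the threshold.
  keep <- names(n_methods)[n_methods >= min_methods]
  df[df[[gene_col]] %in% keep, , drop = FALSE]
}

res <- rows_by_min_methods(zz)
res$Row
```

Raising min_methods to 3 would keep only gene A in this example, since gene D is detected by X2 and X3 only.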
Upvotes: 1
Reputation: 39737
You can use rowsum and rowSums to find the genes with more than one method, and %in% to find the matching rows.
x <- rowSums(rowsum(zz[3:6], zz[,2]) > 0) > 1
zz$Row[zz$Gene %in% names(x[x])]
#[1] 1 2 3 7 8 9
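For readers who want to see what each part of that one-liner does, here is the same idea unrolled into steps. This is a sketch only; the intermediate names per_gene, detected, n_methods and keep are introduced here for illustration, and the last line returns the full rows rather than just the Row column.

```r
# Rebuild the example data so the snippet runs on its own.
zz <- read.table(header = TRUE, text = "
Row Gene X1 X2 X3 X4
1 A 1 0 0 0
2 A 0 0 1 0
3 A 0 1 0 0
4 B 0 0 1 0
5 B 0 0 1 0
6 C 0 0 0 1
7 D 0 0 1 0
8 D 0 1 0 0
9 D 0 1 0 0
10 E 0 0 1 0
11 E 0 0 1 0")

# Step 1: sum each method column within each gene; row names become the genes.
per_gene <- rowsum(zz[3:6], zz$Gene)

# Step 2: TRUE where a method detected the gene at least once.
detected <- per_gene > 0

# Step 3: number of distinct methods per gene (here A = 3, D = 2, others 1).
n_methods <- rowSums(detected)

# Step 4: genes detected by more than one method.
keep <- names(n_methods)[n_methods > 1]

# Select the full rows for those genes.
zz[zz$Gene %in% keep, ]
```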
Upvotes: 2