user2615132

Reputation: 1

Filter data frame to retain rows that meet certain criteria

I have the following data frame that I am trying to filter. I only want to retain the rows where at least one value in the row is greater than 0.5. Any help is appreciated. I tried to do the following but my system hangs:

gbpre.mat<-as.matrix(gbpre)
ind <- apply(gbpre.mat, 1, function(gbpre.mat) any(gbpre > 0.5))

           PH1544_pre PH1545_pre PH1565_pre PH1571_pre PH1612_pre  PH1616_pre
bg00050873 0.88235087  0.6053853  0.6521263  0.2770632 0.82596713 0.635325831
bg00212031 0.01175069  0.1844859  0.4345596  0.2186097 0.03717635 0.670305781
bg00213748 0.64571987  0.7316865  0.4345596  0.5613724 0.81309068 0.900878028
bg00214611 0.04405524  0.7103071  0.6810916  0.6526317 0.03412550 0.008187867
bg00455876 0.72122206  0.1272784  0.2155168  0.4794622 0.70089805 0.668497074
bg01707559 0.03592823  0.3548602  0.2743443  0.2194279 0.57761264 0.061564411

Upvotes: 0

Views: 87

Answers (1)

Stibu

Reputation: 15907

The reason that your definition of ind does not work is that in the function you apply, you are not using the argument of the function, but rather the whole of gbpre. If your matrix is large, this might be slow, because for each of the many rows of the matrix the entire large matrix is checked.

To be more specific: This is your definition:

ind <- apply(gbpre.mat, 1, function(gbpre.mat) any(gbpre > 0.5))

You use apply over rows, which is fine. Then you define a function of one argument. The argument is called gbpre.mat, which is allowed, but I would recommend not giving it the same name as the variable you pass into the function, to avoid confusion. The function body then never uses gbpre.mat at all, so the result of the function is independent of its input. This is not what you want.
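To see the effect on a small case, here is a minimal sketch. The 2x2 matrix m and the names ignores_arg and uses_arg are made up for illustration and are not part of your code:

```r
# Hypothetical toy matrix: only the first row contains a value above 0.5
m <- matrix(c(0.1, 0.9,
              0.2, 0.3), nrow = 2, byrow = TRUE)

# The function ignores its argument and checks the whole of m for every row
ignores_arg <- apply(m, 1, function(row) any(m > 0.5))

# The function uses its argument, so each row is checked on its own
uses_arg <- apply(m, 1, function(row) any(row > 0.5))

print(ignores_arg)
print(uses_arg)
```

The first version gives the same answer for every row, because that answer only depends on the whole matrix. The second version differs per row.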

So you should rather use the following:

ind <- apply(gbpre.mat, 1, function(gb) any(gb > 0.5))

This works, but what thelatemail has suggested is actually faster. Let me show you with an example. First, I create a fairly large sample matrix:

set.seed(1435)
gbpre.mat <- matrix(runif(600000,0,0.7), ncol = 6)
head(gbpre.mat)
##            [,1]        [,2]       [,3]       [,4]         [,5]       [,6]
## [1,] 0.34588950 0.548891207 0.14621109 0.64827636 0.2132974880 0.08318449
## [2,] 0.08258421 0.504511182 0.15966061 0.65975977 0.0009340659 0.18353030
## [3,] 0.01970881 0.004321273 0.51373098 0.58779409 0.1166218414 0.55205101
## [4,] 0.16150403 0.134012891 0.19062268 0.68766140 0.4341565775 0.46083298
## [5,] 0.32099279 0.371436278 0.13317573 0.02674299 0.4670175053 0.47581938
## [6,] 0.50144544 0.579256903 0.03034916 0.56547615 0.0091638700 0.42943656

Then I use both approaches to find the rows where at least one number is larger than 0.5, and measure the time:

system.time(ind <- apply(gbpre.mat, 1, function(gb) any(gb > 0.5)))
##    user  system elapsed 
##   0.218   0.008   0.228 
system.time(ind2 <- rowSums(gbpre.mat > 0.5) > 0)
##    user  system elapsed 
##   0.008   0.000   0.008

There is a clear winner here. The results are identical:

identical(ind, ind2)
## [1] TRUE
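Either index can then be used to do the actual filtering. As a sketch, assuming your data frame is called gbpre and contains only numeric columns, I rebuild the first three columns of your sample rows by hand (values rounded) and subset them. The names keep and filtered are my own choice:

```r
# First three columns of the sample rows from the question,
# typed in by hand and rounded, for illustration only
gbpre <- data.frame(
  PH1544_pre = c(0.882, 0.012, 0.646, 0.044, 0.721, 0.036),
  PH1545_pre = c(0.605, 0.184, 0.732, 0.710, 0.127, 0.355),
  PH1565_pre = c(0.652, 0.435, 0.435, 0.681, 0.216, 0.274),
  row.names = c("bg00050873", "bg00212031", "bg00213748",
                "bg00214611", "bg00455876", "bg01707559")
)

# Comparing a numeric data frame with a scalar yields a logical matrix,
# so rowSums() counts the values above 0.5 in each row
keep <- rowSums(gbpre > 0.5) > 0

# Use the logical vector as a row index to retain the matching rows
filtered <- gbpre[keep, ]
print(rownames(filtered))
```

With only these three columns, the rows bg00212031 and bg01707559 have no value above 0.5 and are dropped. With all six columns from your post, every sample row would be kept.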

I also want to add some clarification on why your code was slow. Let me just run your definition of ind over the first 600 rows of the matrix:

system.time(ind3 <- apply(gbpre.mat[1:600, ], 1, function(gb) any(gbpre.mat > 0.5)))
##    user  system elapsed 
##   3.011   0.461   3.479 

You see that I also use the whole matrix gbpre.mat inside the function. Running this over only 600 rows takes 3.5 seconds, so if the time scales linearly, the calculation for all 100,000 rows would take roughly ten minutes (100,000 / 600 × 3.5 s ≈ 580 s). And it would be wrong: you would get a vector of TRUE only, because you would be checking over and over whether there is a single value larger than 0.5 somewhere in the entire matrix.

Upvotes: 2
