Reputation: 1
I have the following data frame that I am trying to filter. I only want to retain the rows where at least one value in the row is greater than 0.5. Any help is appreciated. I tried to do the following but my system hangs:
gbpre.mat<-as.matrix(gbpre)
ind <- apply(gbpre.mat, 1, function(gbpre.mat) any(gbpre > 0.5))
PH1544_pre PH1545_pre PH1565_pre PH1571_pre PH1612_pre PH1616_pre
bg00050873 0.88235087 0.6053853 0.6521263 0.2770632 0.82596713 0.635325831
bg00212031 0.01175069 0.1844859 0.4345596 0.2186097 0.03717635 0.670305781
bg00213748 0.64571987 0.7316865 0.4345596 0.5613724 0.81309068 0.900878028
bg00214611 0.04405524 0.7103071 0.6810916 0.6526317 0.03412550 0.008187867
bg00455876 0.72122206 0.1272784 0.2155168 0.4794622 0.70089805 0.668497074
bg01707559 0.03592823 0.3548602 0.2743443 0.2194279 0.57761264 0.061564411
Reputation: 15907
The reason that your definition of ind does not work is that the function you apply never uses its argument; it uses the whole of gbpre instead. If your matrix is large, this is slow, because the entire matrix is checked again for each of its many rows.
To be more specific: This is your definition:
ind <- apply(gbpre.mat, 1, function(gbpre.mat) any(gbpre > 0.5))
You use apply over rows, which is fine. Then you define a function of one argument. The argument is called gbpre.mat, which is possible, but I would recommend that you don't give it the same name as the variable that you pass into the function. This would avoid some confusion. The function then does not even use gbpre.mat, so the result of the function is independent of its input. This is not what you want.
So you should rather use the following:
ind <- apply(gbpre.mat, 1, function(gb) any(gb > 0.5))
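Once you have the logical vector, you still need to subset the matrix with it to retain the rows. Here is a minimal sketch on a small made-up matrix; the names m, filtered and the values are only for illustration and are not from your data:

```r
# Hypothetical 4 x 3 example matrix (values are made up for illustration)
m <- matrix(c(0.9, 0.1, 0.2,
              0.1, 0.2, 0.3,
              0.4, 0.6, 0.1,
              0.0, 0.5, 0.2),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("row", 1:4), paste0("s", 1:3)))

# TRUE for rows that contain at least one value strictly greater than 0.5
ind <- apply(m, 1, function(r) any(r > 0.5))

# Keep only those rows; drop = FALSE keeps the result a matrix
# even if only a single row survives
filtered <- m[ind, , drop = FALSE]
rownames(filtered)
```

Note that the comparison is strict, so a row whose largest value is exactly 0.5 (row4 here) is dropped.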
This works, but what thelatemail has suggested is actually faster. Let me show you with an example. First, I create a fairly large sample matrix:
set.seed(1435)
gbpre.mat <- matrix(runif(600000,0,0.7), ncol = 6)
head(gbpre.mat)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.34588950 0.548891207 0.14621109 0.64827636 0.2132974880 0.08318449
## [2,] 0.08258421 0.504511182 0.15966061 0.65975977 0.0009340659 0.18353030
## [3,] 0.01970881 0.004321273 0.51373098 0.58779409 0.1166218414 0.55205101
## [4,] 0.16150403 0.134012891 0.19062268 0.68766140 0.4341565775 0.46083298
## [5,] 0.32099279 0.371436278 0.13317573 0.02674299 0.4670175053 0.47581938
## [6,] 0.50144544 0.579256903 0.03034916 0.56547615 0.0091638700 0.42943656
Then I use both approaches to find the rows where at least one number is larger than 0.5, and measure the time:
system.time(ind <- apply(gbpre.mat, 1, function(gb) any(gb > 0.5)))
## user system elapsed
## 0.218 0.008 0.228
system.time(ind2 <- rowSums(gbpre.mat > 0.5) > 0)
## user system elapsed
## 0.008 0.000 0.008
There is a clear winner here. The results are identical:
identical(ind, ind2)
## [1] TRUE
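Since you started from a data frame, the rowSums approach should also work without converting to a matrix first, as long as all columns are numeric. A small sketch under that assumption, with a made-up data frame df (not your data):

```r
# Hypothetical data frame with numeric columns only
df <- data.frame(a = c(0.9, 0.1, 0.4),
                 b = c(0.1, 0.2, 0.6),
                 row.names = c("p1", "p2", "p3"))

# df > 0.5 is a logical matrix; rowSums counts the TRUE values per row
keep <- rowSums(df > 0.5) > 0

# Retain only the rows with at least one value above 0.5
df_filtered <- df[keep, , drop = FALSE]
rownames(df_filtered)
```

If your data contain missing values, rowSums returns NA for the affected rows unless you pass na.rm = TRUE, so you may want to decide how those rows should be treated.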
I also want to add some clarification on why your code was slow. Let me just run your definition of ind over the first 600 rows of the matrix:
system.time(ind3 <- apply(gbpre.mat[1:600, ], 1, function(gb) any(gbpre.mat > 0.5)))
## user system elapsed
## 3.011 0.461 3.479
You see that I also use the whole matrix gbpre.mat inside the function. Running this over only 600 rows takes 3.5 seconds; extrapolating linearly to all 100,000 rows of the matrix gives roughly 580 seconds, so about ten minutes. And it would be wrong: you would get a vector of TRUE only, because you actually checked many times whether there is a single value larger than 0.5 somewhere in the entire matrix.
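You can see the all-TRUE effect on a tiny matrix. This is only an illustrative sketch; m, buggy and fixed are made-up names:

```r
# Hypothetical 3 x 2 matrix: only the first row has a value above 0.5
m <- matrix(c(0.9, 0.1,
              0.1, 0.2,
              0.3, 0.4),
            ncol = 2, byrow = TRUE)

# Ignores the argument r and tests the whole matrix for every row
buggy <- apply(m, 1, function(r) any(m > 0.5))

# Uses the argument r, so each row is tested on its own
fixed <- apply(m, 1, function(r) any(r > 0.5))

buggy
fixed
```

Here buggy is TRUE for every row because the single value 0.9 is found each time, while fixed is TRUE only for the first row.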
Upvotes: 2