julio514

Reputation: 187

R: How to select rows based on a criterion applied to every cell of each row

I have a data frame of ~14000 rows by ~580 columns. Each cell contains an expression value (RNA expression data). I converted each value of the df to a percentage of its column sum.

I would now like to exclude rows in which every element is lower than 0.005. To be clear, if all but one element are lower than 0.005, the row is kept.

I managed to do this with two nested loops that iterate over all rows and columns of the data frame, but it is very slow to complete.

Here is my code:

  # Create empty data frame in which rows meeting criteria will be written.
  df <- data.frame(matrix(ncol = ncol(tData2_perc), nrow = 0))
  colnames(df) <- colnames(tData2_perc)
  passed = 0
  # Start loop. tData2_perc is the data frame containing all the perc. values.
  for( i in 1:nrow(tData2_perc)){
     for( j in 1:ncol(tData2_perc)){
        if(tData2_perc[i,j] >= 0.005){
           passed = 1
        }
     }
     if(passed == 1){
        df = rbind(df, tData2_perc[i,])
     }
     passed = 0
  }

Is there a more elegant (and computationally faster?) way of doing this? I tried using apply, but couldn't find a way to implement it... Thanks!

Edit: Here is a subset of my data (dput() output):

structure(list(S002ED2S5MID86 = c(0.00506787330316742,0.000542986425339366, 
0.000723981900452489, 0.0191855203619909, 0.00452488687782805, 
0, 0, 0, 0, 0), AcBarrieBulk10120130703 = c(0.00729498574543015, 
0.000419252054335066, 0.00117390575213819, 0.025071272849237, 
0.00721113533456314, 0, 0, 0, 0, 0), PelisserRhizo30520130703 =     c(0.0093628088426528, 
0.00182054616384915, 0.00182054616384915, 0.0280884265279584, 
0.00572171651495449, 0, 0, 0, 0, 0), S002F76S3MID96 =  c(0.000578452639190166, 
0.000144613159797542, 0.00101229211858279, 0.0190889370932755, 
0.00289226319595083, 0, 0.000144613159797542, 0, 0.000144613159797542, 
0), S002ED0S3MID102 = c(0.249181043896047, 0.0437504549756133, 
0.118293659459853, 0.0249690616582951, 0.0470990754895538, 0, 
0, 0.000218388294387421, 0, 0)), .Names = c("S002ED2S5MID86", 
"AcBarrieBulk10120130703", "PelisserRhizo30520130703", "S002F76S3MID96", 
"S002ED0S3MID102"), row.names = c(1L, 2L, 3L, 4L, 5L, 4001L, 
4002L, 4003L, 4004L, 4005L), class = "data.frame")

Upvotes: 0

Views: 93

Answers (1)

Dean MacGregor

Reputation: 18416

First make a dummy column that holds the row-wise maximum (pmax) of all the columns. Then filter on that column. You can then delete the dummy column.

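tData2_perc$filt <- do.call(pmax, tData2_perc) # row-wise maximum across all columns
df <- tData2_perc[tData2_perc$filt >= 0.005, ] # keep rows whose maximum reaches the threshold
tData2_perc$filt <- NULL # delete the dummy column from the original
df$filt <- NULL # and from the filtered copy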
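
The same filter can also be written without a temporary column. This is only a minimal sketch, assuming every column is numeric and contains no NA; the toy data frame `toy` and the names `threshold`, `row_max` and `kept` are made up for illustration and are not part of the question's data.

```r
# Toy data standing in for tData2_perc (hypothetical values).
toy <- data.frame(a = c(0.006, 0.001, 0),
                  b = c(0.001, 0.002, 0),
                  c = c(0.000, 0.005, 0))
threshold <- 0.005

# Row-wise maximum across all columns, computed in one vectorised call.
row_max <- do.call(pmax, toy)

# Keep a row when at least one value reaches the threshold.
kept <- toy[row_max >= threshold, , drop = FALSE]
```

Here the first two rows are kept and the all-zero row is dropped. Because the maximum lives in a separate vector, nothing has to be deleted from either data frame afterwards.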

If you want to keep a row only when more than one of its values meets the threshold, do the following.

Make a dummy column that counts the columns that meet (or don't meet) your criterion. Then subset based on that count.

tData2_perc$filt <- apply(tData2_perc, 1, function(x) sum(x >= 0.005)) # count of values per row that are 0.005 or greater; flip the comparison to count the values below the threshold instead
df <- tData2_perc[tData2_perc$filt >= 2, ] # the 2 is an arbitrary example (keep rows with 2 or more such values); change it for your needs
tData2_perc$filt <- NULL # delete the dummy column from the original
df$filt <- NULL # and from the filtered copy
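
The per-row count can probably be obtained faster with rowSums than with apply, since comparing a data frame to a number gives a logical matrix that rowSums can total directly. A hedged sketch under the same assumptions (all columns numeric, no NA); `toy`, `threshold`, `min_hits` and the result names are hypothetical.

```r
# Toy data standing in for tData2_perc (hypothetical values).
toy <- data.frame(a = c(0.006, 0.001, 0.007),
                  b = c(0.001, 0.002, 0.009),
                  c = c(0.000, 0.005, 0.000))
threshold <- 0.005
min_hits <- 2  # hypothetical: minimum number of values that must reach the threshold

# toy >= threshold is a logical matrix; rowSums counts the TRUEs in each row.
hits <- rowSums(toy >= threshold)

# Rows with at least one value at or above the threshold (the original question).
kept_any <- toy[hits >= 1, , drop = FALSE]

# Rows with at least min_hits such values (the stricter variant).
kept_multi <- toy[hits >= min_hits, , drop = FALSE]
```

With this toy data every row has at least one qualifying value, so kept_any keeps all three rows, while kept_multi keeps only the third.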

Upvotes: 1
