Reputation: 187
I have a data frame of ~14000 rows by ~580 columns. Each cell contains an expression value (RNA expression data). I converted each value of the df to a percentage based on the sum of its column.
Now the thing I'd like to do is to exclude rows for which all elements have a value lower than 0.005. Just to be clear, if all but one element have values lower than 0.005, the row will be kept.
I managed to perform this task by writing two nested loops that iterate through all rows and columns of the data frame, but it is very slow to complete.
Here is my code:
# Create empty data frame in which rows meeting criteria will be written.
df <- data.frame(matrix(ncol = ncol(tData2_perc), nrow = 0))
colnames(df) <- colnames(tData2_perc)
passed = 0
# Start loop. tData2_perc is the data frame containing all the perc. values.
for( i in 1:nrow(tData2_perc)){
  for( j in 1:ncol(tData2_perc)){
    # Threshold is 0.005, as described above.
    if(tData2_perc[i,j] >= 0.005){
      passed = 1
    }
  }
  if(passed == 1){
    df = rbind(df, tData2_perc[i,])
  }
  passed = 0
}
Is there a more elegant (and computationally faster?) way of doing this? I tried using apply, but couldn't find a way to implement it... Thanks!
Edit: Here is a subset of my data (dput() output):
structure(list(S002ED2S5MID86 = c(0.00506787330316742,0.000542986425339366,
0.000723981900452489, 0.0191855203619909, 0.00452488687782805,
0, 0, 0, 0, 0), AcBarrieBulk10120130703 = c(0.00729498574543015,
0.000419252054335066, 0.00117390575213819, 0.025071272849237,
0.00721113533456314, 0, 0, 0, 0, 0), PelisserRhizo30520130703 = c(0.0093628088426528,
0.00182054616384915, 0.00182054616384915, 0.0280884265279584,
0.00572171651495449, 0, 0, 0, 0, 0), S002F76S3MID96 = c(0.000578452639190166,
0.000144613159797542, 0.00101229211858279, 0.0190889370932755,
0.00289226319595083, 0, 0.000144613159797542, 0, 0.000144613159797542,
0), S002ED0S3MID102 = c(0.249181043896047, 0.0437504549756133,
0.118293659459853, 0.0249690616582951, 0.0470990754895538, 0,
0, 0.000218388294387421, 0, 0)), .Names = c("S002ED2S5MID86",
"AcBarrieBulk10120130703", "PelisserRhizo30520130703", "S002F76S3MID96",
"S002ED0S3MID102"), row.names = c(1L, 2L, 3L, 4L, 5L, 4001L,
4002L, 4003L, 4004L, 4005L), class = "data.frame")
Upvotes: 0
Views: 93
Reputation: 18416
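First make a dummy column that takes the pmax of all the other columns, then filter by that column. You can then delete the dummy column from both data frames.
tData2_perc$filt <- do.call(pmax, tData2_perc) # row-wise maximum across all columns
df <- tData2_perc[tData2_perc$filt >= 0.005, ] # keep rows with at least one value of 0.005 or greater
tData2_perc$filt <- NULL # delete the dummy column
df$filt <- NULL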
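The same filter can also be written without the temporary column, which makes it easy to check on a small example. This is only a sketch: `toy`, `row_max` and `kept` are made-up names, and the values are hypothetical rather than taken from the question.

```r
# Hypothetical toy data (not the asker's): rows 1 and 3 each contain
# at least one value of 0.005 or greater, row 2 does not.
toy <- data.frame(s1 = c(0.010, 0.001, 0.000),
                  s2 = c(0.002, 0.004, 0.005),
                  s3 = c(0.000, 0.003, 0.001))

# do.call() hands each column to pmax() as a separate argument,
# so the result is the maximum of every row.
row_max <- do.call(pmax, toy)

# A row is kept as soon as its largest value reaches the threshold.
kept <- toy[row_max >= 0.005, ]
print(rownames(kept))
```

Because the logical index is built directly from `row_max`, neither data frame ever gains a `filt` column that has to be removed afterwards.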
If you want to keep only rows in which more than one column meets the threshold, do the following.
Make a dummy column that counts the columns that meet (or don't meet) your criterion, then subset on that count.
tData2_perc$filt <- apply(tData2_perc, 1, function(x) sum(x >= 0.005)) # change >= to < if you want to invert the count
df <- tData2_perc[tData2_perc$filt >= 2, ] # the 2 is made up by me for the case of wanting 2 or more columns that are .005 or greater; change it for your needs
tData2_perc$filt <- NULL # delete the dummy column from both data frames
df$filt <- NULL
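The per-row count can also be obtained with `rowSums()` on the logical comparison, a different technique from the `apply()` call above that avoids the dummy column. A minimal sketch, assuming all columns are numeric; `toy`, `threshold`, `min_cols` and `n_pass` are hypothetical names and the data are made up:

```r
# Hypothetical toy data (not the asker's).
toy <- data.frame(s1 = c(0.010, 0.001, 0.006),
                  s2 = c(0.007, 0.004, 0.005),
                  s3 = c(0.000, 0.003, 0.001))

threshold <- 0.005  # cut-off from the question
min_cols  <- 2      # illustrative: number of columns that must reach it

# Comparing a numeric data frame with a scalar yields a logical matrix;
# rowSums() then counts the TRUE values in each row.
n_pass <- rowSums(toy >= threshold)

# Keep rows in which enough columns reach the threshold.
kept <- toy[n_pass >= min_cols, ]
print(n_pass)
print(rownames(kept))
```

Setting `min_cols` to 1 gives the "at least one value of 0.005 or greater" filter from the question.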
Upvotes: 1