Jinyong Huang

Reputation: 49

R loop for filtering each column

I have a gene expression data frame in which the column names are the samples and the row names are the genes. I want to know how many genes are left after filtering each column by a threshold. For example,

sample1_more_than_5 <- df[(df[,1]>5),]
sample1_more_than_10 <- df[(df[,1]>10),]
sample1_more_than_20 <- df[(df[,1]>20),]
sample1_more_than_30 <- df[(df[,1]>30),]

Then,

sample2_more_than_5 <- df[(df[,2]>5),]
sample2_more_than_10 <- df[(df[,2]>10),]
sample2_more_than_20 <- df[(df[,2]>20),]
sample2_more_than_30 <- df[(df[,2]>30),]

But I don't want to repeat this 100 times, as I have 100 samples. Can anyone write a loop for this situation? Thank you.

Upvotes: 3

Views: 732

Answers (2)

Here is a solution using two nested loops that calculates, for each sample (column), the number of genes (rows) whose value is greater than each threshold in the nums vector.

#Thresholds used to filter each column
nums <- c(5, 10, 20, 30)

#Loop over each column (sample)
resul <- apply(df, 2, function(x){
  #For each threshold, count the values in the column that exceed it
  #(sum() with na.rm = TRUE does not count missing values)
  sapply(nums, function(y){
    sum(x > y, na.rm = TRUE)
  })
})

#Convert to a data.frame with the thresholds in the first column
resul <- data.frame(greaterthan = nums,
                    as.data.frame(resul))
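
To check the counts on something small, the same idea can be sketched with colSums on a made-up data frame. The toy df, its values, and the counts name are assumptions for illustration only, not the OP's data.

```r
# Toy data (hypothetical): 5 genes x 2 samples
df <- data.frame(sample1 = c(3, 8, 15, 25, 40),
                 sample2 = c(6, 6, 12, 31, 50),
                 row.names = paste0("gene", 1:5))
nums <- c(5, 10, 20, 30)

# For each threshold, count per column how many values exceed it;
# df > y is a logical matrix, so colSums() counts the TRUEs per sample
counts <- sapply(nums, function(y) colSums(df > y, na.rm = TRUE))
colnames(counts) <- paste0("more_than_", nums)
counts
```

Here the samples end up in the rows and the thresholds in the columns; t(counts) gives the same layout as resul.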

Upvotes: 3

akrun

Reputation: 886948

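We can loop over the columns and create the grouping with cut. Note that cut makes right-closed bins, (5,10], (10,20], (20,30], and values outside the breaks become NA and are dropped by split, so Inf is added as a last break to keep the values above 30

lst1 <- lapply(df, function(x) split(x, cut(x, breaks = c(5, 10, 20, 30, Inf))))

or use findInterval and then split. Here the bins are left-closed and coded 0 to 4, where 0 holds the values below 5

lst1 <- lapply(df, function(x) split(x, findInterval(x,  c(5, 10, 20, 30))))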
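
As a rough sketch of how the two binning functions behave, here they are on a single made-up vector. The values in x, the extra Inf break, and the names brks, bins_cut, bins_fi and more_than are assumptions for illustration. Because the bins do not overlap, the "more than" counts the OP asked for have to be accumulated from the top bin down.

```r
x <- c(3, 8, 15, 25, 40)   # hypothetical expression values for one sample
brks <- c(5, 10, 20, 30)

# cut(): right-closed bins; the extra Inf break keeps the values above 30,
# and values at or below the first break become NA
bins_cut <- cut(x, breaks = c(brks, Inf))
table(bins_cut)

# findInterval(): left-closed bins coded 0..4; 0 means below the first break
bins_fi <- findInterval(x, brks)
table(bins_fi)

# Cumulative "more than" counts from the cut() bins
counts_per_bin <- as.vector(table(bins_cut))
more_than <- rev(cumsum(rev(counts_per_bin)))
names(more_than) <- paste0("more_than_", brks)
more_than
```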

If we go by the way the objects are created in the OP's post, there would be 100 * 4, i.e. 400, objects (100 columns times 4 thresholds) in the global environment. Instead, everything can be kept in a single list object.

The individual objects can be created, but it is not recommended:

v1 <- c(5, 10, 20, 30)
v2 <- seq_along(df)
for(i in v2) {
  for(j in v1) {
    assign(sprintf('sample%d_more_than_%d', i, j),
           value = df[df[,i] > j,, drop = FALSE])
  }
}
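
A minimal sketch of the single-list alternative, assuming a made-up df; the name lst and the more_than_ element names are hypothetical. Each sample gets one list element holding the filtered data frames, so nothing is scattered across the global environment.

```r
# Toy data (hypothetical): 5 genes x 2 samples
df <- data.frame(sample1 = c(3, 8, 15, 25, 40),
                 sample2 = c(6, 6, 12, 31, 50),
                 row.names = paste0("gene", 1:5))
v1 <- c(5, 10, 20, 30)

# One list element per sample; each holds one filtered data frame per threshold.
# which() drops rows where the comparison is NA.
lst <- lapply(seq_along(df), function(i) {
  out <- lapply(v1, function(j) df[which(df[[i]] > j), , drop = FALSE])
  setNames(out, paste0("more_than_", v1))
})
names(lst) <- names(df)

# Number of genes left for each sample/threshold combination
sapply(lst, function(s) sapply(s, nrow))
```

A single filtered data frame is then available as, for example, lst$sample1$more_than_10.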

Upvotes: 3
