Niccola Tartaglia

Reputation: 1667

Filter/Subset R dataframe based on other dataframe

I have two dataframes: one with a correlation matrix and one with the corresponding p-values (both are ordered the same way, so each position in one dataframe corresponds to the same variable in the other). I would like to filter the elements of the corr dataframe based on 2 conditions:

1) Compare each column to the element in the first row of that column; elements that are smaller should become NA (or 0).

2) If the corresponding element in the 2nd dataframe is larger than, say, 0.05, the element should also become NA (or 0).

An example would look like this:

set.seed(2)
a <- sample(runif(5), replace = TRUE)
b <- sample(runif(5), replace = TRUE)
c <- sample(runif(5), replace = TRUE)
corr_mat <- data.frame(a, b, c)

a <- sample(runif(5, 0, 0.1), replace = TRUE)
b <- sample(runif(5, 0, 0.1), replace = TRUE)
c <- sample(runif(5, 0, 0.1), replace = TRUE)
p_values <- data.frame(a, b, c)

So I would like to subset corr_mat so that in column a only those values remain that are at least as large as the value in row 1 of column a AND whose corresponding p-value in p_values is smaller than 0.05, and likewise for the other columns.

This is the output I would like to have for these values:

 > corr_mat
      a         b         c
 1 0.9438393 0.4052822 0.8368892
 2 0.1848823 0.4052822 0.6618988
 3 0.9438393 0.2388948 0.3875495
 4 0.5733263 0.7605133 0.3472722
 5 0.5733263 0.5526741 0.6618988

 > p_values
        a          b          c
 1 0.086886104 0.01632009 0.02754012
 2 0.051428176 0.09440418 0.09297202
 3 0.016464224 0.02970107 0.02754012
 4 0.086886104 0.01150841 0.09297202
 5 0.001041453 0.09440418 0.09297202

Target output (based on 1st condition, bigger than or equal to first row value for each column):

 > corr_mat
      a         b         c
 1 0.9438393 0.4052822 0.8368892
 2           0.4052822  
 3 0.9438393            
 4           0.7605133  
 5           0.5526741  

Target output (based on both conditions, now also excluding the elements whose corresponding p-values are larger than 0.05):

 > corr_mat
      a         b         c
 1 0.9438393 0.4052822 0.8368892
 2             
 3 0.9438393            
 4           0.7605133  
 5             

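I was thinking of something along the lines of:

 mapply(comp, corr_mat, p_values)

where comp takes one column of corr_mat and the matching column of p_values, and sets an element to NA if it is smaller than row 1 of its column or if the corresponding element in p_values is larger than 0.05.

comp <- function(x, p) {
  # x: one column of corr_mat; p: the matching column of p_values
  first <- x[1]  # keep the reference value before anything is overwritten
  for (i in seq_along(x)) {
    # drop the value if it is below the first row or not significant
    if (x[i] < first || p[i] > 0.05) {
      x[i] <- NA
    }
  }
  x  # return the modified column
}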

Upvotes: 1

Views: 305

Answers (2)

akrun

Reputation: 886948

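We could also do this by multiplying corr_mat by NA raised to the power of the logical condition. NA^FALSE is 1, so those values are kept, and NA^TRUE is NA, so those values are dropped: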

corr_mat *NA^(corr_mat < corr_mat[1,][col(corr_mat)] | p_values > 0.05 )
#         a         b         c
#1        NA 0.4052822 0.8368892
#2        NA        NA        NA
#3 0.9438393        NA        NA
#4        NA 0.7605133        NA
#5        NA        NA        NA

Or just assign

corr_mat[corr_mat < corr_mat[1,][col(corr_mat)] | p_values > 0.05] <- NA
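To make the masking step easier to follow, here is a minimal sketch that builds the logical mask explicitly. It assumes the data frames are typed in from the values printed in the question (rather than regenerated with set.seed), and it swaps in sweep() for the col() indexing. The names below_first, not_signif, drop_mask, filtered and by_product are only illustrative.

```r
# Data typed in from the corr_mat and p_values printed in the question
# (an assumption made so the example does not depend on the RNG).
corr_mat <- data.frame(
  a = c(0.9438393, 0.1848823, 0.9438393, 0.5733263, 0.5733263),
  b = c(0.4052822, 0.4052822, 0.2388948, 0.7605133, 0.5526741),
  c = c(0.8368892, 0.6618988, 0.3875495, 0.3472722, 0.6618988)
)
p_values <- data.frame(
  a = c(0.086886104, 0.051428176, 0.016464224, 0.086886104, 0.001041453),
  b = c(0.01632009, 0.09440418, 0.02970107, 0.01150841, 0.09440418),
  c = c(0.02754012, 0.09297202, 0.02754012, 0.09297202, 0.09297202)
)

# Condition 1: compare every column with the value in its first row.
m <- as.matrix(corr_mat)
below_first <- sweep(m, 2, m[1, ], FUN = "<")

# Condition 2: the matching p-value is above the threshold.
not_signif <- as.matrix(p_values) > 0.05

# An element is dropped if either condition holds.
drop_mask <- below_first | not_signif

# Variant 1: assign NA through the logical matrix.
filtered <- corr_mat
filtered[drop_mask] <- NA

# Variant 2: multiply by NA^mask (1 where the mask is FALSE, NA where it is TRUE).
by_product <- corr_mat * NA^drop_mask
```

Both variants should leave the same four values in place as the output shown above. Keeping the mask in its own variable also makes it possible to inspect each condition separately.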

Upvotes: 1

Ronak Shah

Reputation: 388817

We can use mapply to apply both conditions in one go using replace. We replace the values that satisfy either condition with NA.

mapply(function(x, y) replace(x, (x < x[1]) | (y > 0.05), NA),corr_mat, p_values)


#             a         b         c
#[1,]        NA 0.4052822 0.8368892
#[2,]        NA        NA        NA
#[3,] 0.9438393        NA        NA
#[4,]        NA 0.7605133        NA
#[5,]        NA        NA        NA
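Note that mapply simplifies its result to a matrix, which is why the rows are labelled [1,], [2,] and so on. If a data frame is needed, one possible variant (a sketch, not the only way) is to assign the result of Map back into the columns. The helper name mask_column and its alpha argument are hypothetical, and the data is typed in from the values printed in the question.

```r
# Data typed in from the corr_mat and p_values printed in the question
# (an assumption made so the example does not depend on the RNG).
corr_mat <- data.frame(
  a = c(0.9438393, 0.1848823, 0.9438393, 0.5733263, 0.5733263),
  b = c(0.4052822, 0.4052822, 0.2388948, 0.7605133, 0.5526741),
  c = c(0.8368892, 0.6618988, 0.3875495, 0.3472722, 0.6618988)
)
p_values <- data.frame(
  a = c(0.086886104, 0.051428176, 0.016464224, 0.086886104, 0.001041453),
  b = c(0.01632009, 0.09440418, 0.02970107, 0.01150841, 0.09440418),
  c = c(0.02754012, 0.09297202, 0.02754012, 0.09297202, 0.09297202)
)

# Hypothetical helper: blank values below the first element of the column
# or with a p-value above alpha.
mask_column <- function(x, p, alpha = 0.05) {
  replace(x, (x < x[1]) | (p > alpha), NA)
}

# mapply() simplifies the per-column results to a numeric matrix.
as_matrix <- mapply(mask_column, corr_mat, p_values)

# Map() returns a list of columns; assigning into `[]` keeps the data frame.
as_df <- corr_mat
as_df[] <- Map(mask_column, corr_mat, p_values)
```

Here as_matrix is a matrix while as_df stays a data frame with the original column names, so it can be used anywhere corr_mat was used.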

Upvotes: 2
