Jack

Reputation: 165

How to subset data frames based on top quartile of each column?

#let's make some sample data first
names<- c("t1","t2","t3","t4","t5","t1","t2","t3","t4","t5","t1","t2","t3","t4","t5")
metric1_set1 <- c(2.5,3.1,4.5,2.5,12,7.1,8.5,10,10.1,17.8,12.3,11,10,14,1.5) 
metric1_set2 <- c(2.1,3.1,4.15,2.5,10,7.1,8.5,10,10.1,17.1,12.3,17.3,8,11,1.5) 
metric1_set3 <- c(12.1,13.1,4.15,2.5,10.5,7.1,2.5,10,7.1,11.1,12.3,17.3,8,1.45,1.5) 
dataset1 <- data.frame(names,metric1_set1,metric1_set2,metric1_set3)


names<- c("t1","t2","t3","t4","t5","t1","t2","t3","t4","t5","t1","t2","t3","t4","t5")
metric2_set1 <- c(21.5,13.1,4.5,2.5,12,7.1,8.5,10,10.1,17.8,12.3,11,10,14,1.5) 
metric2_set2 <- c(12.1,3.1,4.15,2.5,10,7.1,8.5,10,8.1,17.1,12.3,17.3,8,1.1,1.5) 
metric2_set3 <- c(2.1,13.1,4.15,2.5,10.5,7.1,21.5,10,7.1,11.1,12.3,12.3,8,1.45,1.5) 
dataset2 <- data.frame(names,metric2_set1,metric2_set2,metric2_set3)

Now the issue is to calculate the top quartile for each column of dataset1 and then pull out the corresponding names from dataset2. The idea is to get the correlation between these subsetted values.

quantiles <- apply(dataset1[2:4], 2, quantile, na.rm = TRUE)

This returns the quartiles of each column, but the actual question is how to keep the names associated with, say, the top quartile of one dataset and drop every other row from both datasets.

Based on what @sconfluentus suggested we can change it to:

 topQuartile <- function(x) {       # returns the 75th percentile of x
   y <- quantile(x, na.rm = TRUE)   # default probs: 0%, 25%, 50%, 75%, 100%
   z <- y[4]                        # fourth element is the 75% cutoff
   return(z)
 }
 quartile_dataset1 <- apply(dataset1[2:4], 2, topQuartile)

This works perfectly, but I also need something similar to the following:

 topquartile_set1 <- subset(dataset1$metric1_set1, subset = (dataset1$metric1_set1 >= quartile_dataset1[1]))

I need similar code that works for each column and puts all subsets together in a single final data frame.

Upvotes: 0

Views: 1607

Answers (2)

lebelinoz

Reputation: 5068

I would start by gathering the data using the tidyr package:

library(tidyr)
df.gathered = gather(dataset1, key = "category", value = "value", -names)

Result:

names  category    value
--------------------------
 t1 metric1_set1  2.50
 t2 metric1_set1  3.10
 t3 metric1_set1  4.50
 t4 metric1_set1  2.50
 t5 metric1_set1 12.00
 t1 metric1_set1  7.10
 t2 metric1_set1  8.50
 t3 metric1_set1 10.00
 t4 metric1_set1 10.10
 t5 metric1_set1 17.80 
 ...  # and similar rows for metric1_set2 and metric1_set3 ...

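You can then use group_by in dplyr to compute a quantile for each name and category. Note that quantile(value, 1) is the 100% quantile, i.e. the maximum of each group; use 0.75 instead if you want the upper-quartile cutoff: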

library(dplyr)
df.gathered %>% group_by(names, category) %>% summarise(Q1 = quantile(value, 1))

names   category    Q1
----------------------------
  t1 metric1_set1  12.3
  t1 metric1_set2  12.3
  t1 metric1_set3  12.3
  t2 metric1_set1  11.0
  t2 metric1_set2  17.3
  t2 metric1_set3  17.3
  ...
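
To go from per-group summaries to the actual subset the question asks for, one option is to filter the gathered data instead of summarising it. The following is only a sketch under assumptions that are mine, not the asker's: grouping is by category alone (so each original column gets its own cutoff), the cutoff is the 75th percentile, and the name top_quartile_long is made up for illustration.

```r
library(tidyr)
library(dplyr)

# Sample data copied from the question
dataset1 <- data.frame(
  names = rep(c("t1", "t2", "t3", "t4", "t5"), 3),
  metric1_set1 = c(2.5, 3.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric1_set2 = c(2.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 10.1, 17.1, 12.3, 17.3, 8, 11, 1.5),
  metric1_set3 = c(12.1, 13.1, 4.15, 2.5, 10.5, 7.1, 2.5, 10, 7.1, 11.1, 12.3, 17.3, 8, 1.45, 1.5)
)

# Long format, then keep only the rows at or above the 75th percentile
# of their own category (assumed definition of "top quartile")
top_quartile_long <- dataset1 %>%
  gather(key = "category", value = "value", -names) %>%
  group_by(category) %>%
  filter(value >= quantile(value, 0.75, na.rm = TRUE)) %>%
  ungroup()

print(top_quartile_long)
```

The result stays in long format, so every retained value keeps its name and its source column in one data frame.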

Upvotes: 0

sconfluentus

Reputation: 4993

The simplest way is to wrap quantile in a function, extract the fifth element of its result (the 100% quantile, i.e. the maximum; the 75% cutoff would be the fourth element), and pass that function to apply like so:

fifthQuantile<-function(x){
  y=quantile(x,  na.rm = TRUE )
  z=y[5]
  return(z)
}

 quantiles<- apply( dataset1[2:4] , 2 , fifthQuantile )

This returns a named numeric vector whose names are your old column names. If you would prefer a one-row layout with those names as columns, try:

quantiles<- t(apply( dataset1[2:4] , 2 , fifthQuantile ))

This gives you a one-row matrix, with the columns where they were in the original!
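
Once you have a cutoff per column, the subsetting step from the question could look roughly like the base R sketch below. It rests on assumptions of mine: "top quartile" means values at or above the 75th percentile, the rows of dataset1 and dataset2 are aligned one to one, and the names cutoffs, top_rows, top_quartile, value1 and value2 are invented for this example.

```r
# Sample data copied from the question
tnames <- rep(c("t1", "t2", "t3", "t4", "t5"), 3)
dataset1 <- data.frame(
  names = tnames,
  metric1_set1 = c(2.5, 3.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric1_set2 = c(2.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 10.1, 17.1, 12.3, 17.3, 8, 11, 1.5),
  metric1_set3 = c(12.1, 13.1, 4.15, 2.5, 10.5, 7.1, 2.5, 10, 7.1, 11.1, 12.3, 17.3, 8, 1.45, 1.5)
)
dataset2 <- data.frame(
  names = tnames,
  metric2_set1 = c(21.5, 13.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric2_set2 = c(12.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 8.1, 17.1, 12.3, 17.3, 8, 1.1, 1.5),
  metric2_set3 = c(2.1, 13.1, 4.15, 2.5, 10.5, 7.1, 21.5, 10, 7.1, 11.1, 12.3, 12.3, 8, 1.45, 1.5)
)

# 75th percentile of each metric column in dataset1
cutoffs <- sapply(dataset1[2:4], quantile, probs = 0.75, na.rm = TRUE)

# For each metric column, keep the rows at or above its cutoff and
# pull the same rows (assumed aligned) from the matching dataset2 column
top_rows <- lapply(seq_along(cutoffs), function(i) {
  keep <- which(dataset1[[i + 1]] >= cutoffs[[i]])
  data.frame(
    column = colnames(dataset1)[i + 1],
    names  = dataset1$names[keep],
    value1 = dataset1[[i + 1]][keep],
    value2 = dataset2[[i + 1]][keep]
  )
})

# Stack all per-column subsets into a single data frame
top_quartile <- do.call(rbind, top_rows)

# Correlation between the matched values, one per original column
correlations <- sapply(
  split(top_quartile, top_quartile$column),
  function(d) cor(d$value1, d$value2)
)

print(top_quartile)
print(correlations)
```

Stacking in long form sidesteps the problem that each column may select a different set of rows, which would otherwise leave gaps in a wide data frame.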

Upvotes: 0
