Reputation: 165
#let's make some sample data first
names<- c("t1","t2","t3","t4","t5","t1","t2","t3","t4","t5","t1","t2","t3","t4","t5")
metric1_set1 <- c(2.5,3.1,4.5,2.5,12,7.1,8.5,10,10.1,17.8,12.3,11,10,14,1.5)
metric1_set2 <- c(2.1,3.1,4.15,2.5,10,7.1,8.5,10,10.1,17.1,12.3,17.3,8,11,1.5)
metric1_set3 <- c(12.1,13.1,4.15,2.5,10.5,7.1,2.5,10,7.1,11.1,12.3,17.3,8,1.45,1.5)
dataset1 <- data.frame(names,metric1_set1,metric1_set2,metric1_set3)
names<- c("t1","t2","t3","t4","t5","t1","t2","t3","t4","t5","t1","t2","t3","t4","t5")
metric2_set1 <- c(21.5,13.1,4.5,2.5,12,7.1,8.5,10,10.1,17.8,12.3,11,10,14,1.5)
metric2_set2 <- c(12.1,3.1,4.15,2.5,10,7.1,8.5,10,8.1,17.1,12.3,17.3,8,1.1,1.5)
metric2_set3 <- c(2.1,13.1,4.15,2.5,10.5,7.1,21.5,10,7.1,11.1,12.3,12.3,8,1.45,1.5)
dataset2 <- data.frame(names,metric2_set1,metric2_set2,metric2_set3)
Now the issue is to calculate the top quartile for each column of dataset1 and then pull out the corresponding names from dataset2. The idea is to get the correlation between these subsetted values.
quantiles <- apply(dataset1[2:4], 2, quantile, na.rm = TRUE)
This obtains the quartiles, but the actual question is how to save the names associated with, let's say, the top quartile of one dataset and drop every other row from both datasets.
Based on what @sconfluentus suggested we can change it to:
topQuartile <- function(x) {
  # quantile() returns the 0%, 25%, 50%, 75% and 100% quantiles, in that order
  y <- quantile(x, na.rm = TRUE)
  z <- y[4]  # the 75% quantile, i.e. the cutoff for the top quartile
  return(z)
}
quartile_dataset1 <- apply(dataset1[2:4], 2, topQuartile)
This works, but I also need something similar to the following:
topquartile_set1 <- subset(dataset1$metric1_set1, subset = (dataset1$metric1_set1 >= quartile_dataset1[1]))
I need similar code that works for each column and puts all subsets together in a single final data frame.
Upvotes: 0
Views: 1607
Reputation: 5068
I would start by gathering the data using the tidyr package:
library(tidyr)
df.gathered = gather(dataset1, key = "category", value = "value", -names)
Result:
names category value
--------------------------
t1 metric1_set1 2.50
t2 metric1_set1 3.10
t3 metric1_set1 4.50
t4 metric1_set1 2.50
t5 metric1_set1 12.00
t1 metric1_set1 7.10
t2 metric1_set1 8.50
t3 metric1_set1 10.00
t4 metric1_set1 10.10
t5 metric1_set1 17.80
... # and similar rows for metric1_set2 and metric1_set3 ...
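As a side note, gather() still works but tidyr now recommends pivot_longer() for new code. A minimal sketch of the equivalent reshaping, using a hypothetical two-row excerpt of dataset1 (the name small is only for illustration); note that pivot_longer() orders the result row by row rather than column by column:

```r
library(tidyr)

# Hypothetical excerpt: the first two rows and two metric columns of dataset1
small <- data.frame(
  names = c("t1", "t2"),
  metric1_set1 = c(2.5, 3.1),
  metric1_set2 = c(2.1, 3.1)
)

# Every column except `names` is stacked into category/value pairs
long <- pivot_longer(small, cols = -names,
                     names_to = "category", values_to = "value")
print(long)
```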
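You can then use the group_by feature in dplyr to get the top quantile for each name and category. Note that quantile(value, 1) is the 100% quantile, i.e. the maximum of each group; pass 0.75 instead of 1 if you want the upper-quartile cutoff: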
library(dplyr)
df.gathered %>% group_by(names, category) %>% summarise(Q1 = quantile(value, 1))
names category Q1
----------------------------
t1 metric1_set1 12.3
t1 metric1_set2 12.3
t1 metric1_set3 12.3
t2 metric1_set1 11.0
t2 metric1_set2 17.3
t2 metric1_set3 17.3
...
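If the goal is to keep the rows that fall in the top quartile of each column, rather than one summary value per group, the same long format can be filtered instead of summarised. This is only a sketch: it groups by category alone, takes the 75% quantile as the cutoff, and adds a helper column row (not in the original data) so the matching rows of dataset2 can be looked up by position afterwards.

```r
library(tidyr)
library(dplyr)

# dataset1 from the question
names <- rep(paste0("t", 1:5), times = 3)
dataset1 <- data.frame(
  names,
  metric1_set1 = c(2.5, 3.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric1_set2 = c(2.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 10.1, 17.1, 12.3, 17.3, 8, 11, 1.5),
  metric1_set3 = c(12.1, 13.1, 4.15, 2.5, 10.5, 7.1, 2.5, 10, 7.1, 11.1, 12.3, 17.3, 8, 1.45, 1.5)
)

top_quartile <- dataset1 %>%
  mutate(row = row_number()) %>%                 # remember the original row position
  gather(key = "category", value = "value", -names, -row) %>%
  group_by(category) %>%
  filter(value >= quantile(value, 0.75, na.rm = TRUE)) %>%  # keep the top quartile per column
  ungroup()

print(top_quartile)
```

The row column can then be used to index dataset2, assuming both data frames list their observations in the same order.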
Upvotes: 0
Reputation: 4993
The simplest way is to build a function around quantile, extract the fifth element of its result inside that function (the 100% quantile, i.e. the maximum) and pass the function to apply like so:
fifthQuantile<-function(x){
y=quantile(x, na.rm = TRUE )
z=y[5]
return(z)
}
quantiles<- apply( dataset1[2:4] , 2 , fifthQuantile )
This returns a named numeric vector whose names are your old column names. If you would prefer a single row with one column per metric, try:
quantiles<- t(apply( dataset1[2:4] , 2 , fifthQuantile ))
This gives you a one-row matrix, with the columns where they were in the original!
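To go from cutoffs to the subsets the question asks for, one possible base R sketch follows. It makes several assumptions that are not stated in the thread: the cutoff is the 75% quantile, the rows of dataset1 and dataset2 correspond by position (the names repeat, so they cannot serve as unique keys), and column i of one dataset pairs with column i of the other. The names cutoffs, top_rows, top_quartile and correlations are introduced here for illustration.

```r
# Sample data from the question
names <- rep(paste0("t", 1:5), times = 3)
dataset1 <- data.frame(
  names,
  metric1_set1 = c(2.5, 3.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric1_set2 = c(2.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 10.1, 17.1, 12.3, 17.3, 8, 11, 1.5),
  metric1_set3 = c(12.1, 13.1, 4.15, 2.5, 10.5, 7.1, 2.5, 10, 7.1, 11.1, 12.3, 17.3, 8, 1.45, 1.5)
)
dataset2 <- data.frame(
  names,
  metric2_set1 = c(21.5, 13.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric2_set2 = c(12.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 8.1, 17.1, 12.3, 17.3, 8, 1.1, 1.5),
  metric2_set3 = c(2.1, 13.1, 4.15, 2.5, 10.5, 7.1, 21.5, 10, 7.1, 11.1, 12.3, 12.3, 8, 1.45, 1.5)
)

# 75% cutoff for each metric column of dataset1
cutoffs <- sapply(dataset1[2:4], quantile, probs = 0.75, na.rm = TRUE)

# For each column, keep the rows at or above the cutoff and take the
# same row positions from the matching column of dataset2
top_rows <- lapply(seq_along(cutoffs), function(i) {
  keep <- which(dataset1[[i + 1]] >= cutoffs[i])
  data.frame(
    set = i,
    row = keep,
    names = dataset1$names[keep],
    metric1 = dataset1[[i + 1]][keep],
    metric2 = dataset2[[i + 1]][keep]
  )
})

# Stack all subsets into a single data frame
top_quartile <- do.call(rbind, top_rows)

# Correlation between the two datasets within each set
correlations <- sapply(split(top_quartile, top_quartile$set),
                       function(d) cor(d$metric1, d$metric2))

print(top_quartile)
print(correlations)
```

With only a handful of rows per set, the resulting correlations are computed on very few points, so they should be read with caution.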
Upvotes: 0