Neil
Neil

Reputation: 8247

how to find quantile of grouped variable in dplyr

I have following dataframe in r

No.        Key             Category
1          ABC123           0R1D
2          ABC567           0R1D
3          DEF444           1R1D
4          FRT433           1R1D
5          FRT433           1R1D
6          TYU412           2R2D
7          BEC123           0R1D
8          BCY567           0R1D
9          DEO444           1R1D
10         FRJ433           1R1D
11         FRK433           1R1D
12         TYL412           2R2D

I want to find unique key across all the categories and 4 quantiles of the same. I am doing following in R

truck_quartile <- df %>% 
   group_by(Category) %>% 
   summarise(No_of_trailers = n_distinct(key)) %>% 
   do(data.frame(t(quantile(.$No_of_trailers, probs = c(0.25, 0.50, 0.75, 1))))) %>% 
   as.data.frame()

But it gives me only one row as I am summarizing it before quantiles.

Upvotes: 1

Views: 718

Answers (1)

acylam
acylam

Reputation: 18661

As mentioned in the comments, you will only get a single vector of quantiles with "unique number of vehicles falls in different categories" as expected. Below is for the case where you want to calculate the quantiles of the number of times each unique vehicle appears in each category.

library(dplyr)

truck_quartile <- df %>% 
  group_by(Category, Key) %>%
  summarize(No_of_trailers = n()) %>%
  group_by(Category) %>%
  do(data.frame(t(quantile(.$No_of_trailers, probs = c(0.25, 0.50, 0.75, 1))))) %>% 
  as.data.frame() %>%
  setNames(c("Category", "25%", "50%", "75%", "100%"))

Result with original df:

  Category 25% 50% 75% 100%
1     0R1D   1   1   1    1
2     1R1D   1   1   1    2
3     2R2D   1   1   1    1

Your original df is a bit unfortunate since it only has one vehicle with duplicates in the same category. So I created df_long by sampling df with replacement.

Result with df_long:

  Category 25% 50% 75% 100%
1     0R1D   1   3   4    5
2     1R1D   3   4   6   11
3     2R2D   1   2   2    4

Note: Probably not a good idea to name variables with only numbers and special symbols, but it's fine if you only want a good looking table without actually using the columns for further calculation.

Data:

library(data.table)

df = fread("No.        Key             Category
           1          ABC123           0R1D
           2          ABC567           0R1D
           3          DEF444           1R1D
           4          FRT433           1R1D
           5          FRT433           1R1D
           6          TYU412           2R2D
           7          BEC123           0R1D
           8          BCY567           0R1D
           9          DEO444           1R1D
           10         FRJ433           1R1D
           11         FRK433           1R1D
           12         TYL412           2R2D")

set.seed(123)
df_long = data.frame(Key = sample(df$Key, 100, replace = TRUE),
                     Category = sample(df$Category, 100, replace = TRUE))

Upvotes: 1

Related Questions