Dinho
Dinho

Reputation: 724

Grouping values into quantiles based on their % value and visualize in R?

I have 2 fields that have % value which shows how aligned the string in field A is to field B using a fuzzy match method and dividing that distance value by the length of field A.

Example:

  R_Number                     A                     B distance len_A len_B  A_percent  B_percent
1      ABC             Microsoft             Microsoft        0     9     9 0.00000000 0.00000000
2      CBD Microsoft Corporation         Microsoft INC       11    21    13 0.52380952 0.84615385
3      DDV        Microsoft Corp        Microsoft corp        1    14    14 0.07142857 0.07142857
4      ABC         Microsoft inc             Microsoft        4    13     9 0.30769231 0.44444444
5      CBD             Microsoft Microsoft Corporation       12     9    21 1.33333333 0.57142857
6      DDV         Microsoft INC        Microsoft Corp        4    13    14 0.30769231 0.28571429

What I would like to do is group by R_Number into quantiles of 0.25, 0.5, 0.75 and 1 based on the % value in A_percent and B_percent into a graph/and-or table.

My desired output would look something like this where each quantile shows the % of values that fell in each quantile region:

 R_Number `25%`  `50%` `75%` `100%`
  <chr>    <dbl>  <dbl> <dbl>  <dbl>
1 ABC      0.143 0.286  0.297  0.308
2 CBD      0.706 0.889  1.11   1.33 
3 DDV      0     0.0357 0.130  0.308

Any help would be great - newish to R.

Upvotes: 0

Views: 69

Answers (1)

Alexlok
Alexlok

Reputation: 3134

This kind of operation can be handled with tidyverse functions.

library(tidyverse)

# dummy data
DF <- tibble(R_value = rep(LETTERS[1:3], each=3),
             A_percent = runif(9))
DF
# A tibble: 9 x 2
#  R_value A_percent
#  <chr>       <dbl>
#1 A           0.965
#2 A           0.926
#3 A           0.835
#4 B           0.361
#5 B           1.00 
#6 B           0.366
#7 C           0.153
#8 C           0.917
#9 C           0.307

So first we want to find the quantile for each value of A_percent. We can use cut() for that. I'm not sure how you want to include B_percent along with A_percent.

DF %>%
  mutate(A_quantile = cut(A_percent,
                          breaks = c(0,.25,.5,.75,1),
                          labels = c("25%","50%","75%","100%")))

Then we want to count the number of rows from each R_number in each quantile.

DF %>%
  mutate(A_quantile = cut(A_percent,
                          breaks = c(0,.25,.5,.75,1),
                          labels = c("25%","50%","75%","100%"))) %>%
  group_by(R_value, A_quantile) %>%
  summarize(n = n())

Finally we can make it into a wide data frame, to have the quantiles as separate columns.

DF %>%
  mutate(A_quantile = cut(A_percent,
                          breaks = c(0,.25,.5,.75,1),
                          labels = c("25%","50%","75%","100%"))) %>%
  group_by(R_value, A_quantile) %>%
  summarize(n = n()) %>%
  pivot_wider(names_from=A_quantile,
              values_from=n,
              values_fill=0)
# A tibble: 3 x 4
# Groups:   R_value [3]
#  R_value `100%` `50%` `25%`
#  <chr>    <int> <int> <int>
#1 A            3     0     0
#2 B            1     2     0
#3 C            1     1     1

Upvotes: 1

Related Questions