Reputation: 724
I have 2 fields that have % value which shows how aligned the string in field A is to field B using a fuzzy match method and dividing that distance value by the length of field A.
Example:
R_Number A B distance len_A len_B A_percent B_percent
1 ABC Microsoft Microsoft 0 9 9 0.00000000 0.00000000
2 CBD Microsoft Corporation Microsoft INC 11 21 13 0.52380952 0.84615385
3 DDV Microsoft Corp Microsoft corp 1 14 14 0.07142857 0.07142857
4 ABC Microsoft inc Microsoft 4 13 9 0.30769231 0.44444444
5 CBD Microsoft Microsoft Corporation 12 9 21 1.33333333 0.57142857
6 DDV Microsoft INC Microsoft Corp 4 13 14 0.30769231 0.28571429
What I would like to do is group by R_Number into quantiles of 0.25, 0.5, 0.75 and 1 based on the % value in A_percent and B_percent into a graph/and-or table.
My desired output would look something like this where each quantile shows the % of values that fell in each quantile region:
R_Number `25%` `50%` `75%` `100%`
<chr> <dbl> <dbl> <dbl> <dbl>
1 ABC 0.143 0.286 0.297 0.308
2 CBD 0.706 0.889 1.11 1.33
3 DDV 0 0.0357 0.130 0.308
Any help would be great - newish to R.
Upvotes: 0
Views: 69
Reputation: 3134
This kind of operation can be handled with tidyverse functions.
library(tidyverse)
# dummy data
DF <- tibble(R_value = rep(LETTERS[1:3], each=3),
A_percent = runif(9))
DF
# A tibble: 9 x 2
# R_value A_percent
# <chr> <dbl>
#1 A 0.965
#2 A 0.926
#3 A 0.835
#4 B 0.361
#5 B 1.00
#6 B 0.366
#7 C 0.153
#8 C 0.917
#9 C 0.307
So first we want to find the quantile for each value of A_percent
. We can use cut()
for that. I'm not sure how you want to include B_percent
along with A_percent
.
DF %>%
mutate(A_quantile = cut(A_percent,
breaks = c(0,.25,.5,.75,1),
labels = c("25%","50%","75%","100%")))
Then we want to count the number of rows from each R_number in each quantile.
DF %>%
mutate(A_quantile = cut(A_percent,
breaks = c(0,.25,.5,.75,1),
labels = c("25%","50%","75%","100%"))) %>%
group_by(R_value, A_quantile) %>%
summarize(n = n())
Finally we can make it into a wide data frame, to have the quantiles as separate columns.
DF %>%
mutate(A_quantile = cut(A_percent,
breaks = c(0,.25,.5,.75,1),
labels = c("25%","50%","75%","100%"))) %>%
group_by(R_value, A_quantile) %>%
summarize(n = n()) %>%
pivot_wider(names_from=A_quantile,
values_from=n,
values_fill=0)
# A tibble: 3 x 4
# Groups: R_value [3]
# R_value `100%` `50%` `25%`
# <chr> <int> <int> <int>
#1 A 3 0 0
#2 B 1 2 0
#3 C 1 1 1
Upvotes: 1