Yellow_truffle
Yellow_truffle

Reputation: 923

Finding a ratio of variable using group_by and showing it in dataframe format

Hi I have a data frame which has ID and topic_number (top) as shown below:

structure(list(ID = c(1013515, 1013515, 1013515, 1013515, 1018446, 
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 
1018446, 1018446), top = c(2L, 8L, 8L, 4L, 2L, 7L, 3L, 7L, 7L, 
1L, 8L, 7L, 1L, 2L, 3L, 7L, 1L, 2L, 1L, 1L, 8L, 7L, 7L, 1L, 1L, 
2L, 2L, 1L, 7L, 3L, 2L, 2L, 7L, 3L, 2L, 3L, 3L, 7L, 7L, 7L, 2L, 
2L, 3L, 2L, 7L, 8L, 1L, 7L, 2L, 7L, 8L, 7L, 4L, 7L, 2L, 7L, 8L, 
2L, 2L, 7L, 2L, 4L, 3L, 7L, 7L, 7L, 2L, 3L, 7L, 2L, 1L, 7L, 3L, 
7L, 3L, 2L, 7L, 4L, 8L, 2L, 2L, 2L, 7L, 3L)), row.names = c(NA, 
-84L), class = c("tbl_df", "tbl", "data.frame"))

I know that I have 8 topics in general and what I want to do is find that for each ID, what percentage is for which topic and I want to show this in column format. So I want to see 8 new columns generated and in each column the ratio is shown, so for example:

ID        total   topic_1   topic_2   topic_3   topic_4   topic_5   topic_6   topic_7   topic_8
1013515     4         0       0.25       0        0.25       0         0         0        0.5
1018446     80      0.125     0.275     0.15     0.037       0         0        0.337     0.075

I started with:

data <- dd %>% group_by(ID) %>%  count(top)

but don't know how to change this to ratio and generate the columns

Upvotes: 2

Views: 52

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388982

  • count number of rows for each ID and top value.
  • For each ID calculate the percentage/ratio and sum the count value
  • Get the data in wide format.
library(dplyr)
library(tidyr)

dd %>%
  count(ID, top) %>%
  group_by(ID) %>%
  mutate(perc = prop.table(n), 
         n = sum(n)) %>%
  ungroup() %>%
  arrange(top) %>%
  pivot_wider(names_from = top, values_from = perc, 
              names_prefix = 'topic_', values_fill = 0)

#       ID     n topic_1 topic_2 topic_3 topic_4 topic_7 topic_8
#    <dbl> <int>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#1 1018446    80   0.125   0.275    0.15  0.0375   0.338   0.075
#2 1013515     4   0       0.25     0     0.25     0       0.5  

Upvotes: 3

Related Questions