Reputation: 923
Hi I have a data frame which has ID and topic_number (top) as shown below:
structure(list(ID = c(1013515, 1013515, 1013515, 1013515, 1018446,
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446,
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446,
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446,
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446,
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446,
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446,
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446,
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446,
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446,
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446,
1018446, 1018446, 1018446, 1018446, 1018446, 1018446, 1018446,
1018446, 1018446), top = c(2L, 8L, 8L, 4L, 2L, 7L, 3L, 7L, 7L,
1L, 8L, 7L, 1L, 2L, 3L, 7L, 1L, 2L, 1L, 1L, 8L, 7L, 7L, 1L, 1L,
2L, 2L, 1L, 7L, 3L, 2L, 2L, 7L, 3L, 2L, 3L, 3L, 7L, 7L, 7L, 2L,
2L, 3L, 2L, 7L, 8L, 1L, 7L, 2L, 7L, 8L, 7L, 4L, 7L, 2L, 7L, 8L,
2L, 2L, 7L, 2L, 4L, 3L, 7L, 7L, 7L, 2L, 3L, 7L, 2L, 1L, 7L, 3L,
7L, 3L, 2L, 7L, 4L, 8L, 2L, 2L, 2L, 7L, 3L)), row.names = c(NA,
-84L), class = c("tbl_df", "tbl", "data.frame"))
I know that I have 8 topics in general and what I want to do is find that for each ID, what percentage is for which topic and I want to show this in column format. So I want to see 8 new columns generated and in each column the ratio is shown, so for example:
ID total topic_1 topic_2 topic_3 topic_4 topic_5 topic_6 topic_7 topic_8
1013515 4 0 0.25 0 0.25 0 0 0 0.5
1018446 80 0.125 0.275 0.15 0.037 0 0 0.337 0.075
I started with:
data <- dd %>% group_by(ID) %>% count(top)
but don't know how to change this to ratio and generate the columns
Upvotes: 2
Views: 52
Reputation: 388982
count
number of rows for each ID
and top
value.ID
calculate the percentage/ratio and sum
the count valuelibrary(dplyr)
library(tidyr)
dd %>%
count(ID, top) %>%
group_by(ID) %>%
mutate(perc = prop.table(n),
n = sum(n)) %>%
ungroup() %>%
arrange(top) %>%
pivot_wider(names_from = top, values_from = perc,
names_prefix = 'topic_', values_fill = 0)
# ID n topic_1 topic_2 topic_3 topic_4 topic_7 topic_8
# <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1018446 80 0.125 0.275 0.15 0.0375 0.338 0.075
#2 1013515 4 0 0.25 0 0.25 0 0.5
Upvotes: 3