datazang
datazang

Reputation: 1169

How to manipulate (aggregate) the data in R?

I have a data set as I've shown below:

df <- tribble(
  ~id,  ~price, ~number_of_book,        
  "1",    10,         3,        
  "1",     5,         1,         
  "2",     7,         4,
  "2",     6,         2, 
  "2",     3,         4,
  "3",     4,         1,
  "4",     5,         1,
  "4",     6,         1,
  "5",     1,         2,
  "5",     9,         3,
)

As you see in the data set, there are 3 books which cost 10 dollar for each book if id is "1" and 1 book that costs 5 dollar. Basically, I want to see the share (%) the number of books for each price bin. Here is my desired data set:

df <- tribble(
  ~id,    ~less_than_three,   ~three-five,  ~five-six, ~more_than_six,     
  "1",          "0%",              "25%",     "0%",         "75%",
  "2",          "0%",              "40%",     "20%",        "40%",
  "3",          "0%",              "100%",    "0%",         "0%",  
  "4",          "0%",              "50%",     "50%",        "0%",
  "5",          "40%",             "0%",      "0%",         "60%",
)

Now, I clustered the prices first. To do this, I run the below code:

out <- cut(df$price, breaks = c(0, 3, 5, 6, 10),
           labels = c("<3","3-5","5-6", ">6")) 

out = table(out) / sum(table(out)) 

But unfortunately, I could not go further because of lack of coding knowledge. Would you help me to get the desired data?

Upvotes: 3

Views: 218

Answers (2)

IceCreamToucan
IceCreamToucan

Reputation: 28675

With dplyr, you can add a column cols which will be used for the column names. Then you can sum the number of books for each col in each id. Next you can compute the percent by dividing these numbers by the sum for that id, then applying scales::percent for formatting as a percent rather than decimal. Now you just need to pivot_wider giving the variables from which to get the names and values, and reorder the columns to match the original label order. (This is a little more involved than the other answer since it accounts for the case when there is >1 row for a given (id, cols/interval) pair, and janitor simplifies things)

labels = c("less_than_three","three_to_five","five_to_six", "more_than_six")

df %>% 
  group_by(id, cols = cut(price, breaks = c(0, 3, 5, 6, 10), labels = labels)) %>% 
  summarise(n = sum(number_of_book)) %>% 
  group_by(id) %>% 
  mutate(pct = scales::percent(n/sum(n), 1)) %>% 
  pivot_wider(id_cols = id, names_from = cols, values_from = pct) %>% 
  select_at(c('id', labels)) %>% 
  ungroup

# # A tibble: 5 x 5
#   id    less_than_three three_to_five five_to_six more_than_six
#   <chr> <chr>           <chr>         <chr>       <chr>        
# 1 1     NA              25%           NA          75%          
# 2 2     40%             NA            20%         40%          
# 3 3     NA              100%          NA          NA           
# 4 4     NA              50%           50%         NA           
# 5 5     40%             NA            NA          60%       

If you want to replace the NAs with 0% (which I think makes sense in this context, and matches the output shown in the question), you can use the method mentioned in the comment below.

df %>% 
  group_by(id, cols = cut(price, breaks = c(0, 3, 5, 6, 10), labels = labels)) %>% 
  summarise(n = sum(number_of_book)) %>% 
  group_by(id) %>% 
  mutate(pct = scales::percent(n/sum(n), 1)) %>% 
  pivot_wider(id_cols = id, names_from = cols, values_from = pct,
              values_fill = list(pct = '0%')) %>% 
  select_at(c('id', labels)) %>% 
  ungroup

# # A tibble: 5 x 5
#   id    less_than_three three_to_five five_to_six more_than_six
#   <chr> <chr>           <chr>         <chr>       <chr>        
# 1 1     0%              57%           0%          43%          
# 2 2     40%             0%            20%         40%          
# 3 3     0%              100%          0%          0%           
# 4 4     0%              50%           50%         0%           
# 5 5     40%             0%            0%          60%         

Upvotes: 1

M--
M--

Reputation: 28825

We can use cut to get the intervals and then using tidyr transform data to wide format and at the end using janitor add the percentages.

library(dplyr)
library(tidyr)
library(janitor)

df %>% 
  mutate(interval = cut(price, c(0,3,5,6,Inf))) %>% 
  select(-price) %>% 
  pivot_wider(names_from = interval, values_from = number_of_book) %>% 
  adorn_percentages()

#>  id (6,Inf] (3,5] (5,6] (0,3]
#>   1    0.75  0.25    NA    NA
#>   2    0.40    NA   0.2   0.4
#>   3      NA  1.00    NA    NA
#>   4      NA  0.50   0.5    NA
#>   5    0.60    NA    NA   0.4

Upvotes: 3

Related Questions