av abhishiek
av abhishiek

Reputation: 667

Calculating cumulative proportion sales product count

I have a dataframe which has sales at a ppg , product level, I want to find out how many products contribute towards particular % (ex 75%) of the sale like testing a pareto principle.

The data is

    df= structure(list(Ppg = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1", 
"p2"), class = "factor"), product = structure(c(1L, 2L, 3L, 4L, 
1L, 2L, 3L), .Label = c("A", "B", "C", "D"), class = "factor"), 
    sales = c(50, 40, 30, 80, 100, 70, 30)), .Names = c("Ppg", 
"product", "sales"), row.names = c(NA, -7L), class = "data.frame")





  > df
  Ppg product sales
1  p1       A    50
2  p1       B    40
3  p1       C    30
4  p1       D    80
5  p2       A   100
6  p2       B    70
7  p2       C    30

I retrieved the cumulative sum using dplyr

df %>% group_by(Ppg) %>% summarise(sale = sum(sales) %>% mutate(c1 = cumsum(sales))

          Ppg product sales    c1
  <fctr>  <fctr> <dbl> <dbl>
1     p1       A    50    50
2     p1       B    40    90
3     p1       C    30   120
4     p1       D    80   200
5     p2       A   100   100
6     p2       B    70   170
7     p2       C    30   200

Is there any way to

i) calculate proportion of sales (based on cumsum)

ii) How many distinct products contributed toward certain % of sales.

exmple for ppg p1, 2 distinct products (A & B combind give 75 % of sales)

so finally something like below would be ideal

ppg Number_Products_towards_75%
p1   2
p2   1

Upvotes: 1

Views: 791

Answers (1)

Jacqueline Nolis
Jacqueline Nolis

Reputation: 1547

Assuming you're fine using the order that the product are currently in to get you answer (since reordering the rows would get you different results):

For 1, you can get the result with an extra mutate. Just divide the cumulative sum by the sum of all the sales in that group:

df %>% 
  group_by(Ppg) %>% 
  mutate(c1 = cumsum(sales)) %>%
  mutate(percent = c1 / sum(sales))

Gets you:

# A tibble: 7 x 5
# Groups: Ppg [2]
  Ppg    product sales    c1 percent
  <fctr> <fctr>  <dbl> <dbl>   <dbl>
1 p1     A        50.0  50.0   0.250
2 p1     B        40.0  90.0   0.450
3 p1     C        30.0 120     0.600
4 p1     D        80.0 200     1.00 
5 p2     A       100   100     0.500
6 p2     B        70.0 170     0.850
7 p2     C        30.0 200     1.00 

For 2, you could then use a mutate to add a column for if that product is below the threshold and summarize to count the products below the threshold (and then add one to the count since one more would get you over it).

threshold <- 0.5

df %>% 
  group_by(Ppg) %>% 
  mutate(c1 = cumsum(sales)) %>%
  mutate(percent = c1 / sum(sales)) %>%
  mutate(isbelowthreshold = percent < threshold) %>% # add a column for if it's below the threshold
  summarize(count = sum(isbelowthreshold) + 1) # we need to add one since one extra product will put you over the threshold

gets you:

# A tibble: 2 x 2
  Ppg    count
  <fctr> <dbl>
1 p1      3.00
2 p2      1.00

But again this depends on the order of the products. Consider ordering them from highest to lowest value first? Something like

df %>% 
    group_by(Ppg) %>%
    arrange(Ppg, desc(sales))

Upvotes: 2

Related Questions