Reputation: 667
I have a dataframe which has sales at a ppg , product level, I want to find out how many products contribute towards particular % (ex 75%) of the sale like testing a pareto principle.
The data is
df= structure(list(Ppg = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1",
"p2"), class = "factor"), product = structure(c(1L, 2L, 3L, 4L,
1L, 2L, 3L), .Label = c("A", "B", "C", "D"), class = "factor"),
sales = c(50, 40, 30, 80, 100, 70, 30)), .Names = c("Ppg",
"product", "sales"), row.names = c(NA, -7L), class = "data.frame")
> df
Ppg product sales
1 p1 A 50
2 p1 B 40
3 p1 C 30
4 p1 D 80
5 p2 A 100
6 p2 B 70
7 p2 C 30
I retrieved the cumulative sum using dplyr
df %>% group_by(Ppg) %>% summarise(sale = sum(sales) %>% mutate(c1 = cumsum(sales))
Ppg product sales c1
<fctr> <fctr> <dbl> <dbl>
1 p1 A 50 50
2 p1 B 40 90
3 p1 C 30 120
4 p1 D 80 200
5 p2 A 100 100
6 p2 B 70 170
7 p2 C 30 200
Is there any way to
i) calculate proportion of sales (based on cumsum)
ii) How many distinct products contributed toward certain % of sales.
exmple for ppg p1, 2 distinct products (A & B combind give 75 % of sales)
so finally something like below would be ideal
ppg Number_Products_towards_75%
p1 2
p2 1
Upvotes: 1
Views: 791
Reputation: 1547
Assuming you're fine using the order that the product are currently in to get you answer (since reordering the rows would get you different results):
For 1, you can get the result with an extra mutate. Just divide the cumulative sum by the sum of all the sales in that group:
df %>%
group_by(Ppg) %>%
mutate(c1 = cumsum(sales)) %>%
mutate(percent = c1 / sum(sales))
Gets you:
# A tibble: 7 x 5
# Groups: Ppg [2]
Ppg product sales c1 percent
<fctr> <fctr> <dbl> <dbl> <dbl>
1 p1 A 50.0 50.0 0.250
2 p1 B 40.0 90.0 0.450
3 p1 C 30.0 120 0.600
4 p1 D 80.0 200 1.00
5 p2 A 100 100 0.500
6 p2 B 70.0 170 0.850
7 p2 C 30.0 200 1.00
For 2, you could then use a mutate to add a column for if that product is below the threshold and summarize to count the products below the threshold (and then add one to the count since one more would get you over it).
threshold <- 0.5
df %>%
group_by(Ppg) %>%
mutate(c1 = cumsum(sales)) %>%
mutate(percent = c1 / sum(sales)) %>%
mutate(isbelowthreshold = percent < threshold) %>% # add a column for if it's below the threshold
summarize(count = sum(isbelowthreshold) + 1) # we need to add one since one extra product will put you over the threshold
gets you:
# A tibble: 2 x 2
Ppg count
<fctr> <dbl>
1 p1 3.00
2 p2 1.00
But again this depends on the order of the products. Consider ordering them from highest to lowest value first? Something like
df %>%
group_by(Ppg) %>%
arrange(Ppg, desc(sales))
Upvotes: 2