thesixmax

Reputation: 164

Calculating value for unique column entries using dplyr

I have not been able to find a similar question across SO. I am quite new to dplyr in general.

Consider a toy data frame of some sales statistics

df <- data.frame(
  added = c("2020-10-05", "2020-10-30", "2020-11-04", "2020-12-10", "2020-12-14"),
  closed = c("", "2020-11-05", "2020-12-10", "", ""),
  value = c(100, 200, 300, 400, 500),
  stage = c("Quote", "Won", "Lost", "Quote", "Quote")
)

It contains the date a deal was added, the date it was closed, the deal value and the current stage of the deal. If a deal is not won or lost, it does not have a closed date. I want to calculate the pipeline value, i.e. the value of deals that are not yet won or lost, for each unique entry in the added and closed columns (since those are the only dates at which the pipeline value can change).

I have gotten it to work through a mix of base R and dplyr. First I create a data frame of dates:

date_df <- data.frame(date = seq(as.Date("2020-10-01"), Sys.Date(), by = "day"))
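
The hard-coded start could also be derived from the data; a minimal sketch (start_date is only an illustrative name):

# start one day before the earliest "added" date instead of hard-coding it
start_date <- min(as.Date(df$added)) - 1
date_df <- data.frame(date = seq(start_date, Sys.Date(), by = "day"))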

My choice of starting date is arbitrary as long as it is before the first date in the added column (which is why it can just as well be derived from the data, as in the sketch above). Then I evaluate the following expression for each entry in date_df through a loop:

library(tidyverse)

for (i in 1:nrow(date_df)) {
  # sum the value of deals that are open on date_df$date[i]:
  # either added but closed later, or added and not yet Won/Lost
  date_df$pipeline[i] <- df %>%
    filter(
      (added <= date_df$date[i] & closed > date_df$date[i] & closed < Sys.Date()) |
      (added <= date_df$date[i] & stage != "Won" & stage != "Lost")
    ) %>%
    summarise(pipeline = sum(value)) %>%
    pull(pipeline)
}

Which correctly gives me:

> date_df
         date pipeline
1  2020-10-01        0
2  2020-10-02        0
3  2020-10-03        0
4  2020-10-04        0
5  2020-10-05      100
6  2020-10-06      100
7  2020-10-07      100
8  2020-10-08      100
9  2020-10-09      100
10 2020-10-10      100
11 2020-10-11      100
12 2020-10-12      100
13 2020-10-13      100
14 2020-10-14      100
15 2020-10-15      100
16 2020-10-16      100
17 2020-10-17      100
18 2020-10-18      100
19 2020-10-19      100
20 2020-10-20      100
21 2020-10-21      100
22 2020-10-22      100
23 2020-10-23      100
24 2020-10-24      100
25 2020-10-25      100
26 2020-10-26      100
27 2020-10-27      100
28 2020-10-28      100
29 2020-10-29      100
30 2020-10-30      300
31 2020-10-31      300
32 2020-11-01      300
33 2020-11-02      300
34 2020-11-03      300
35 2020-11-04      600
36 2020-11-05      400
37 2020-11-06      400
38 2020-11-07      400
39 2020-11-08      400
40 2020-11-09      400
41 2020-11-10      400
42 2020-11-11      400
43 2020-11-12      400
44 2020-11-13      400
45 2020-11-14      400
46 2020-11-15      400
47 2020-11-16      400
48 2020-11-17      400
49 2020-11-18      400
50 2020-11-19      400
51 2020-11-20      400
52 2020-11-21      400
53 2020-11-22      400
54 2020-11-23      400
55 2020-11-24      400
56 2020-11-25      400
57 2020-11-26      400
58 2020-11-27      400
59 2020-11-28      400
60 2020-11-29      400
61 2020-11-30      400
62 2020-12-01      400
63 2020-12-02      400
64 2020-12-03      400
65 2020-12-04      400
66 2020-12-05      400
67 2020-12-06      400
68 2020-12-07      400
69 2020-12-08      400
70 2020-12-09      400
71 2020-12-10      500
72 2020-12-11      500
73 2020-12-12      500
74 2020-12-13      500
75 2020-12-14     1000
76 2020-12-15     1000
77 2020-12-16     1000
78 2020-12-17     1000
79 2020-12-18     1000
80 2020-12-19     1000
81 2020-12-20     1000
82 2020-12-21     1000

Basically, I sum the value of open deals for each date. However, this strikes me as a very inefficient approach. The real data set is quite extensive, and evaluating the expression for every single date takes quite some time.

I am really only interested in evaluating the expression for each unique entry in added and closed, as described above. Is there an elegant dplyr solution that accomplishes this, (preferably) without creating a new object? The idea is to later plot the result and send it to a shiny app, so a reduction in computing time could be crucial.

Thanks in advance!

EDIT

To show an example for two dates, 2020-11-04 and 2020-11-05:

On 2020-11-04, three deals (entries 1-3 in the data) are neither Lost nor Won, hence the pipeline value for that date is 600.

On 2020-11-05, one deal has now been closed as Won (the second entry in the data), and the pipeline value drops to 400.

So, when a deal is closed, the pipeline value decreases, and when a deal is added, the pipeline value increases.

Values for all dates are added to the original question.
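
The rule can be spot-checked for a single date with a small snippet; just a sketch, where check_date is only an illustrative name and na_if treats the empty closed string as a missing date:

# pipeline value on a given date: deals added on or before it
# and not closed on or before it
check_date <- as.Date("2020-11-04")

df %>%
  mutate(added = as.Date(added),
         closed = as.Date(na_if(closed, ""))) %>%
  filter(added <= check_date,
         is.na(closed) | closed > check_date) %>%
  summarise(pipeline = sum(value))

This returns 600 for 2020-11-04 and 400 for 2020-11-05, matching the table above.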

Upvotes: 2

Views: 68

Answers (2)

AnilGoyal

Reputation: 26238

This approach, which turns each added and closed date into an event and keeps a running total of the deal values, may also be useful:

result <- dff %>%
  mutate(id = row_number(),
         added = as.Date(added),
         closed = as.Date(closed)) %>%
  # one row per event: a deal being added or being closed
  pivot_longer(cols = c("added", "closed"), names_to = "activity", values_to = "dates") %>%
  mutate(activity = factor(activity, levels = c("added", "closed"), ordered = TRUE)) %>%
  arrange(dates, activity) %>%
  # running total: +value when a deal is added, -value when it is closed
  mutate(val = cumsum(value * case_when(activity == "added" ~ 1,
                                        activity == "closed" ~ -1,
                                        TRUE ~ 0))) %>%
  # when several events share a date, keep the lowest running total for that date
  group_by(dates) %>%
  summarise(val = min(val))

# A tibble: 7 x 2
  dates        val
  <date>     <dbl>
1 2020-10-05   100
2 2020-10-30   300
3 2020-11-04   600
4 2020-11-05   400
5 2020-12-10   500
6 2020-12-14  1000
7 NA             0

The dput(dff) used:

> dput(dff)
structure(list(added = structure(c(18540, 18565, 18570, 18606, 
18610), class = "Date"), closed = structure(c(NA, 18571, 18606, 
NA, NA), class = "Date"), value = c(100, 200, 300, 400, 500), 
    stage = c("Quote", "Won", "Lost", "Quote", "Quote")), row.names = c(NA, 
-5L), class = "data.frame")

If you want to plot the results, use complete and fill as follows:

plot <- result %>% filter(!is.na(dates))  %>%
  complete(dates = seq.Date(min(dates), max(dates), by = "day")) %>%
  fill(val) %>%
  ggplot() +
  geom_line(aes(x = dates, y = val))

plot

(Plot of the pipeline value over time)
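
Since the pipeline value only changes at the event dates, geom_step is another option for the plot; just a sketch along the same lines, holding each value until the next event so complete() and fill() are not needed:

result %>%
  filter(!is.na(dates)) %>%
  ggplot(aes(x = dates, y = val)) +
  # geom_step holds each value as a horizontal segment until the next event date
  geom_step()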

Upvotes: 1

Abdessabour Mtk

Reputation: 3888

I don't get why you're using closed < Sys.Date(). Do you have future closed deals? If so, this code will account for it:

df %>%
  mutate(across(closed:added, lubridate::ymd), cl = closed < Sys.Date()) %>%
  pivot_longer(c(closed, added), values_to = "date") %>%
  filter(!is.na(date)) %>%
  group_by(date) %>%
  summarise(pipeline = sum((stage == "Quote" | cl) * value)) %>%
  mutate(pipeline = cumsum(pipeline))
# A tibble: 7 x 2
  date       summed
  <date>      <dbl>
1 2020-10-05    100
2 2020-11-05    300
3 2020-10-30    500
4 2020-12-10   1200
5 2020-11-04   1500
6 2020-12-10   2200
7 2020-12-14   2700

Upvotes: 1
