Jonathan Devereux
Jonathan Devereux

Reputation: 175

Sum unique day count within a date ranges

I have a list of duplicate date ranges where I need to sum all dates unique to a campaign dates, without duplicating overlapping days.

For example in the data below, there are days in campaign_a running between 08/07/2022 and 15/07/2022 which overlap the 11/07/2022 and 13/07/2022 which need to not duplicate when summing.

The summing also needs to take the campaign name as a conditional.

campaign days line_start line_end
campaign_a 108 14/07/2022 30/10/2022
campaign_a 61 31/10/2022 31/12/2022
campaign_a 2 11/07/2022 13/07/2022
campaign_a 2 8/07/2022 15/07/2022
campaign_a 108 14/07/2022 30/10/2022
campaign_a 61 31/10/2022 31/12/2022
campaign_a 2 11/07/2022 13/07/2022
campaign_a 2 8/07/2022 10/07/2022
campaign_b 108 14/07/2022 30/10/2022
campaign_b 61 31/10/2022 31/12/2022
campaign_b 2 11/07/2022 13/07/2022
campaign_b 2 8/07/2022 10/07/2022
campaign_b 108 14/07/2022 30/10/2022
campaign_b 61 31/10/2022 31/12/2022
campaign_b 2 11/07/2022 13/07/2022
campaign_b 2 8/07/2022 10/07/2022
campaign_b 108 14/07/2022 30/10/2022
campaign_b 61 31/10/2022 31/12/2022
campaign_b 2 11/07/2022 13/07/2022
campaign_b 2 8/07/2022 10/07/2022

Upvotes: 1

Views: 357

Answers (2)

Jos Woolley
Jos Woolley

Reputation: 9062

Again, assuming "campaign_a", for example, in G2:

=LET(
    δ, A$2:D$21,
    ζ, FILTER(δ, INDEX(δ, , 1) = G2),
    α, INDEX(ζ, , 3),
    β, INDEX(ζ, , 4),
    ξ, SEQUENCE(MAX(β) - MIN(α) + 1, , MIN(α)),
    γ, BYROW(ξ, LAMBDA(λ, SUM(BYROW(CHOOSE({1, 2}, α, β), 
       LAMBDA(κ, N(MEDIAN(κ, λ) = λ)))))),
       COUNT(FILTER(ξ, γ > 0))
)

Copy down to give similar results for campaigns in G3, G4, etc.

Explanation

The best way to understand this approach is to consider a simple example like the following one using numbers:

Campaing Start End
campaing_a 1 1
campaing_a 1 3
campaing_a 5 5
campaing_a 5 5
campaing_a 6 6

What it does is build a grid with a timeframe from 1 to 6. This is the SEQUENCE part of the formula (it is represented in the D1:H1 range see next screenshot):

 SEQUENCE(MAX(β) - MIN(α) + 1, , MIN(α))

The MEDIAN calculation is a way of filling the grid (schedule) with 1 or 0 values. It ensures that from the start to the end of each row is filled with 1's. The calculation is as follows:

MEDIAN(start, end, counter)

Let's say for the interval [2,4] we can check the following output based on different counters (from 1 to 5):

MEDIAN(2,4,1) -> 2 <> 1 (out of the interval)
MEDIAN(2,4,2) -> 2 =  2
MEDIAN(2,4,3) -> 3 =  3
MEDIAN(2,4,4) -> 4 =  4 
MEDIAN(2,4,5) -> 4 <> 5 (out of the interval)

So if counter is within the range of [start, end], then the MEDIAN returns counter. Therefore:

MEDIAN(start, end, counter) = counter

is a way to generate 1 for the entire period from start to end in the grid, otherwise returns 0.

Note: It is important to consider that this solution assumes a non-empty start or end value, otherwise it gives a wrong result, because MEDIAN(B1, C1, counter) if B1 and C1 are empty it returns counter.

Here is the graphical representation of the process described before:

grid sample

The SUM on each column counts the number of overlaps per column (per day). Because we don't want to count the overlaps as part of the duration, we just need to count if the sum is bigger than 0. For example:

=FILTER(D1:I1,D7:I7>0) -> [1,2,3,5,6]

so the total duration is 5 (counting the number of items of FILTER output)

The rest is to put this logic in Excel in a way you can iterate over all elements.

Note: If the range of the grid is too wide, let's say you have three items:

Item   Start End
item1  1     2
item2  3     4
item3  5     1000

it generates a grid from 1-1000 elements just for calculating the total duration of three items, so take this into consideration that for a large data set with such characteristics it may impact the performance.

Upvotes: 3

David Leal
David Leal

Reputation: 6769

Basically, this problem is similar to calculating the project's total duration (in days). Where in this case each campaign's data can be assimilated as project tasks with start and end dates. The only difference here is that gaps don't contribute to project duration.

Let's illustrate it with the following example, since dates are represented in Excel as a positive integer, to simplify the calculation we can use just numbers as follow (Sample Data 1):

sample gannt chart

From the graph is easy to identify that the total duration will be 13 days. Notice that the gap in columns K and P doesn't count for the total duration.

Based on this graphical representation we can build a logic for calculating the durations as we progress from the top of the Gantt chart to the bottom.

The logic to calculate the duration is based on identifying overlapping duration for each task. The condition to identify two intervals A,B overlap is the following:

AND(startA <= endB, endA >= startB)

we are going to iterate over all tasks and on each step update the new start and end dates and update the duration. Let's say we are going to calculate the new start and end dates for two consecutive rows represented by A and B intervals. The combined interval will be:

[MIN(startA, startB), MAX(endA, endB)]

If the previous duration is represented by the variable duration, then the updated duration will be:

IF(endB > endA, duration + (endB - endA), duration)))

i.e, if the next interval (B) ends after A, then we need to update the duration with the difference (endB - endA).

If the intervals don't overlap, then the duration will be updated as follow:

duration = duration + endB - startB + 1

and the next start and end date will be the startB, endB for the next calculation.

The above process can now be put in Excel terms as follow:

=LET(set, $A$2:$C$12, campaing, T2, campaings, INDEX(set,,1),
 subset, SORT(FILTER(set, campaings=campaing),2),starts, INDEX(subset,,2),
 ends, INDEX(subset,,3),intervals, starts&","&ends,
 SPLIT, LAMBDA(x, 1*TEXTSPLIT(x,",")), JOIN, LAMBDA(y, TEXTJOIN(",",,y)),
 calc, REDUCE("0,0,0", intervals, LAMBDA(acc,i, LET(accValues, SPLIT(acc),
   startA, INDEX(accValues,1,1), endA, INDEX(accValues,1,2),
   duration, INDEX(accValues,1,3), values, SPLIT(i), startB, INDEX(values,1,1), 
   endB, INDEX(values,1,2),
   IF(duration=0,JOIN(HSTACK(startB, endB, endB-startB+1)),
    IF(AND(startA <= endB, endA >= startB),
      JOIN(HSTACK(MIN(startA, startB), MAX(endA, endB),
        IF(endB > endA, duration + (endB-endA), duration))),
      JOIN(HSTACK(startB, endB, duration + endB-startB+1))
    ))))), INDEX(SPLIT(calc),1,3)
)

and here is the output:

output for Sample Data 1

Back to the original input data from the question for both campaigns, the output will be:

output for the sample data of the question

Formula Explanation

Now that we know the logic to calculate the total duration, let's explain the excel implementation.

We use LET function to define the required variables and to avoid repeating the same calculation.

In order to be able to build the logic explained before, we need to sort the input data based on the start date.

For the calculation, we need to use an Excel function that allows using a cumulative calculation based on the previous iteration. We can use it for that REDUCE or SCAN. We use REDUCE because at the end we need a scalar number with the total, not an array, but SCAN is a good option for debugging purposes. In order to do that you can replace REPLACE with SCAN and the following line at the end: INDEX(SPLIT(calc),,3) with: calc.

For Sample Data 1 this will be SCAN output, so you can check in each iteration the CVS values stored in the accumulator (start, end, duration)

1,2,2
1,4,4
1,4,4
5,6,6
7,7,7
9,10,9
11,11,10
11,11,10
12,12,11
14,14,12
14,15,13

Because on every iteration, we need to track three values: start date, end date, and duration, but REDUCE function iterates over a single element of the input array. We are going to build a record in string format via Comma-separated-value format (CSV) and then extract the information back in every iteration. In order to do that we define the following two user LAMBDA functions:

SPLIT, LAMBDA(x, 1*TEXTSPLIT(x,","))
JOIN, LAMBDA(y, TEXTJOIN(",",,y))

Notice because we are dealing with numbers we ensure the SPLIT function returns numbers, not texts as it is the output of TEXTSPLIT. We do the conversion by multiplying the output of TEXTSPLIT by 1.

The accumulator (acc) was initialized as all values equal to zero (0,0,0), so for the first iteration we just assign the information from the first row (start, end, and duration), then we iterate following the logic explained before and use variables to extract back the information stored in CSV format.

Array Version

Here is the formula for the array version, that returns the unique days for all the campaigns:

=LET(set, $W$2:$Y$21, campaings, INDEX(set,,1), campaingsUx, UNIQUE(campaings),
 SPLIT, LAMBDA(x, 1*TEXTSPLIT(x,",")), JOIN, LAMBDA(y, TEXTJOIN(",",,y)),
 byRowResult, BYROW(campaingsUx, LAMBDA(campaing, LET(subset, 
  SORT(FILTER(set, campaings=campaing),2),starts, INDEX(subset,,2),
  ends, INDEX(subset,,3),intervals, starts&","&ends,
  calc, REDUCE("0,0,0", intervals, LAMBDA(acc,i, LET(accValues, SPLIT(acc),
    startA, INDEX(accValues,1,1), endA, INDEX(accValues,1,2),
    duration, INDEX(accValues,1,3), values, SPLIT(i),
    startB, INDEX(values,1,1),  endB, INDEX(values,1,2),
    IF(duration=0,JOIN(HSTACK(startB, endB, endB-startB+1)),
     IF(AND(startA <= endB, endA >= startB),
       JOIN(HSTACK(MIN(startA, startB), MAX(endA, endB),
         IF(endB > endA, duration + (endB-endA), duration))),
       JOIN(HSTACK(startB, endB, duration + endB-startB+1))))
  ))), INDEX(SPLIT(calc),1,3)
 ))), HSTACK(campaingsUx, byRowResult)
)

And the output will be:

campaign_a  177
campaign_b  177

Upvotes: 1

Related Questions