Sum unique day count within date ranges

Question

I have a list of date ranges, some of them duplicated, and I need to count the unique days covered by each campaign without counting overlapping days more than once.

For example, in the data below campaign_a has a line running from 08/07/2022 to 15/07/2022, which overlaps the line running from 11/07/2022 to 13/07/2022. Those overlapping days must be counted only once.

The count also needs to be conditional on the campaign name.

campaign	days	line_start	line_end
campaign_a	108	14/07/2022	30/10/2022
campaign_a	61	31/10/2022	31/12/2022
campaign_a	2	11/07/2022	13/07/2022
campaign_a	2	8/07/2022	15/07/2022
campaign_a	108	14/07/2022	30/10/2022
campaign_a	61	31/10/2022	31/12/2022
campaign_a	2	11/07/2022	13/07/2022
campaign_a	2	8/07/2022	10/07/2022
campaign_b	108	14/07/2022	30/10/2022
campaign_b	61	31/10/2022	31/12/2022
campaign_b	2	11/07/2022	13/07/2022
campaign_b	2	8/07/2022	10/07/2022
campaign_b	108	14/07/2022	30/10/2022
campaign_b	61	31/10/2022	31/12/2022
campaign_b	2	11/07/2022	13/07/2022
campaign_b	2	8/07/2022	10/07/2022
campaign_b	108	14/07/2022	30/10/2022
campaign_b	61	31/10/2022	31/12/2022
campaign_b	2	11/07/2022	13/07/2022
campaign_b	2	8/07/2022	10/07/2022

Jos Woolley · Accepted Answer

Again, assuming the data is in A2:D21 and a campaign name, for example "campaign_a", is in G2:

=LET(
    δ, A$2:D$21,
    ζ, FILTER(δ, INDEX(δ, , 1) = G2),
    α, INDEX(ζ, , 3),
    β, INDEX(ζ, , 4),
    ξ, SEQUENCE(MAX(β) - MIN(α) + 1, , MIN(α)),
    γ, BYROW(ξ, LAMBDA(λ, SUM(BYROW(CHOOSE({1, 2}, α, β),
        LAMBDA(κ, N(MEDIAN(κ, λ) = λ)))))),
    COUNT(FILTER(ξ, γ > 0))
)

Copy down to give similar results for campaigns in G3, G4, etc.
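As a cross-check outside Excel, the same count can be sketched in Python. This is not part of the formula above: it is a minimal sketch that assumes the dates are day/month/year strings and that every range includes both its start and its end day. The function name `unique_days` and the trimmed sample rows are illustrative only.

```python
from datetime import datetime, timedelta


def unique_days(rows, campaign):
    """Count the distinct calendar days covered by any range of one campaign."""
    covered = set()
    for name, start, end in rows:
        if name != campaign:
            continue
        first = datetime.strptime(start, "%d/%m/%Y").date()
        last = datetime.strptime(end, "%d/%m/%Y").date()
        # Add every day from first to last, inclusive; the set drops repeats.
        for offset in range((last - first).days + 1):
            covered.add(first + timedelta(days=offset))
    return len(covered)


# Distinct campaign_a ranges from the question, plus one campaign_b row
# to show the campaign condition. Duplicate rows would not change the result.
rows = [
    ("campaign_a", "14/07/2022", "30/10/2022"),
    ("campaign_a", "31/10/2022", "31/12/2022"),
    ("campaign_a", "11/07/2022", "13/07/2022"),
    ("campaign_a", "8/07/2022", "15/07/2022"),
    ("campaign_a", "8/07/2022", "10/07/2022"),
    ("campaign_b", "11/07/2022", "13/07/2022"),
]

print(unique_days(rows, "campaign_a"))  # 177
```

The campaign_a ranges cover 8 July to 31 December 2022 without gaps, which is 177 days.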

Explanation

The best way to understand this approach is to consider a simple example like the following one using numbers:

Campaign	Start	End
campaign_a	1	1
campaign_a	1	3
campaign_a	5	5
campaign_a	5	5
campaign_a	6	6

The formula builds a grid whose columns cover the timeframe from 1 to 6. This is the SEQUENCE part of the formula (in the screenshot it is represented by the range D1:I1):

SEQUENCE(MAX(β) - MIN(α) + 1, , MIN(α))

The MEDIAN calculation is a way of filling the grid (schedule) with 1 or 0 values. For each row, it ensures that every cell from the start to the end of that row's interval is filled with 1. The calculation is as follows:

MEDIAN(start, end, counter)

Let's say for the interval [2,4] we can check the following output based on different counters (from 1 to 5):

MEDIAN(2,4,1) -> 2 <> 1 (out of the interval)
MEDIAN(2,4,2) -> 2 =  2
MEDIAN(2,4,3) -> 3 =  3
MEDIAN(2,4,4) -> 4 =  4 
MEDIAN(2,4,5) -> 4 <> 5 (out of the interval)

So if counter is within the interval [start, end], MEDIAN returns counter. Therefore the comparison

MEDIAN(start, end, counter) = counter

is TRUE for every counter from start to end and FALSE otherwise. N converts these results to 1 and 0, which fills the grid.
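The median test can be tried outside Excel as well. The following is a small Python sketch, not the Excel formula itself; the helper name `in_interval` is made up for illustration, and it assumes start <= end.

```python
from statistics import median


def in_interval(start, end, counter):
    """True when counter lies in [start, end], using the median test."""
    return median([start, end, counter]) == counter


# Interval [2, 4] checked against counters 1 to 5, as in the table above.
flags = [int(in_interval(2, 4, counter)) for counter in range(1, 6)]
print(flags)  # [0, 1, 1, 1, 0]
```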

Note: This solution assumes that the start and end values are not empty. MEDIAN ignores empty cells, so if B1 and C1 are both empty, MEDIAN(B1, C1, counter) returns counter and the day is wrongly treated as covered.
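The effect of empty cells can be imitated in Python. This sketch assumes that an empty cell is represented by `None` and that skipping `None` mimics how MEDIAN ignores blank references; `excel_like_median` and `covers` are hypothetical helper names, and the guard shown is one possible way to avoid the problem, not something the formula above does.

```python
from statistics import median


def excel_like_median(*cells):
    """Median of the non-empty cells; None stands in for an empty cell."""
    return median([cell for cell in cells if cell is not None])


def covers(start, end, counter):
    """Guarded median test: a row with a missing bound never covers a day."""
    if start is None or end is None:
        return False
    return excel_like_median(start, end, counter) == counter


# Unguarded test with both bounds empty: the median of [7] is 7,
# so day 7 would be marked as covered by mistake.
print(excel_like_median(None, None, 7) == 7)  # True
print(covers(None, None, 7))                  # False
print(covers(5, 9, 7))                        # True
```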

Here is the graphical representation of the process described before:

The SUM of each column counts how many intervals cover that column (that day). Because overlapping days must be counted only once, we only need to count the columns whose sum is greater than 0. For example:

=FILTER(D1:I1,D7:I7>0) -> [1,2,3,5,6]

so the total duration is 5 (the number of items in the FILTER output).
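The whole grid for the numeric example can be reproduced in a few lines of Python. This is only an illustration of the logic, with variable names (`timeline`, `grid`, `column_sums`) chosen here rather than taken from the formula.

```python
# Start and end of each row in the numeric example.
rows = [(1, 1), (1, 3), (5, 5), (5, 5), (6, 6)]

# Timeline from the smallest start to the largest end (the SEQUENCE step).
first = min(start for start, _ in rows)
last = max(end for _, end in rows)
timeline = list(range(first, last + 1))

# One grid row per interval: 1 where the day is inside it, otherwise 0.
grid = [[int(start <= day <= end) for day in timeline] for start, end in rows]

# Column sums give the number of intervals covering each day.
column_sums = [sum(column) for column in zip(*grid)]

# Keep the days covered at least once and count them.
covered = [day for day, total in zip(timeline, column_sums) if total > 0]

print(column_sums)   # [2, 1, 1, 0, 2, 1]
print(covered)       # [1, 2, 3, 5, 6]
print(len(covered))  # 5
```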

The rest is to put this logic in Excel in a way you can iterate over all elements.

Note: If the range of the grid is too wide, let's say you have three items:

Item   Start End
item1  1     2
item2  3     4
item3  5     1000

the formula generates a grid of 1,000 elements (1 to 1000) just to calculate the total duration of three items. Keep in mind that for a large data set with such characteristics this may hurt performance.
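For comparison, a different technique avoids the grid entirely: sort the intervals and merge the ones that overlap or touch, then add up the lengths of the merged blocks. The Python sketch below shows that alternative; it is not what the Excel formula does, `total_covered` is a name chosen for illustration, and it assumes integer, inclusive intervals with start <= end.

```python
def total_covered(intervals):
    """Number of integers covered by the union of inclusive intervals."""
    total = 0
    block_start = block_end = None
    for start, end in sorted(intervals):
        if block_end is None or start > block_end + 1:
            # A gap before this interval: close the current block, start a new one.
            if block_end is not None:
                total += block_end - block_start + 1
            block_start, block_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            block_end = max(block_end, end)
    if block_end is not None:
        total += block_end - block_start + 1
    return total


print(total_covered([(1, 2), (3, 4), (5, 1000)]))                  # 1000
print(total_covered([(1, 1), (1, 3), (5, 5), (5, 5), (6, 6)]))     # 5
```

The work here depends on the number of intervals rather than on the width of the timeline, so three items spanning 1 to 1000 need three steps instead of a 1,000-element grid.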
