nicnhus
nicnhus

Reputation: 31

SQL performance issues with window functions on daily basis

Given ~23 million users, what is the most efficient way to compute the cumulative number of logins within the last X months for any given day (even when no login was performed) ? Start date of a customer is its first ever login, end date is today.

Desired output

c_id        day            nb_logins_past_6_months
----------------------------------------------
1           2019-01-01     10
1           2019-01-02     10
1           2019-01-03     9
...
1           today          5

➔ One line per user per day with the number of logins between current day and 179 days in the past

Approach 1

 1. Cross join each customer ID with calendar table
 2. Left join on login table on day
 3. Compute window function (i.e. `sum(nb_logins) over (partition by c_id order by day rows between 179 preceding and current row)`)

+ Easy to understand and mantain
- Really heavy, quite impossible to run on daily basis
- Incremental does not bring much benefit : still have to go 179 days in the past

Approach 2

 1. Cross join each customer ID with calendar table 
 2. Left join on login table on day between today and 179 days in the past
 3. Group by customer ID and day to get nb logins within 179 days

+ Easier to do incremental
- Table at step 2 is exceeding 300 billion rows

What is the common way to deal with this knowing this is not the only use case, we have to compute other columns like this (nb logins in the past 12 months etc.)

Upvotes: 1

Views: 1500

Answers (2)

Rick James
Rick James

Reputation: 142298

For performance, don't do the entire task all at once. Instead, gather subtotals at the end of each month (or day or whatever makes sense for your data). Then SUM up the subtotals to provide the 'report'.

More discussion (with a focus on MySQL): http://mysql.rjweb.org/doc.php/summarytables

(You should tag questions with the specific product; different products have different syntax/capability/performance/etc.)

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269843

In standard SQL, you would use:

select l.*,
       count(*) over (partition by customerid
                      order by login_date
                      range between interval '6 month' preceding and current row
                     ) as num_logins_180day
from logins l;

This assumes that the logins table has a date of the login with no time component.

I see no reason to multiply 23 million users by 180 days to generate a result set in excess of 4 million rows to answer this question.

Upvotes: 1

Related Questions