OAK
OAK

Reputation: 3166

SQL Find Last 30 Days records count grouped by

I am trying to retrieve the count of customers daily per each status in a dynamic window - Last 30 days. The result of the query should show each day how many customers there are per each customer status (A,B,C) for the Last 30 days (i.e today() - 29 days). Every customer can have one status at a time but change from one status to another within the customer lifetime. The purpose of this query is to show customer 'movement' across their lifetime. I've generated a series of date ranging from the first date a customer was created until today.

I've put together the following query but it appears that something I'm doing is incorrect because the results depict most days as having the same count across all statuses which is not possible, each day new customers are created. We checked with another simple query and confirmed that the split between statuses is not equal.

I tried to depict below the data and the SQL I use to reach the optimal result.

Starting point (example table customer_statuses):

   customer_id | status | created_at 
---------------------------------------------------
    abcdefg1234   B      2019-08-22
    abcdefg1234   C      2019-01-17 
    ...   
    abcdefg1234   A      2018-01-18 
    bcdefgh2232   A      2017-09-02
    ghijklm4950   B      2018-06-06

statuses - A,B,C There is no sequential order for the statuses, a customer can have any status at the start of the business relationship and switch between statuses throughout their lifetime.

table customers:

    id        |      f_name      |      country      |    created_at |
---------------------------------------------------------------------
abcdefg1234      Michael                 FR              2018-01-18
bcdefgh2232      Sandy                   DE              2017-09-02
....
ghijklm4950      Daniel                  NL              2018-06-06

SQL - current version:

WITH customer_list AS (
    SELECT
    DISTINCT a.id,
    a.created_at
    FROM
    customers a
),
dates AS (
     SELECT 
    generate_series(
        MIN(DATE_TRUNC('day', created_at)::DATE),
        MAX(DATE_TRUNC('day', now())::DATE),
        '1d'
        )::date AS day
     FROM customers a
), 
customer_statuses AS (
       SELECT
      customer_id,
      status,
      created_at,
      ROW_NUMBER() OVER 
      (
      PARTITION BY customer_id
      ORDER BY created_at DESC
      ) col
    FROM
        customer_status
)
SELECT
   day,
    (
    SELECT
    COUNT(DISTINCT id) AS accounts
    FROM customers 
    WHERE created_at::date BETWEEN day - 29 AND day
   ),
   status
FROM dates d
    LEFT JOIN customer_list cus
    ON d.day = cus.created_at
    LEFT JOIN customer_statuses cs 
    ON cus.id = cs.customer_id
WHERE
    cs.col = 1
GROUP BY 1,3
ORDER BY 1 DESC,3 ASC

Currently what the results from the query look like:

  day    | count | status
-------------------------
2020-01-24   1230     C
2020-01-24   1230     B
2020-01-24   1230     A
2020-01-23   1200     C
2020-01-23   1200     B
2020-02-23   1200     A
2020-02-22   1150     C
2020-02-22   1150     B
...
2017-01-01    50      C
2017-01-01    50      B
2017-01-01    50      A

Two things I've noticed from the results above - most of the time the results show the same count across all statuses in a given day. The second observation, there are days that only two statuses appear - which should not be the case. If now new accounts are created in a given day with a certain status, the count of the previous day should be carried over - right? or is this the problem with the query I created or with the logic I have in mind?? Perhaps I'm expecting a result that will not happen logically?

Required result:

    day    | count | status
-------------------------
2020-01-24   1230     C
2020-01-24   1000     B
2020-01-24   2500     A
2020-01-23   1200     C
2020-01-23   1050     B
2020-02-23   2450     A
2020-02-22   1160     C
2020-02-22   1020     B
2020-02-22   2400     A
...
2017-01-01    10      C
2017-01-01    4       B
2017-01-01   50       A

Thank You!

Upvotes: 0

Views: 1208

Answers (2)

OAK
OAK

Reputation: 3166

I've altered the query a bit because I've noticed that I get duplicate records on the days a customer changes a status - one record with the old status and one records for the new day.

For example output with @Gordon's query:

        dte       |  status 
---------------------------
    2020-02-12         B
    ...                ...
    01.02.2020         A
    01.02.2020         B
    31.01.2020         A
    30.01.2020         A

I've adapted the query, see below, while the results depict the changes between statuses correctly (no duplicate records on the day of change), however, the records continue up until now()::date - interval '1day' and not include now()::date (as in today). I'm not sure why and can't find the correct logic to ensure all of this is how I want it. Dates correctly depict the status of each customer and the status returned include today.

Adjusted query:

select gs.dte, cs.status, count(*)
from (select cs.*,
             lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) - INTERVAL '1day' as next_ca
      from customer_statuses cs
     ) cs cross join lateral
     generate_series(cs.created_at, cs.next_ca, interval '1 day') gs(dte)
where gs.dte < now()::date - interval '30 day'

The two adjustments: The adjustments also seem counter-intuitive as it seems i'm taking the interval day away from one part of the query only to add it to another (which to me seems to yield the same result)

a - added the decrease of 1 day from the lead function (line 3)

lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) - INTERVAL '1 day' as next_ca

b - removed the decrease of 1 day from the next_ca variable (line 6)

generate_series(cs.created_at, cs.next_ca - interval '1 day', interval '1 day')

Example of the output with the adjusted query:

       dte       |  status 
---------------------------
    2020-02-11         B
    ...                ...
    01.02.2020         B
    31.01.2020         A
    30.01.2020         A

Thanks for your help!

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1270553

Your query seems overly complicated. Here is another approach:

  • Use lead() to get when the status ends for each customer status record.
  • Use generate_series() to generate the days.

The rest is just filtering and aggregation:

select gs.dte, cs.status, count(*)
from (select cs.*,
             lead(cs.created_at, 1, now()::date) over (partition by cs.customer_id order by cs.created_at) as next_ca
      from customer_statuses cs
     ) cs cross join lateral
     generate_series(cs.created_at, cs.next_ca - interval '1 day', interval '1 day') gs(dte)
where gs.dte < now()::date - interval '30 day'

Upvotes: 1

Related Questions