psrpsrpsr
psrpsrpsr

Reputation: 457

How do you find and count occurrences of sequences being met?

I have a dataset from 1/2016 to 2/2017 for users, months, home location, visited location, and visit count. Users can visit multiple locations per month, or none. The data set contains multiple rows per month, because users can visit different locations.

Here's the data, and I will then walk you through the Rextester link below.

| user_id | year_month | homelocid | visitlocid | fact_count |
|---------|------------|-----------|------------|------------|
| 1234    | 201601     | L9900     | L9987      | 8          |
| 1234    | 201601     | L9900     | L4456      | 1          |
| 1234    | 201602     | L9900     | L9987      | 5          |
| 1234    | 201603     | L9900     | L9987      | 1          |
| 1234    | 201604     | L9900     | L9987      | 4          |
| 1234    | 201605     | L9900     | L9987      | 3          |
| 1234    | 201605     | L9900     | L4456      | 3          |
| 1234    | 201605     | L9900     | L5567      | 26         |
| 1234    | 201606     | L9900     | L4456      | 1          |
| 1234    | 201606     | L9900     | L9987      | 21         |
| 1234    | 201606     | L9900     | L6678      | 2          |
| 1234    | 201606     | L9900     | L5567      | 1          |
| 1234    | 201607     | L9900     | L9987      | 6          |
| 1234    | 201607     | L9900     | L5567      | 22         |
| 1234    | 201607     | L9900     | L6678      | 25         |
| 1234    | 201608     | NULL      | NULL       | NULL       |
| 1234    | 201609     | L9900     | L9987      | 1          |
| 1234    | 201609     | L9900     | L4456      | 3          |
| 1234    | 201610     | L9900     | L9987      | 1          |
| 1234    | 201611     | NULL      | NULL       | NULL       |
| 1234    | 201612     | NULL      | NULL       | NULL       |
| 1234    | 201701     | NULL      | NULL       | NULL       |
| 1234    | 201702     | L9900     | L9987      | 6          |
| 1234    | 201702     | L9900     | L2334      | 1          |
| 5556    | 201601     | L9900     | L3445      | 1          |
| 5556    | 201602     | L9900     | L3445      | 3          |
| 5556    | 201603     | L9900     | L3445      | 1          |
| 5556    | 201604     | L9900     | L3445      | 21         |
| 5556    | 201605     | L9900     | L3445      | 1          |
| 5556    | 201606     | L9900     | L3445      | 2          |
| 5556    | 201607     | NULL      | NULL       | NULL       |
| 5556    | 201608     | NULL      | NULL       | NULL       |
| 5556    | 201609     | L9900     | L3445      | 1          |
| 5556    | 201610     | NULL      | NULL       | NULL       |
| 5556    | 201611     | NULL      | NULL       | NULL       |
| 5556    | 201612     | L9900     | L3445      | 1          |
| 5556    | 201701     | NULL      | NULL       | NULL       |
| 5556    | 201702     | L9900     | L3445      | 1          |
| 9988    | 201601     | L9900     | L2334      | 27         |
| 9988    | 201602     | L9900     | L3445      | 22         |
| 9988    | 201602     | L9900     | L9987      | 1          |
| 9988    | 201602     | L9900     | L2334      | 2          |
| 9988    | 201603     | L9900     | L4556      | 1          |
| 9988    | 201603     | L9900     | L2334      | 2          |
| 9988    | 201603     | L9900     | L3445      | 21         |
| 9988    | 201603     | L9900     | L9987      | 1          |
| 9988    | 201604     | L9900     | L3445      | 25         |
| 9988    | 201604     | L9900     | L2334      | 1          |
| 9988    | 201604     | L9900     | L4556      | 1          |
| 9988    | 201605     | L9900     | L3445      | 26         |
| 9988    | 201605     | L9900     | L2334      | 1          |
| 9988    | 201606     | L9900     | L3445      | 22         |
| 9988    | 201607     | L9900     | L3445      | 5          |
| 9988    | 201608     | L9900     | L3445      | 6          |
| 9988    | 201609     | L9900     | L3445      | 11         |
| 9988    | 201610     | L9900     | L3445      | 17         |
| 9988    | 201611     | L9900     | L3445      | 9          |
| 9988    | 201612     | L9900     | L2334      | 1          |
| 9988    | 201612     | L9900     | L3445      | 14         |
| 9988    | 201701     | L9900     | L3445      | 14         |
| 9988    | 201701     | L9900     | L2334      | 2          |
| 9988    | 201702     | L9900     | L4556      | 1          |
| 9988    | 201702     | L9900     | L3445      | 9          |

I am trying to calculate the number of times that users had:

  1. Greater than 20 fact_count in 1 month
  2. Greater than 20 fact_count 2 months in a row
  3. Greater than 20 fact_count 3 months in a row

I also need to summarize the data on two levels:

  1. On a user_id level (Denominator = # users, Numerator = # users who had >20 fact_count 1, 2, and 3 mos. in a row)
  2. On a user_id AND month level (Denominator = # months for each users (14), Numerator = # users who had >20 fact_count 1, 2, and 3 for mos. in a row)

Here's the desired outcome (note: not query output, just outcome) that I'm looking for from the data set:

+-------------------+------------+------------------+
|                   | User level | User-month level |
+-------------------+------------+------------------+
| 1 month           | 3/3        | 10/42            |
+-------------------+------------+------------------+
| 2 months in a row | 2/3        | 4/42             |
+-------------------+------------+------------------+
| 3 months in a row | 2/3        | 3/42             |
+-------------------+------------+------------------+

Here is my approach to transforming the data to summarize it correctly, which you can find in this Rextester link: http://rextester.com/DBF40287

  1. The mem_month CTE uses CASE to create a 1 or 0 as 'MET_NOTMET' field based on whether the fact_count >20 was exceeded. At this point there are multiple rows reflecting the different visit location ids.
  2. The mem_month_rank CTE uses RANK to find whether a user had a fact_count of >20 by ranking the CASE statement from mem_month. If a user had three visited locations in a month, only one of which had fact_count >20, this will identify the offending row.

Here is the output from these queries, getting closer to a normalized table:

| mem_month                        |   | mem_month_rank                         |
|--------|------------|------------|---|----------------------------------------|
| user_id| year_month | met_notmet |   | user_id| year_month | met_notmet | rnk |
| 1234   | 201601     | 0          |   | 1234   | 201601     | 0          | 1   |
| 1234   | 201601     | 0          |   | 1234   | 201602     | 0          | 1   |
| 1234   | 201602     | 0          |   | 1234   | 201603     | 0          | 1   |
| 1234   | 201603     | 0          |   | 1234   | 201604     | 0          | 1   |
| 1234   | 201604     | 0          |   | 1234   | 201605     | 0          | 1   |
| 1234   | 201605     | 0          |   | 1234   | 201605     | 1          | 2   |
| 1234   | 201605     | 1          |   | 1234   | 201606     | 0          | 1   |
| 1234   | 201606     | 0          |   | 1234   | 201606     | 1          | 2   |
| 1234   | 201606     | 1          |   | 1234   | 201607     | 0          | 1   |
| 1234   | 201606     | 0          |   | 1234   | 201607     | 1          | 2   |
| 1234   | 201607     | 1          |   | 1234   | 201608     | 0          | 1   |
| 1234   | 201607     | 0          |   | 1234   | 201609     | 0          | 1   |
| 1234   | 201607     | 1          |   | 1234   | 201610     | 0          | 1   |
| 1234   | 201608     | 0          |   | 1234   | 201611     | 0          | 1   |
| 1234   | 201609     | 0          |   | 1234   | 201612     | 0          | 1   |
| 1234   | 201609     | 0          |   | 1234   | 201701     | 0          | 1   |
| 1234   | 201610     | 0          |   | 1234   | 201702     | 0          | 1   |
| 1234   | 201611     | 0          |
| 1234   | 201612     | 0          |
| 1234   | 201701     | 0          |
| 1234   | 201702     | 0          |
  1. Next, the mem_month_norm CTE selects mem_month_rank, and inner joins to a subquery of the user, month, and max(rank), creating a normalized table of 1 row per user and month with whether they had an offending >20 fact_count to any location that month.

(list break)

| user_id | year_month | met_notmet |
|---------|------------|------------|
| 1234    | 201601     | 0          |
| 1234    | 201602     | 0          |
| 1234    | 201603     | 0          |
| 1234    | 201604     | 0          |
| 1234    | 201605     | 1          |
| 1234    | 201606     | 1          |
| 1234    | 201607     | 1          |
| 1234    | 201608     | 0          |
| 1234    | 201609     | 0          |
| 1234    | 201610     | 0          |
| 1234    | 201611     | 0          |
| 1234    | 201612     | 0          |
| 1234    | 201701     | 0          |
| 1234    | 201702     | 0          |

OK - here's where I need your help: I don't know the best way to count the occurrences of sequential offending values. What I PRESUME needs to happen is to have some sort 'reset' when a user has 2 months in a row of offending values.

So for example, when calculating for:

  1. 2 months in a row - user_id 1234 should be counted once for having two offending MET_NOTMET values in a row on 201605 and 201606. IF he had an offending value in 201608, I assume this 'reset' would need to occur such that he would be counted again for having 2 months in a row (from 07 to 08).
  2. 3 months in a row - user_id 1234 should be counted once for three MET_NOTMET offending values from (201605 to 201608). Again, IF he had 6 offending values in a row from 201605 to 201610, I assume there would be a 'reset' to count each group of 3 offending values as 1 offense.

Again, the link is: http://rextester.com/DBF40287

Please let me know if you need any clarification or additional info, and I will revise the post. THANK YOU!

Upvotes: 0

Views: 150

Answers (1)

systemjack
systemjack

Reputation: 2985

This is a pretty complicated question and not sure I am getting it, but here's my first thoughts on my understanding of the situation.

This query should give the results for summarized data level 1:

WITH mem_months AS (
    SELECT user_id, year_month, sum(fact_count) AS visits1
    FROM tbl
    GROUP BY 1,2
),
mem_consec AS (
    SELECT user_id, year_month, visits1, visits2, visits3,
        visits1 > 20 AS met1,
        -- need case on these to account for nulls in the lag
        CASE WHEN visits1 > 20 AND visits2 > 20 THEN TRUE ELSE FALSE END AS met2,
        CASE WHEN visits1 > 20 AND visits2 > 20 AND visits3 > 20 THEN TRUE ELSE FALSE END AS met3
    FROM (
        SELECT user_id, year_month, visits1,
            -- Seems like you might want lag instead of lead to the current "month".
            -- Where 3 in a row was met, the attributed month
            -- would be the 3rd in the sequence.
            LAG(visits1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS visits2,
            LAG(visits1, 2) OVER (PARTITION BY user_id ORDER BY year_month) AS visits3
        FROM mem_months
    ) t
),
mem_compliance AS (
    select user_id, bool_or(met1) as any1, bool_or(met2) as any2, bool_or(met3) as any3
    from mem_consec
    group by 1
)
select count(case when any1 then 1 else 0 end) / count(*) as user_level_1,
    count(case when any2 then 1 else 0 end) / count(*) as user_level_2,
    count(case when any3 then 1 else 0 end) / count(*) as user_level_3
from mem_compliance

And, using the same CTEs, this would give summarized data level 2:

select sum(case when met1 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_1,
    sum(case when met2 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_2,
    sum(case when met3 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_3
from mem_consec

At this point I think I'm getting what your actual question is...in the level_2 and level_3 versions of these the count is increased as consecutive months continue. For each 3-in-a-row there will be two 2-in-a-rows counted. And for four in a row you'll get two 3s and three 2s, for example.

Here's an approach to get non-overlapping sums. Must be a way to simplify it but maybe it's a start. The idea is to look for the start and end of a run of met months, then get the length of each run and divide it by the desired set size (2 or 3). Unmet records, met records in the middle and single met records are discarded.

mem_consec_sums AS (
    SELECT user_id, sum(mon2s) AS mon2s, sum(mon3s) AS mon3s
    FROM (
        SELECT user_id,
            ((end_idx - start_idx)/2)::INT AS mon2s,
            ((end_idx - start_idx)/3)::INT AS mon3s
        FROM (
            SELECT user_id, year_month, met_start, idx AS start_idx,
                LEAD(idx, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS end_idx
            FROM (
                SELECT user_id, year_month, met1, idx,
                    CASE WHEN met1 AND pre_met IS NULL OR NOT pre_met) THEN TRUE ELSE FALSE END AS met_start,
                    CASE WHEN met1 AND post_met IS NULL OR NOT post_met) THEN TRUE ELSE FALSE END AS met_end
                FROM (
                    SELECT user_id, year_month, met1,
                        LAG(met1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS pre_met,
                        LEAD(met1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS post_met,
                        ROW_NUMBER() OVER (PARTITION BY tvid ORDER BY year_month) AS idx
                    FROM mem_consec
                )
            ) WHERE (met_start OR met_end) AND NOT (met_start AND met_end)
        ) WHERE met_start
    ) GROUP BY 1
)

Upvotes: 1

Related Questions