Reputation: 457
I have a dataset from 1/2016 to 2/2017 for users, months, home location, visited location, and visit count. Users can visit multiple locations per month, or none. The data set contains multiple rows per month, because users can visit different locations.
Here's the data, and I will then walk you through the Rextester link below.
| user_id | year_month | homelocid | visitlocid | fact_count |
|---------|------------|-----------|------------|------------|
| 1234 | 201601 | L9900 | L9987 | 8 |
| 1234 | 201601 | L9900 | L4456 | 1 |
| 1234 | 201602 | L9900 | L9987 | 5 |
| 1234 | 201603 | L9900 | L9987 | 1 |
| 1234 | 201604 | L9900 | L9987 | 4 |
| 1234 | 201605 | L9900 | L9987 | 3 |
| 1234 | 201605 | L9900 | L4456 | 3 |
| 1234 | 201605 | L9900 | L5567 | 26 |
| 1234 | 201606 | L9900 | L4456 | 1 |
| 1234 | 201606 | L9900 | L9987 | 21 |
| 1234 | 201606 | L9900 | L6678 | 2 |
| 1234 | 201606 | L9900 | L5567 | 1 |
| 1234 | 201607 | L9900 | L9987 | 6 |
| 1234 | 201607 | L9900 | L5567 | 22 |
| 1234 | 201607 | L9900 | L6678 | 25 |
| 1234 | 201608 | NULL | NULL | NULL |
| 1234 | 201609 | L9900 | L9987 | 1 |
| 1234 | 201609 | L9900 | L4456 | 3 |
| 1234 | 201610 | L9900 | L9987 | 1 |
| 1234 | 201611 | NULL | NULL | NULL |
| 1234 | 201612 | NULL | NULL | NULL |
| 1234 | 201701 | NULL | NULL | NULL |
| 1234 | 201702 | L9900 | L9987 | 6 |
| 1234 | 201702 | L9900 | L2334 | 1 |
| 5556 | 201601 | L9900 | L3445 | 1 |
| 5556 | 201602 | L9900 | L3445 | 3 |
| 5556 | 201603 | L9900 | L3445 | 1 |
| 5556 | 201604 | L9900 | L3445 | 21 |
| 5556 | 201605 | L9900 | L3445 | 1 |
| 5556 | 201606 | L9900 | L3445 | 2 |
| 5556 | 201607 | NULL | NULL | NULL |
| 5556 | 201608 | NULL | NULL | NULL |
| 5556 | 201609 | L9900 | L3445 | 1 |
| 5556 | 201610 | NULL | NULL | NULL |
| 5556 | 201611 | NULL | NULL | NULL |
| 5556 | 201612 | L9900 | L3445 | 1 |
| 5556 | 201701 | NULL | NULL | NULL |
| 5556 | 201702 | L9900 | L3445 | 1 |
| 9988 | 201601 | L9900 | L2334 | 27 |
| 9988 | 201602 | L9900 | L3445 | 22 |
| 9988 | 201602 | L9900 | L9987 | 1 |
| 9988 | 201602 | L9900 | L2334 | 2 |
| 9988 | 201603 | L9900 | L4556 | 1 |
| 9988 | 201603 | L9900 | L2334 | 2 |
| 9988 | 201603 | L9900 | L3445 | 21 |
| 9988 | 201603 | L9900 | L9987 | 1 |
| 9988 | 201604 | L9900 | L3445 | 25 |
| 9988 | 201604 | L9900 | L2334 | 1 |
| 9988 | 201604 | L9900 | L4556 | 1 |
| 9988 | 201605 | L9900 | L3445 | 26 |
| 9988 | 201605 | L9900 | L2334 | 1 |
| 9988 | 201606 | L9900 | L3445 | 22 |
| 9988 | 201607 | L9900 | L3445 | 5 |
| 9988 | 201608 | L9900 | L3445 | 6 |
| 9988 | 201609 | L9900 | L3445 | 11 |
| 9988 | 201610 | L9900 | L3445 | 17 |
| 9988 | 201611 | L9900 | L3445 | 9 |
| 9988 | 201612 | L9900 | L2334 | 1 |
| 9988 | 201612 | L9900 | L3445 | 14 |
| 9988 | 201701 | L9900 | L3445 | 14 |
| 9988 | 201701 | L9900 | L2334 | 2 |
| 9988 | 201702 | L9900 | L4556 | 1 |
| 9988 | 201702 | L9900 | L3445 | 9 |
I am trying to calculate the number of times that users had:
I also need to summarize the data on two levels:
Here's the desired outcome (note: not query output, just outcome) that I'm looking for from the data set:
+-------------------+------------+------------------+
| | User level | User-month level |
+-------------------+------------+------------------+
| 1 month | 3/3 | 10/42 |
+-------------------+------------+------------------+
| 2 months in a row | 2/3 | 4/42 |
+-------------------+------------+------------------+
| 3 months in a row | 2/3 | 3/42 |
+-------------------+------------+------------------+
Here is my approach to transforming the data to summarize it correctly, which you can find in this Rextester link: http://rextester.com/DBF40287
Here is the output from these queries, getting closer to a normalized table:
| mem_month | | mem_month_rank |
|--------|------------|------------|---|----------------------------------------|
| user_id| year_month | met_notmet | | user_id| year_month | met_notmet | rnk |
| 1234 | 201601 | 0 | | 1234 | 201601 | 0 | 1 |
| 1234 | 201601 | 0 | | 1234 | 201602 | 0 | 1 |
| 1234 | 201602 | 0 | | 1234 | 201603 | 0 | 1 |
| 1234 | 201603 | 0 | | 1234 | 201604 | 0 | 1 |
| 1234 | 201604 | 0 | | 1234 | 201605 | 0 | 1 |
| 1234 | 201605 | 0 | | 1234 | 201605 | 1 | 2 |
| 1234 | 201605 | 1 | | 1234 | 201606 | 0 | 1 |
| 1234 | 201606 | 0 | | 1234 | 201606 | 1 | 2 |
| 1234 | 201606 | 1 | | 1234 | 201607 | 0 | 1 |
| 1234 | 201606 | 0 | | 1234 | 201607 | 1 | 2 |
| 1234 | 201607 | 1 | | 1234 | 201608 | 0 | 1 |
| 1234 | 201607 | 0 | | 1234 | 201609 | 0 | 1 |
| 1234 | 201607 | 1 | | 1234 | 201610 | 0 | 1 |
| 1234 | 201608 | 0 | | 1234 | 201611 | 0 | 1 |
| 1234 | 201609 | 0 | | 1234 | 201612 | 0 | 1 |
| 1234 | 201609 | 0 | | 1234 | 201701 | 0 | 1 |
| 1234 | 201610 | 0 | | 1234 | 201702 | 0 | 1 |
| 1234 | 201611 | 0 |
| 1234 | 201612 | 0 |
| 1234 | 201701 | 0 |
| 1234 | 201702 | 0 |
(list break)
| user_id | year_month | met_notmet |
|---------|------------|------------|
| 1234 | 201601 | 0 |
| 1234 | 201602 | 0 |
| 1234 | 201603 | 0 |
| 1234 | 201604 | 0 |
| 1234 | 201605 | 1 |
| 1234 | 201606 | 1 |
| 1234 | 201607 | 1 |
| 1234 | 201608 | 0 |
| 1234 | 201609 | 0 |
| 1234 | 201610 | 0 |
| 1234 | 201611 | 0 |
| 1234 | 201612 | 0 |
| 1234 | 201701 | 0 |
| 1234 | 201702 | 0 |
OK - here's where I need your help: I don't know the best way to count the occurrences of sequential offending values. What I PRESUME needs to happen is to have some sort 'reset' when a user has 2 months in a row of offending values.
So for example, when calculating for:
Again, the link is: http://rextester.com/DBF40287
Please let me know if you need any clarification or additional info, and I will revise the post. THANK YOU!
Upvotes: 0
Views: 150
Reputation: 2985
This is a pretty complicated question and not sure I am getting it, but here's my first thoughts on my understanding of the situation.
This query should give the results for summarized data level 1:
WITH mem_months AS (
SELECT user_id, year_month, sum(fact_count) AS visits1
FROM tbl
GROUP BY 1,2
),
mem_consec AS (
SELECT user_id, year_month, visits1, visits2, visits3,
visits1 > 20 AS met1,
-- need case on these to account for nulls in the lag
CASE WHEN visits1 > 20 AND visits2 > 20 THEN TRUE ELSE FALSE END AS met2,
CASE WHEN visits1 > 20 AND visits2 > 20 AND visits3 > 20 THEN TRUE ELSE FALSE END AS met3
FROM (
SELECT user_id, year_month, visits1,
-- Seems like you might want lag instead of lead to the current "month".
-- Where 3 in a row was met, the attributed month
-- would be the 3rd in the sequence.
LAG(visits1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS visits2,
LAG(visits1, 2) OVER (PARTITION BY user_id ORDER BY year_month) AS visits3
FROM mem_months
) t
),
mem_compliance AS (
select user_id, bool_or(met1) as any1, bool_or(met2) as any2, bool_or(met3) as any3
from mem_consec
group by 1
)
select count(case when any1 then 1 else 0 end) / count(*) as user_level_1,
count(case when any2 then 1 else 0 end) / count(*) as user_level_2,
count(case when any3 then 1 else 0 end) / count(*) as user_level_3
from mem_compliance
And, using the same CTEs, this would give summarized data level 2:
select sum(case when met1 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_1,
sum(case when met2 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_2,
sum(case when met3 then 1 else 0 end) / count(distinct(year_month)) as user_month_level_3
from mem_consec
At this point I think I'm getting what your actual question is...in the level_2 and level_3 versions of these the count is increased as consecutive months continue. For each 3-in-a-row there will be two 2-in-a-rows counted. And for four in a row you'll get two 3s and three 2s, for example.
Here's an approach to get non-overlapping sums. Must be a way to simplify it but maybe it's a start. The idea is to look for the start and end of a run of met months, then get the length of each run and divide it by the desired set size (2 or 3). Unmet records, met records in the middle and single met records are discarded.
mem_consec_sums AS (
SELECT user_id, sum(mon2s) AS mon2s, sum(mon3s) AS mon3s
FROM (
SELECT user_id,
((end_idx - start_idx)/2)::INT AS mon2s,
((end_idx - start_idx)/3)::INT AS mon3s
FROM (
SELECT user_id, year_month, met_start, idx AS start_idx,
LEAD(idx, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS end_idx
FROM (
SELECT user_id, year_month, met1, idx,
CASE WHEN met1 AND pre_met IS NULL OR NOT pre_met) THEN TRUE ELSE FALSE END AS met_start,
CASE WHEN met1 AND post_met IS NULL OR NOT post_met) THEN TRUE ELSE FALSE END AS met_end
FROM (
SELECT user_id, year_month, met1,
LAG(met1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS pre_met,
LEAD(met1, 1) OVER (PARTITION BY user_id ORDER BY year_month) AS post_met,
ROW_NUMBER() OVER (PARTITION BY tvid ORDER BY year_month) AS idx
FROM mem_consec
)
) WHERE (met_start OR met_end) AND NOT (met_start AND met_end)
) WHERE met_start
) GROUP BY 1
)
Upvotes: 1