Reputation: 1601
I'm trying to run the following query in Snowflake but it fails with Unsupported subquery type cannot be evaluated
. The query is valid in other SQL engines such as Postgresql and Presto so it looks like Snowflake doesn't support this type of query.
SELECT first_action.date,
DATEDIFF('day', first_action.date, returning_action.date) - 1 as diff,
APPROXIMATE_SIMILARITY(select MINHASH_COMBINE(value) from (select first_action.user_id_set as value union all select returning_action.user_id_set)) _set
FROM (select cast(_time as date) as date, minhash(100, _user) as user_id_set from events group by 1) as first_action
JOIN (select cast(_time as date) as date, minhash(100, _user) as user_id_set from events group by 1) as returning_action
ON (first_action.date < returning_action.date AND dateadd(day, 14, first_action.date) >= returning_action.date)
group by 1,2
The query is a typical cohort query that uses MinHash. We calculate the MinHash for each day, join the next 14 days and merge the results and finally calculate the final result.
Since MinHash doesn't have a linear MINHASH_COMBINE function, we had to use a subquery with UNION all in order to make it work but that also didn't work out. :/
We're stuck right now since we don't really know any workaround. Any help is appreciated!
Upvotes: 0
Views: 1434
Reputation: 26078
So the major trick is that all these MINHASH_ functions window functions so you need to build a grouping key on the data.
So using this as my example data:
CREATE TABLE events(_user number, _time timestamp_ntz);
INSERT INTO events VALUES (1,'2019-03-01'),(1,'2019-03-05'),(1,'2019-03-10'),
(1,'2019-03-14'),(1,'2019-03-15'),(1,'2019-03-16'),
(2,'2019-03-01'),(2,'2019-03-05'),(2,'2019-03-11'),
(2,'2019-03-15');
The first set is to get the 14 days of data for the COMBINE
WITH actions AS (
SELECT _time::date as date
,dateadd(day, 14, date) as date14
,minhash(100, _user) as user_id_set
FROM events
GROUP BY 1
)
SELECT fa.date
,ARRAY_AGG(ra.date) WITHIN GROUP (ORDER BY ra.date)
,MINHASH_COMBINE(ra.user_id_set) AS sets
FROM actions AS fa
JOIN actions AS ra
ON (fa.date <= ra.date AND fa.date14 > ra.date)
GROUP BY 1
ORDER BY 1;
this is similar to your code, but here I include on the RA the same day as FA. so I can group by FA.date, but have FA's data included. On the date range I was not sure if you are want 14 days later or 14 days worth of date. I assume the later, thus changed the end range termination.
Now we have the combined next 14 days of data for each day, we want to get the pairs of (in my code I do not put a max days of compare but just include all pairs). Now again the APPROXIMATE_SIMILARITY is a window function, so I build an Array which I will rip straight apart again, thus pivoting the data, which is what you're trying to do via the union all (this can bee seen in the pairs and unrolled CTE's)
WITH actions AS (
SELECT _time::date AS date
,dateadd(day, 14, date) as date14
,minhash(100, _user) AS user_id_set
FROM events
GROUP BY 1
), combined AS (
SELECT fa.date
,MINHASH_COMBINE(ra.user_id_set) AS sets
FROM actions AS fa
JOIN actions AS ra
ON fa.date <= ra.date AND fa.date14 > ra.date
GROUP BY 1
), pairs AS (
SELECT fa.date
,DATEDIFF('day', fa.date, ra.date) AS diff
,ARRAY_CONSTRUCT(fa.sets,ra.sets) AS comp_set
FROM combined AS fa
JOIN combined AS ra
ON fa.date < ra.date
), unrolled AS (
SELECT date
,diff
,f.value AS sets
FROM pairs p,
LATERAL FLATTEN(input => p.comp_set) f
)
SELECT date
,diff
,APPROXIMATE_SIMILARITY(sets)
FROM unrolled
GROUP BY 1,2
ORDER BY 1,2;
and thus you get results for all days
DATE DIFF APPROXIMATE_SIMILARITY(SETS)
2019-03-01 4 1
2019-03-01 9 1
2019-03-01 10 1
2019-03-01 13 1
2019-03-01 14 1
2019-03-01 15 0.51
2019-03-05 5 1
2019-03-05 6 1
2019-03-05 9 1
2019-03-05 10 1
2019-03-05 11 0.51
2019-03-10 1 1
2019-03-10 4 1
2019-03-10 5 1
2019-03-10 6 0.51
2019-03-11 3 1
2019-03-11 4 1
2019-03-11 5 0.51
2019-03-14 1 1
2019-03-14 2 0.51
2019-03-15 1 0.51
Upvotes: 1
Reputation: 1883
Not sure if this will work, tried separating the first_action
and returning_action
using WITH
statement:
WITH
first_action as (
SELECT
TRY_CAST(_time AS DATE) as date,
MINHASH(100, _user) as user_id_set
FROM events
GROUP BY 1
),
returning_action as (
SELECT
TRY_CAST(_time AS DATE) as date,
MINHASH(100, _user) as user_id_set
FROM events
GROUP BY 1
),
SELECT
first_action.date,
DATEDIFF('day', fa.date, ra.date) - 1 as diff,
APPROXIMATE_SIMILARITY(
SELECT MINHASH_COMBINE(value)
FROM (
SELECT fa.user_id_set AS value FROM first_action fa
UNION ALL
SELECT ra.user_id_set AS value FROM returning_action ra
)
) _set
FROM first_action fa
JOIN returning_action ra
ON (fa.date < ra.date AND DATEADD(day, 14, fa.date) >= ra.date)
GROUP BY 1,2
Upvotes: 1