burak emre
burak emre

Reputation: 1601

Correlated subqueries in Snowflake doesn't work

I'm trying to run the following query in Snowflake but it fails with Unsupported subquery type cannot be evaluated. The query is valid in other SQL engines such as Postgresql and Presto so it looks like Snowflake doesn't support this type of query.

SELECT first_action.date, 
  DATEDIFF('day', first_action.date, returning_action.date) - 1 as diff, 
  APPROXIMATE_SIMILARITY(select MINHASH_COMBINE(value) from (select first_action.user_id_set as value union all select returning_action.user_id_set)) _set
  FROM (select cast(_time as date) as date, minhash(100, _user) as user_id_set from events group by 1) as first_action
  JOIN (select cast(_time as date) as date, minhash(100, _user) as user_id_set from events group by 1) as returning_action 
ON (first_action.date < returning_action.date AND dateadd(day, 14, first_action.date) >= returning_action.date)
group by 1,2

The query is a typical cohort query that uses MinHash. We calculate the MinHash for each day, join the next 14 days and merge the results and finally calculate the final result.

Since MinHash doesn't have a linear MINHASH_COMBINE function, we had to use a subquery with UNION all in order to make it work but that also didn't work out. :/

We're stuck right now since we don't really know any workaround. Any help is appreciated!

Upvotes: 0

Views: 1434

Answers (2)

Simeon Pilgrim
Simeon Pilgrim

Reputation: 26078

So the major trick is that all these MINHASH_ functions window functions so you need to build a grouping key on the data.

So using this as my example data:

CREATE TABLE events(_user number, _time timestamp_ntz);
INSERT INTO events VALUES (1,'2019-03-01'),(1,'2019-03-05'),(1,'2019-03-10'),
    (1,'2019-03-14'),(1,'2019-03-15'),(1,'2019-03-16'),
    (2,'2019-03-01'),(2,'2019-03-05'),(2,'2019-03-11'),
    (2,'2019-03-15');

The first set is to get the 14 days of data for the COMBINE

WITH actions AS (
    SELECT _time::date as date
        ,dateadd(day, 14, date) as date14
        ,minhash(100, _user) as user_id_set
    FROM events
    GROUP BY 1
)
SELECT fa.date
    ,ARRAY_AGG(ra.date) WITHIN GROUP (ORDER BY ra.date)
    ,MINHASH_COMBINE(ra.user_id_set) AS sets
FROM actions AS fa
JOIN actions AS ra 
    ON (fa.date <= ra.date AND fa.date14 > ra.date) 
GROUP BY 1
ORDER BY 1; 

this is similar to your code, but here I include on the RA the same day as FA. so I can group by FA.date, but have FA's data included. On the date range I was not sure if you are want 14 days later or 14 days worth of date. I assume the later, thus changed the end range termination.

Now we have the combined next 14 days of data for each day, we want to get the pairs of (in my code I do not put a max days of compare but just include all pairs). Now again the APPROXIMATE_SIMILARITY is a window function, so I build an Array which I will rip straight apart again, thus pivoting the data, which is what you're trying to do via the union all (this can bee seen in the pairs and unrolled CTE's)

WITH actions AS (
    SELECT _time::date AS date
        ,dateadd(day, 14, date) as date14
        ,minhash(100, _user) AS user_id_set
    FROM events
    GROUP BY 1
), combined AS (
    SELECT fa.date
        ,MINHASH_COMBINE(ra.user_id_set) AS sets
    FROM actions AS fa
    JOIN actions AS ra 
        ON fa.date <= ra.date AND fa.date14 > ra.date
    GROUP BY 1
), pairs AS (
    SELECT fa.date
        ,DATEDIFF('day', fa.date, ra.date) AS diff
        ,ARRAY_CONSTRUCT(fa.sets,ra.sets) AS comp_set
    FROM combined AS fa
    JOIN combined AS ra 
        ON fa.date < ra.date
), unrolled AS (
    SELECT date
        ,diff
        ,f.value AS sets
    FROM pairs p,
    LATERAL FLATTEN(input => p.comp_set) f
)
SELECT date
    ,diff
    ,APPROXIMATE_SIMILARITY(sets)
FROM unrolled
GROUP BY 1,2
ORDER BY 1,2;

and thus you get results for all days

DATE    DIFF    APPROXIMATE_SIMILARITY(SETS)
2019-03-01  4   1
2019-03-01  9   1
2019-03-01  10  1
2019-03-01  13  1
2019-03-01  14  1
2019-03-01  15  0.51
2019-03-05  5   1
2019-03-05  6   1
2019-03-05  9   1
2019-03-05  10  1
2019-03-05  11  0.51
2019-03-10  1   1
2019-03-10  4   1
2019-03-10  5   1
2019-03-10  6   0.51
2019-03-11  3   1
2019-03-11  4   1
2019-03-11  5   0.51
2019-03-14  1   1
2019-03-14  2   0.51
2019-03-15  1   0.51

Upvotes: 1

Nilesh Ingle
Nilesh Ingle

Reputation: 1883

Not sure if this will work, tried separating the first_action and returning_action using WITH statement:

WITH 
first_action as (
    SELECT 
        TRY_CAST(_time AS DATE) as date, 
        MINHASH(100, _user) as user_id_set 
    FROM events 
    GROUP BY 1
),
returning_action as (
    SELECT 
        TRY_CAST(_time AS DATE) as date, 
        MINHASH(100, _user) as user_id_set 
    FROM events 
    GROUP BY 1
),
SELECT 
  first_action.date, 
  DATEDIFF('day', fa.date, ra.date) - 1 as diff, 
  APPROXIMATE_SIMILARITY(
      SELECT MINHASH_COMBINE(value) 
      FROM (
          SELECT fa.user_id_set AS value FROM first_action fa
          UNION ALL  
          SELECT ra.user_id_set AS value FROM returning_action ra
      )
  ) _set
FROM first_action fa
JOIN returning_action ra
ON (fa.date < ra.date AND DATEADD(day, 14, fa.date) >= ra.date)
GROUP BY 1,2

Upvotes: 1

Related Questions