Jivan
Jivan

Reputation: 23078

Multiple columns returned by subquery are not yet supported

Given the following table:

transaction_id    user_id    product_id
             1         10            AA
             2         10            CC
             3         10            AA
             4         10            CC
             5         20            AA
             6         20            BB
             7         20            BB
             8         30            BB
             9         30            BB
            10         30            BB
            11         40            CC
            12         40            AA
            13         40            CC
            14         40            BB
            15         40            BB
            16         50            EE
            17         60            EE

Using the following query:

select
  product_id,
  count(distinct user_id) as count_repeat_users
from
  product_usage_log
where
  (product_id, user_id) in (
    select
      product_id,
      user_id
    from (
      select
        product_id,
        user_id,
        count (distinct transaction_id) as transactions
      from
        product_usage_log
      group by
        product_id,
        user_id
    ) t
    where transactions >= 2
  )
group by product_id

Returns the following result:

product_id    count_repeat_users
        AA                     1
        BB                     3
        CC                     2

(note that 'EE' doesn't appear, as expected)

The purpose of the query above is to return, for every product, the count of users having made at least two transactions with this product. The above query satisfies this, however it is using a multiple-column subquery with an IN predicate. This capability is not available (yet, although it's been talked about for the past two years with no success) in Presto.

How to replicate the above result without the possibility to use where (product_id, user_id) in (...)?

Note: I've tried to flatten the where condition into two successive ones, the problem being that now the condition on ALL columns being matched for EVERY row turns into a condition on ALL columns being matched for ANY row. In other words, now it will match a user-product couple as soon as the product is in the subtable, and the user is in the subtable, but not necessarily in the same row.

Another way to phrase the question is therefore: in Presto, how to make a condition based on a couple of values being present on the SAME row in a subquery?

Upvotes: 1

Views: 5416

Answers (3)

forpas
forpas

Reputation: 164139

I don't see the reason (at least from your sample data) why you use that WHERE...IN....
You can get what you need without it:

select t.product_id, count(*) count_repeat_users
from (
  select user_id, product_id
  from product_usage_log  
  group by user_id, product_id
  having count(transaction_id) > 1
) as t
group by product_id  

See the demo (for SQL Server but since the code is standard SQL it should work for Presto too).
Results:

product_id | count_repeat_users
AA         |                  1
BB         |                  3
CC         |                  2

Upvotes: 1

Piotr Findeisen
Piotr Findeisen

Reputation: 20770

How to replicate the above result without the possibility to use where (product_id, user_id) in (...)?

This is directly available in Presto. You just need to wrap values produced by subquery in anonymous ROWs, so that they are, in fact, single-column.

Testing with Presto 318:

presto:default> SELECT
             ->     x, y
             -> FROM (VALUES (1,2), (3,4), (5,6)) t(x, y)
             -> WHERE (x, y) IN (
             ->     SELECT (z, w)
             ->     FROM (VALUES (1,1), (3,4), (5,5)) u(z, w)
             -> );
 x | y
---+---
 3 | 4
(1 row)

Another example with tpch.tiny schema:

presto:tiny> SELECT orderkey
          -> FROM orders
          -> JOIN customer ON orders.custkey = customer.custkey
          -> WHERE (orderkey, nationkey) IN (
          ->     SELECT (suppkey, nationkey) FROM supplier
          -> );
 orderkey
----------
        3
(1 row)

Note: I'm not entirely sure this works correct with respect to NULLs. I guess this is not a problem in your case and your subquery does not produce NULLs for product_id, user_id.

Upvotes: 3

Gordon Linoff
Gordon Linoff

Reputation: 1270391

You can use window functions. I think this will work:

select product_id, count(distinct user_id)
from (select pul.*,
             count(*) over (partition by product_id, user_id) as cnt
      from product_usage_log pul
     ) pul
where cnt >= 2
group by product_id;

Based on your sample data, I am guessing that transaction_id is unique. If not, then use count(distinct transaction_id) in the subquery.

Upvotes: 1

Related Questions