Reputation: 23078
Given the following table:
transaction_id user_id product_id
1 10 AA
2 10 CC
3 10 AA
4 10 CC
5 20 AA
6 20 BB
7 20 BB
8 30 BB
9 30 BB
10 30 BB
11 40 CC
12 40 AA
13 40 CC
14 40 BB
15 40 BB
16 50 EE
17 60 EE
Using the following query:
select
product_id,
count(distinct user_id) as count_repeat_users
from
product_usage_log
where
(product_id, user_id) in (
select
product_id,
user_id
from (
select
product_id,
user_id,
count (distinct transaction_id) as transactions
from
product_usage_log
group by
product_id,
user_id
) t
where transactions >= 2
)
group by product_id
Returns the following result:
product_id count_repeat_users
AA 1
BB 3
CC 2
(note that 'EE' doesn't appear, as expected)
The purpose of the query above is to return, for every product, the count of users having made at least two transactions with this product. The above query satisfies this, however it is using a multiple-column subquery with an IN
predicate. This capability is not available (yet, although it's been talked about for the past two years with no success) in Presto.
How to replicate the above result without the possibility to use where (product_id, user_id) in (...)
?
Note: I've tried to flatten the where
condition into two successive ones, the problem being that now the condition on ALL columns being matched for EVERY row turns into a condition on ALL columns being matched for ANY row. In other words, now it will match a user-product couple as soon as the product is in the subtable, and the user is in the subtable, but not necessarily in the same row.
Another way to phrase the question is therefore: in Presto, how to make a condition based on a couple of values being present on the SAME row in a subquery?
Upvotes: 1
Views: 5416
Reputation: 164139
I don't see the reason (at least from your sample data) why you use that WHERE...IN...
.
You can get what you need without it:
select t.product_id, count(*) count_repeat_users
from (
select user_id, product_id
from product_usage_log
group by user_id, product_id
having count(transaction_id) > 1
) as t
group by product_id
See the demo (for SQL Server but since the code is standard SQL it should work for Presto too).
Results:
product_id | count_repeat_users
AA | 1
BB | 3
CC | 2
Upvotes: 1
Reputation: 20770
How to replicate the above result without the possibility to use where
(product_id, user_id) in (...)
?
This is directly available in Presto.
You just need to wrap values produced by subquery in anonymous ROW
s, so that they are, in fact, single-column.
Testing with Presto 318:
presto:default> SELECT
-> x, y
-> FROM (VALUES (1,2), (3,4), (5,6)) t(x, y)
-> WHERE (x, y) IN (
-> SELECT (z, w)
-> FROM (VALUES (1,1), (3,4), (5,5)) u(z, w)
-> );
x | y
---+---
3 | 4
(1 row)
Another example with tpch.tiny
schema:
presto:tiny> SELECT orderkey
-> FROM orders
-> JOIN customer ON orders.custkey = customer.custkey
-> WHERE (orderkey, nationkey) IN (
-> SELECT (suppkey, nationkey) FROM supplier
-> );
orderkey
----------
3
(1 row)
Note: I'm not entirely sure this works correct with respect to NULL
s. I guess this is not a problem in your case and your subquery does not produce NULL
s for product_id, user_id
.
Upvotes: 3
Reputation: 1270391
You can use window functions. I think this will work:
select product_id, count(distinct user_id)
from (select pul.*,
count(*) over (partition by product_id, user_id) as cnt
from product_usage_log pul
) pul
where cnt >= 2
group by product_id;
Based on your sample data, I am guessing that transaction_id
is unique. If not, then use count(distinct transaction_id)
in the subquery.
Upvotes: 1