Reputation: 271
I have two tables
event
params
This table have relation 1-1 by id. If execute query
select count(*)
from
(select id from event where os like 'Android%')
inner join
(select id from params where sx >= 1024)
using id
they very slow
But if all data contains in one table
select count(*) from event where sx >= 1024 and os like 'Android%'
Query executed very fast.
Please, tell me how use join in ClickHouse DB effective? Keep all data in one table is not convenient.
Upvotes: 8
Views: 16806
Reputation: 3
Instead of using inner join, try using LEFT SEMI JOIN
. This will not produce cartesian product. When a record on the left found a matching row on the right, it stops searching remaining rows on the right. You can use this join if you are sure that there will be only one matching record on the right.
Upvotes: 0
Reputation: 71
I experience same problems with joining 2 huge distributed tables. There are 2 main problems
What works for me is sharding calculation query by id%N using subqueries and union all results then.
SELECT count(*)
FROM
(
SELECT 1
FROM event
WHERE id%2=0 AND id IN
(
SELECT id
FROM params
WHERE id % 2 = 0 AND sx >= 1024
)
UNION ALL
SELECT 2
FROM event
WHERE id % 2 = 1 AND id IN
(
SELECT id
FROM params
WHERE id % 2 = 1 AND sx >= 1024
)
)
You can change id%N(2 in the example) until you get needed performance. Need to replace IN to GLOBAL IN if you use distributed engines for tables.
Upvotes: 5
Reputation: 911
You may rewrite query like this:
select count(*)
from event
where os like 'Android%'
AND id IN (select id from params where sx >= 1024)
Upvotes: 3