Oleg Khamov
Oleg Khamov

Reputation: 271

How make JOIN table in ClickHouse DB faster?

I have two tables

event

params

This table have relation 1-1 by id. If execute query

select count(*)
from
(select id from event where os like 'Android%')
inner join
(select id from params where sx >= 1024)
using id

they very slow

But if all data contains in one table

select count(*) from event where sx >= 1024 and os like 'Android%'

Query executed very fast.

Please, tell me how use join in ClickHouse DB effective? Keep all data in one table is not convenient.

Upvotes: 8

Views: 16806

Answers (3)

Gowtham Sankaran
Gowtham Sankaran

Reputation: 3

Instead of using inner join, try using LEFT SEMI JOIN. This will not produce cartesian product. When a record on the left found a matching row on the right, it stops searching remaining rows on the right. You can use this join if you are sure that there will be only one matching record on the right.

Upvotes: 0

Evgeniy Skomorokhov
Evgeniy Skomorokhov

Reputation: 71

I experience same problems with joining 2 huge distributed tables. There are 2 main problems

  • durion of executing
  • limits by needed memory for a query.

What works for me is sharding calculation query by id%N using subqueries and union all results then.

SELECT count(*)
FROM
(
    SELECT 1
    FROM event
    WHERE id%2=0 AND id IN
    (
        SELECT id
        FROM params
        WHERE id % 2 = 0 AND sx >= 1024
    )
    UNION ALL
    SELECT 2
    FROM event
    WHERE id % 2 = 1 AND id IN
    (
        SELECT id
        FROM params
        WHERE id % 2 = 1 AND sx >= 1024
    )
)

You can change id%N(2 in the example) until you get needed performance. Need to replace IN to GLOBAL IN if you use distributed engines for tables.

Upvotes: 5

uYSIZfoz
uYSIZfoz

Reputation: 911

You may rewrite query like this:

select count(*)
from event 
where os like 'Android%' 
AND id IN (select id from params where sx >= 1024)

Upvotes: 3

Related Questions