Dmitry Bevz
Dmitry Bevz

Reputation: 123

Clickhouse - join on string columns

I got String column uin in several tables, how do I can effectively join on uin these tables?

In Vertica database we use hash(uin) to transform string column into hash with Int data type - it significantly boosts efficiency in joins - could you recommend something like this? I tried CRC32(s) but it seems to work wrong.

Upvotes: 2

Views: 3510

Answers (1)

vladimir
vladimir

Reputation: 15218

At this moment the CH not very good cope with multi-joins queries (DB star-schema) and the query optimizer not good enough to rely on it completely.

So it needs to explicitly say how to 'execute' a query by using subqueries instead of joins.

Let's emulate your query:

SELECT table_01.number AS r
FROM numbers(87654321) AS table_01
  INNER JOIN numbers(7654321) AS table_02 ON (table_01.number = table_02.number)
  INNER JOIN numbers(654321) AS table_03 ON (table_02.number = table_03.number)
  INNER JOIN numbers(54321) AS table_04 ON (table_03.number = table_04.number)
ORDER BY r DESC
LIMIT 8;
/*
┌─────r─┐
│ 54320 │
│ 54319 │
│ 54318 │
│ 54317 │
│ 54316 │
│ 54315 │
│ 54314 │
│ 54313 │
└───────┘

8 rows in set. Elapsed: 4.244 sec. Processed 96.06 million rows, 768.52 MB (22.64 million rows/s., 181.10 MB/s.) 
*/

On my PC it takes ~4 secs. Let's rewrite it using subqueries to significantly speed it up.

SELECT number AS r
FROM numbers(87654321)
WHERE number IN (
  SELECT number 
  FROM numbers(7654321)
  WHERE number IN (
    SELECT number 
    FROM numbers(654321)
    WHERE number IN (
      SELECT number 
      FROM numbers(54321)
    )
  )
)
ORDER BY r DESC
LIMIT 8;
/*
┌─────r─┐
│ 54320 │
│ 54319 │
│ 54318 │
│ 54317 │
│ 54316 │
│ 54315 │
│ 54314 │
│ 54313 │
└───────┘

8 rows in set. Elapsed: 0.411 sec. Processed 96.06 million rows, 768.52 MB (233.50 million rows/s., 1.87 GB/s.) 
*/

There are other ways to optimize JOIN:

  • use External dictionary to get rid of join on 'small'-table

  • use Join table engine

  • use ANY-strictness

  • use specific settings like join_algorithm, partial_merge_join_optimizations etc


Some useful refs:

Altinity webinar: Tips and tricks every ClickHouse user should know

Altinity webinar: Secrets of ClickHouse Query Performance


Answer update:

To less storage consumption for String-column consider changing column type to LowCardinality (link 2) that significantly decrease the size of a column with many duplicated elements.

Use this query to get the size of columns:

SELECT
    name AS column_name,
    formatReadableSize(data_compressed_bytes) AS data_size,
    formatReadableSize(marks_bytes) AS index_size,
    type,
    compression_codec
FROM system.columns
WHERE database = 'db_name' AND table = 'table_name'
ORDER BY data_compressed_bytes DESC

To get a numeric representation of a string need to use one of hash-functions.

SELECT 'jsfhuhsdf', xxHash32('jsfhuhsdf'), cityHash64('jsfhuhsdf');

Upvotes: 3

Related Questions