Clickhouse shows duplicates data in distributed table

Question

I have 3 nodes with 3 shards and 2 replicas on each:

CLickhouse cluster settings:

Added also the XML config for the sharding and replicas


   
      true
      
         shard
         clickhouse-0
         9000
         default
         default
      
      
         replica
         clickhouse-2
         9000
         default
         default
      
   
   
      true
      
         shard
         clickhouse-1
         9000
         default
         default
      
      
         replica
         clickhouse-0
         9000
         default
         default
      
   
   
      true
      
         shard
         clickhouse-2
         9000
         default
         default
      
      
         replica
         clickhouse-1
         9000
         default
         default

I am doing the following example:

create database test on cluster default_cluster;

CREATE TABLE  test.test_distributed_order_local on cluster default_cluster
(
id integer,
test_column String
) 
ENGINE = ReplicatedMergeTree('/default_cluster/test/tables/test_distributed_order_local/{shard}', '{replica}') 
PRIMARY KEY id
ORDER BY id;

CREATE TABLE  test.test_distributed_order on cluster default_cluster as test.test_distributed_order_local
ENGINE = Distributed(default_cluster, test, test_distributed_order_local, id);

insert into test.test_distributed_order values (1, 'test1');
insert into test.test_distributed_order values (2, 'test2');
insert into test.test_distributed_order values (3, 'test3');

The results are not the same, and they contain duplications: Eg

Result 1:

Result 2:

What am I missing?

I expect to not have duplicated rows in the select

Dale McDiarmid · Accepted Answer

I think this post probably sums up what you're trying to achieve - https://altinity.com/blog/2018/5/10/circular-replication-cluster-topology-in-clickhouse It's a little old but the principle applies - For Clickhouse not a topology that's recommended.

Consider this simplified example:


    // These two are replicas of each other
    
        cluster_node_0
    
    **
        cluster_node_2
    **


    
        cluster_node_1
    
    
        cluster_node_0
    


    **
        cluster_node_2
    **
    
        cluster_node_1

Let's suppose data is written into the first shard on node cluster_node_0. It will then be replicated to the shard on cluster_node_2 - as the zookeeper path is the same.

Now for the issue. You have also defined the 3rd shard on cluster_node_2. When you create this table, it will physically contain data from 2 shards - the 1st and 3rd - I've attempted to highlight with **.

When a query comes in, it will be sent to each shard. The challenge is each local table will respond with results from both shards - hence you get duplicates.

Generally, avoid more than one shard on a host - the blog explains how you can achieve more than one buts its not recommended or ever need.

Clickhouse shows duplicates data in distributed table

Answers (2)

Related Questions