Marius Butan
Marius Butan

Reputation: 3

Clickhouse shows duplicates data in distributed table

I have 3 nodes with 3 shards and 2 replicas on each:

CLickhouse cluster settings: CLickhouse cluster settings

Added also the XML config for the sharding and replicas

<default_cluster>
   <shard>
      <internal_replication>true</internal_replication>
      <replica>
         <default_database>shard</default_database>
         <host>clickhouse-0</host>
         <port>9000</port>
         <user>default</user>
         <password>default</password>
      </replica>
      <replica>
         <default_database>replica</default_database>
         <host>clickhouse-2</host>
         <port>9000</port>
         <user>default</user>
         <password>default</password>
      </replica>
   </shard>
   <shard>
      <internal_replication>true</internal_replication>
      <replica>
         <default_database>shard</default_database>
         <host>clickhouse-1</host>
         <port>9000</port>
         <user>default</user>
         <password>default</password>
      </replica>
      <replica>
         <default_database>replica</default_database>
         <host>clickhouse-0</host>
         <port>9000</port>
         <user>default</user>
         <password>default</password>
      </replica>
   </shard>
   <shard>
      <internal_replication>true</internal_replication>
      <replica>
         <default_database>shard</default_database>
         <host>clickhouse-2</host>
         <port>9000</port>
         <user>default</user>
         <password>default</password>
      </replica>
      <replica>
         <default_database>replica</default_database>
         <host>clickhouse-1</host>
         <port>9000</port>
         <user>default</user>
         <password>default</password>
      </replica>
   </shard>
</default_cluster>

I am doing the following example:

create database test on cluster default_cluster;

CREATE TABLE  test.test_distributed_order_local on cluster default_cluster
(
id integer,
test_column String
) 
ENGINE = ReplicatedMergeTree('/default_cluster/test/tables/test_distributed_order_local/{shard}', '{replica}') 
PRIMARY KEY id
ORDER BY id;

CREATE TABLE  test.test_distributed_order on cluster default_cluster as test.test_distributed_order_local
ENGINE = Distributed(default_cluster, test, test_distributed_order_local, id);

insert into test.test_distributed_order values (1, 'test1');
insert into test.test_distributed_order values (2, 'test2');
insert into test.test_distributed_order values (3, 'test3');

The results are not the same, and they contain duplications: Eg

Result 1:

enter image description here

Result 2:

enter image description here

What am I missing?

I expect to not have duplicated rows in the select

Upvotes: 0

Views: 1091

Answers (2)

Dale McDiarmid
Dale McDiarmid

Reputation: 111

I think this post probably sums up what you're trying to achieve - https://altinity.com/blog/2018/5/10/circular-replication-cluster-topology-in-clickhouse It's a little old but the principle applies - For Clickhouse not a topology that's recommended.

Consider this simplified example:

<shard>
    // These two are replicas of each other
    <replica>
        <host>cluster_node_0</host>
    </replica>
    **<replica>
        <host>cluster_node_2</host>
    </replica>**
</shard>
<shard>
    <replica>
        <host>cluster_node_1</host>
    </replica>
    <replica>
        <host>cluster_node_0</host>
    </replica>
</shard>
<shard>
    **<replica>
        <host>cluster_node_2</host>
    </replica>**
    <replica>
        <host>cluster_node_1</host>
    </replica>
</shard>

Let's suppose data is written into the first shard on node cluster_node_0. It will then be replicated to the shard on cluster_node_2 - as the zookeeper path is the same.

Now for the issue. You have also defined the 3rd shard on cluster_node_2. When you create this table, it will physically contain data from 2 shards - the 1st and 3rd - I've attempted to highlight with **.

When a query comes in, it will be sent to each shard. The challenge is each local table will respond with results from both shards - hence you get duplicates.

Generally, avoid more than one shard on a host - the blog explains how you can achieve more than one buts its not recommended or ever need.

Upvotes: 2

Slach
Slach

Reputation: 2450

ClickHouse show duplicates cause you use the same hosts in multiple shards

During execution of SELECT your query, it rewrites and execute in one replica in each shard.

Because same replica presents in different shards and query run twice.

Usually shard means data is not intersected between other shards

If you want a cluster for 3 shards and 2 replicas in each shard You need 6 different replicas clickhouse-0..5

Upvotes: 1

Related Questions