Reputation: 3
I have 3 nodes with 3 shards and 2 replicas on each:
Added also the XML config for the sharding and replicas
<default_cluster>
<shard>
<internal_replication>true</internal_replication>
<replica>
<default_database>shard</default_database>
<host>clickhouse-0</host>
<port>9000</port>
<user>default</user>
<password>default</password>
</replica>
<replica>
<default_database>replica</default_database>
<host>clickhouse-2</host>
<port>9000</port>
<user>default</user>
<password>default</password>
</replica>
</shard>
<shard>
<internal_replication>true</internal_replication>
<replica>
<default_database>shard</default_database>
<host>clickhouse-1</host>
<port>9000</port>
<user>default</user>
<password>default</password>
</replica>
<replica>
<default_database>replica</default_database>
<host>clickhouse-0</host>
<port>9000</port>
<user>default</user>
<password>default</password>
</replica>
</shard>
<shard>
<internal_replication>true</internal_replication>
<replica>
<default_database>shard</default_database>
<host>clickhouse-2</host>
<port>9000</port>
<user>default</user>
<password>default</password>
</replica>
<replica>
<default_database>replica</default_database>
<host>clickhouse-1</host>
<port>9000</port>
<user>default</user>
<password>default</password>
</replica>
</shard>
</default_cluster>
I am doing the following example:
create database test on cluster default_cluster;
CREATE TABLE test.test_distributed_order_local on cluster default_cluster
(
id integer,
test_column String
)
ENGINE = ReplicatedMergeTree('/default_cluster/test/tables/test_distributed_order_local/{shard}', '{replica}')
PRIMARY KEY id
ORDER BY id;
CREATE TABLE test.test_distributed_order on cluster default_cluster as test.test_distributed_order_local
ENGINE = Distributed(default_cluster, test, test_distributed_order_local, id);
insert into test.test_distributed_order values (1, 'test1');
insert into test.test_distributed_order values (2, 'test2');
insert into test.test_distributed_order values (3, 'test3');
The results are not the same, and they contain duplications: Eg
Result 1:
Result 2:
What am I missing?
I expect to not have duplicated rows in the select
Upvotes: 0
Views: 1091
Reputation: 111
I think this post probably sums up what you're trying to achieve - https://altinity.com/blog/2018/5/10/circular-replication-cluster-topology-in-clickhouse It's a little old but the principle applies - For Clickhouse not a topology that's recommended.
Consider this simplified example:
<shard>
// These two are replicas of each other
<replica>
<host>cluster_node_0</host>
</replica>
**<replica>
<host>cluster_node_2</host>
</replica>**
</shard>
<shard>
<replica>
<host>cluster_node_1</host>
</replica>
<replica>
<host>cluster_node_0</host>
</replica>
</shard>
<shard>
**<replica>
<host>cluster_node_2</host>
</replica>**
<replica>
<host>cluster_node_1</host>
</replica>
</shard>
Let's suppose data is written into the first shard on node cluster_node_0
. It will then be replicated to the shard on cluster_node_2
- as the zookeeper path is the same.
Now for the issue. You have also defined the 3rd shard on cluster_node_2
. When you create this table, it will physically contain data from 2 shards - the 1st and 3rd - I've attempted to highlight with **
.
When a query comes in, it will be sent to each shard. The challenge is each local table will respond with results from both shards - hence you get duplicates.
Generally, avoid more than one shard on a host - the blog explains how you can achieve more than one buts its not recommended or ever need.
Upvotes: 2
Reputation: 2450
ClickHouse show duplicates cause you use the same hosts in multiple shards
During execution of SELECT your query, it rewrites and execute in one replica in each shard.
Because same replica presents in different shards and query run twice.
Usually shard means data is not intersected between other shards
If you want a cluster for 3 shards and 2 replicas in each shard You need 6 different replicas clickhouse-0..5
Upvotes: 1