We're setting up a ClickHouse cluster (version 20.1.8.41) on 7 nodes, using a "circular replica" pattern (i.e. 7 shards * 2 replicas, with the two replicas of each shard on different nodes), plus a separate ZooKeeper cluster.
The /etc/hosts files are all correctly configured, and the cluster started successfully.
However, when we execute distributed DDL queries, they all hang and eventually time out, e.g.:
:) create database ods on cluster sht_ck_cluster_1;
CREATE DATABASE ods ON CLUSTER sht_ck_cluster_1
→ Progress: 0.00 rows, 0.00 B (0.00 rows/s., 0.00 B/s.)
Received exception from server (version 20.1.8):
Code: 159. DB::Exception: Received from localhost:9002. DB::Exception: Watching task /clickhouse/task_queue/ddl/query-0000000007 is executing longer than distributed_ddl_task_timeout (=180) seconds. There are 14 unfinished hosts (0 of them are currently active), they are going to execute the query in background.
0 rows in set. Elapsed: 180.589 sec.
The clickhouse-server.log on the node where I ran the query shows the following:
2020.04.23 00:33:33.327414 [ 32 ] {c3c49bd3-333d-4fca-aa2f-2520f5c0cb9f} <Error> executeQuery: Code: 159, e.displayText() = DB::Exception: Watching task /clickhouse/task_queue/ddl/query-0000000007 is executing longer than distributed_ddl_task_timeout (=180) seconds. There are 14 unfinished hosts (0 of them are currently active), they are going to execute the query in background (version 20.1.8.41) (from 127.0.0.1:42198) (in query: CREATE DATABASE ods ON CLUSTER sht_ck_cluster_1), Stack trace (when copying this message, always include the lines below):
0. 0xb2087bc Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) in /usr/bin/clickhouse
1. 0x4d8e3c9 DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) in /usr/bin/clickhouse
2. 0x84846b9 DB::DDLQueryStatusInputStream::readImpl() in /usr/bin/clickhouse
3. 0x8345e3f DB::IBlockInputStream::read() in /usr/bin/clickhouse
4. 0x833d541 DB::AsynchronousBlockInputStream::calculate() in /usr/bin/clickhouse
5. 0x833e113 ? in /usr/bin/clickhouse
6. 0x4dc8b7a ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) in /usr/bin/clickhouse
7. 0x4dc9790 ThreadFromGlobalPool::ThreadFromGlobalPool<void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()>(void&&, void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()&&...)::'lambda'()::operator()() const in /usr/bin/clickhouse
8. 0x4dc7eca ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) in /usr/bin/clickhouse
9. 0x4dc69dc ? in /usr/bin/clickhouse
10. 0x7e25 start_thread in /usr/lib64/libpthread-2.17.so
11. 0xfebad clone in /usr/lib64/libc-2.17.so
What seems odd is that the clickhouse-server.log on the other nodes shows:
2020.04.23 00:30:32.744984 [ 23 ] {} <Debug> DDLWorker: Processing tasks
2020.04.23 00:30:32.744992 [ 24 ] {} <Debug> DDLWorker: Cleaning queue
2020.04.23 00:30:32.746629 [ 23 ] {} <Debug> DDLWorker: Will not execute task query-0000000007: There is no a local address in host list
2020.04.23 00:30:32.746641 [ 23 ] {} <Debug> DDLWorker: Waiting a watch
I'm completely confused by this message. I've tried restarting the cluster, disabling the DNS cache, and setting the parameter explicitly, but nothing worked.
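As far as I can tell, "There is no a local address in host list" means that a node compared every host:port entry of the task with its own network addresses and its own listening TCP port, and none of them matched. Below is a minimal Python sketch of that kind of check, to make my understanding concrete. It is not ClickHouse's actual code: the function names, the host names, the IP addresses and the ports are all hypothetical, and the rule "same port and a shared resolved address" is my assumption about how the match works.

```python
import socket
from typing import Callable, Iterable, List, Set, Tuple


def resolve(host: str) -> Set[str]:
    """Return the IP addresses a host name resolves to (empty set on failure)."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return set()
    return {info[4][0] for info in infos}


def is_local_entry(entry_ips: Iterable[str], entry_port: int,
                   local_ips: Iterable[str], local_port: int) -> bool:
    """Assumed rule: an entry is local only if the port equals this server's
    TCP port AND one of its resolved addresses belongs to this machine."""
    return entry_port == local_port and bool(set(entry_ips) & set(local_ips))


def find_local_entries(hosts: Iterable[Tuple[str, int]],
                       local_ips: Iterable[str], local_port: int,
                       resolver: Callable[[str], Set[str]] = resolve
                       ) -> List[Tuple[str, int]]:
    """Return the (host, port) entries of the task that this node would treat as itself."""
    local = set(local_ips)
    return [(h, p) for h, p in hosts
            if is_local_entry(resolver(h), p, local, local_port)]


if __name__ == "__main__":
    # Hypothetical data: a stub resolver stands in for /etc/hosts.
    fake_hosts = {"node1": {"10.0.0.1"}, "node2": {"10.0.0.2"}}
    stub = lambda h: fake_hosts.get(h, set())
    task_hosts = [("node1", 9000), ("node2", 9000)]

    # The node owns 10.0.0.1 but listens on 9002: no entry matches.
    print(find_local_entries(task_hosts, {"10.0.0.1"}, 9002, stub))
    # Same node listening on 9000: the node1 entry matches.
    print(find_local_entries(task_hosts, {"10.0.0.1"}, 9000, stub))
```

If this reading is right, an empty result on every node would correspond to "0 of them are currently active" in the error above, either because the ports in the cluster definition differ from the port each server listens on, or because the host names resolve to addresses the nodes do not own.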
What else should I do? Many thanks.
Regards