Druid's behavior when the deep storage is unreachable

Question

What is the behavior of Druid when the deep storage nodes are no longer reachable? Is it still able to expose data? For how long? Can we at least query the newly ingested data?

Context:

Druid version 0.19.0;
HDFS is used as a deep storage for Druid;
Data is ingested in "realtime" through Kafka;
Data is queried regularly every 5 minutes.

Vaibhav · Accepted Answer

To understand what is likely to happen, let's first cover a few basics.

What is deep storage for Druid? Deep storage is where segments are stored. It defines the durability of your data: as long as Druid processes can see deep storage and fetch the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from deep storage, you lose whatever data those segments represented.

How are segments published to deep storage? This is done by Druid indexing tasks (either batch ingestion tasks or real-time ingestion tasks such as Kafka indexing).

How, when, and where do segments become available for query in Druid? Let's break this into two parts: batch ingestion and streaming ingestion.

(1) Batch ingestion: The indexing task processes the data, creates the segments, and publishes them to deep storage. Once the segments are published, they are copied from deep storage to the Historicals' local segment cache directories according to the load rules. When a query needs them, the Historicals load the segments into memory from their local segment cache, compute, and serve the query result.

(2) Real-time ingestion: In short, as long as the segments have not been created and published to deep storage, real-time queries are served from the running real-time indexing task. As soon as the task reads rows from the Kafka topic or Kinesis stream, they should be available for query. Once the segments are created, they are copied first to deep storage and then to the Historicals.
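
The two paths above can be summarized in a small toy model. This is an illustrative Python sketch only, not Druid code: the `Ingestion` and `Stage` enums and the `served_by` function are hypothetical names, and the model collapses publishing and loading into a single step for brevity.

```python
from enum import Enum, auto
from typing import Optional


class Ingestion(Enum):
    BATCH = auto()
    REALTIME = auto()


class Stage(Enum):
    # Rows are still being processed by an indexing task; nothing is published yet.
    IN_TASK = auto()
    # The segment was published to deep storage and copied to a Historical.
    ON_HISTORICAL = auto()


def served_by(ingestion: Ingestion, stage: Stage) -> Optional[str]:
    """Return which process answers queries for the data, or None if it is not queryable yet."""
    if stage is Stage.ON_HISTORICAL:
        return "historical"
    # Before publishing, only a real-time task exposes the rows it has read.
    if ingestion is Ingestion.REALTIME:
        return "indexing task"
    return None


# Example: compare the two ingestion types before anything is published.
print(served_by(Ingestion.REALTIME, Stage.IN_TASK))
print(served_by(Ingestion.BATCH, Stage.IN_TASK))
```

The point of the sketch is that only the real-time path has a queryable state that does not depend on deep storage having been written to.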

To answer your questions -

What is the behavior of Druid when the deep storage nodes are no longer reachable? Is it still able to expose data? For how long? Can we at least query the newly ingested data?

I am assuming that deep storage is inaccessible to Druid (i.e., Druid cannot see it). In that case, the following effects can generally be expected:

(a) Ingestion tasks should fail, because they won't be able to publish their segments to deep storage.

(b) Druid won't be able to load or drop segments as dictated by the load rules (i.e., load segments from deep storage or drop segments based on the rules).

(c) For the real-time indexing task, I think you should be able to query the real-time data for up to the task duration. The reason is that, for the length of the taskDuration interval, the indexing task only reads data from your real-time stream, and up to that point I don't see any interaction with deep storage (theoretically). However, the indexing task will fail once the task duration has elapsed, because it will then try to publish its segments to deep storage, which is not accessible.

However, you should still be able to query whatever segments are already loaded in your Druid cluster (i.e., in the Historicals' local segment cache).
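
The expected effects can be written down as a small sketch. This is a simplified Python model of the reasoning in this answer, not a description of Druid internals: the `outage_effects` function and its dictionary keys are hypothetical, and the one-hour task duration in the example is an assumed value rather than one taken from the question.

```python
from datetime import datetime, timedelta


def outage_effects(now, task_start, task_duration, deep_storage_reachable):
    """Summarize, following the answer's reasoning, what the cluster can do at time `now`."""
    # After taskDuration the task stops reading and tries to publish its segments.
    publish_due = now >= task_start + task_duration
    return {
        # Segments already in a Historical's local cache do not need deep storage.
        "cached_segments_queryable": True,
        # New segments can only be pulled from deep storage.
        "can_load_new_segments": deep_storage_reachable,
        # While the task is still reading, its rows are served by the task itself.
        "realtime_rows_from_task": not publish_due,
        # Publishing needs deep storage, so the task is expected to fail at this point.
        "task_expected_to_fail": publish_due and not deep_storage_reachable,
    }


# Example with an assumed task start and a one-hour task duration.
start = datetime(2020, 1, 1, 12, 0)
duration = timedelta(hours=1)
print(outage_effects(start + timedelta(minutes=30), start, duration, False))
print(outage_effects(start + timedelta(minutes=65), start, duration, False))
```

Under these assumptions, the first call describes a cluster that still answers both historical and real-time queries, while the second describes the moment the real-time task is expected to fail on publish.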

I think you should fix the deep storage accessibility issue to keep everything green and on track.
