unknown

Reputation: 53

Delta ingestion in Druid from S3

I am doing a POC with Druid. Ingesting 289 MB of data from S3 takes about 7 minutes with the default configuration. I then set "maxNumConcurrentSubTasks": 2 and "appendToExisting": true and ingested the same data from S3 again, but it took almost the same time. I was expecting it to be much faster, since I have not updated any data and I am appending rather than overwriting the complete dataset.

Am I misunderstanding the concept of append in Druid? Also, is there an optimal way to do delta ingestion from S3? Any leads would be appreciated.

Upvotes: 1

Views: 808

Answers (1)

Peter Marshall

Reputation: 196

In the console, check that the subtasks are actually running concurrently. You may need to raise druid.worker.capacity on the MiddleManager to tell Druid that more task slots (and so more cores) are available for ingestion.

See https://druid.apache.org/docs/latest/configuration/index.html#middlemanager-configuration.
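To make the settings from the question concrete, here is a rough sketch of where they sit in a native batch (index_parallel) spec, plus a helper that estimates how many task slots the job wants. The datasource name, bucket prefix, timestamp column and granularity are placeholders I made up, not values from the question. As far as I understand it, appendToExisting only adds new segments next to the existing ones and does not compare incoming rows with what is already stored, so re-ingesting the same files does the same amount of work (and duplicates the rows). The slot estimate assumes one supervisor task plus one slot per concurrent subtask, so treat it as a rule of thumb.

```python
import json


def build_parallel_s3_spec(datasource, s3_prefix, max_subtasks=2, append=True):
    """Return an index_parallel ingestion spec as a plain dict.

    All schema details below (timestamp column, granularity, empty
    dimension list) are placeholders for illustration only.
    """
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "timestampSpec": {"column": "timestamp", "format": "auto"},
                "dimensionsSpec": {"dimensions": []},
                "granularitySpec": {
                    "segmentGranularity": "day",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": {"type": "s3", "prefixes": [s3_prefix]},
                "inputFormat": {"type": "json"},
                # Append adds new segments; it does not skip unchanged data.
                "appendToExisting": append,
            },
            "tuningConfig": {
                "type": "index_parallel",
                # Appending needs dynamic partitioning.
                "partitionsSpec": {"type": "dynamic"},
                "maxNumConcurrentSubTasks": max_subtasks,
            },
        },
    }


def required_task_slots(spec):
    """Estimate the task slots needed: supervisor task plus subtasks.

    Assumption: with maxNumConcurrentSubTasks <= 1 the supervisor task
    ingests on its own, so a single slot is enough.
    """
    subtasks = spec["spec"]["tuningConfig"]["maxNumConcurrentSubTasks"]
    return 1 if subtasks <= 1 else subtasks + 1


if __name__ == "__main__":
    spec = build_parallel_s3_spec("poc_datasource", "s3://my-bucket/data/")
    print(json.dumps(spec, indent=2))
    print("task slots wanted:", required_task_slots(spec))
```

If druid.worker.capacity across your MiddleManagers is lower than the number this prints, the subtasks will queue instead of running side by side, which would explain seeing no speed-up.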

It's worth checking this doc on updates and how they work: https://druid.apache.org/docs/latest/ingestion/data-management.html#updating-existing-data

There is also this helpful tutorial: https://druid.apache.org/docs/latest/tutorials/tutorial-update-data.html

The Awesome Itai has written a blog post on retention (which is good reading anyway) but it's got a bit in there about delta ingestion... I've never tried his trick but you could do some experiments and let us all know what you find :D :D

https://medium.com/nmc-techblog/data-retention-and-deletion-in-apache-druid-74ffd12398a8
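Separately from the trick in that post (which I have not tried, so this is not a description of it), one simple way to approach delta ingestion is to hand Druid only the objects that arrived since your last run, using the "uris" form of the S3 input source together with appendToExisting. A minimal sketch, assuming you already have a listing of object keys with their last-modified times (for example from your own S3 listing code) and that you store the time of the previous run somewhere yourself. The function names and the bucket are hypothetical.

```python
from datetime import datetime, timezone


def new_object_uris(objects, bucket, last_ingested_at):
    """Return s3:// URIs for objects modified after the previous run.

    objects: iterable of (key, last_modified) pairs, where last_modified
    is a timezone-aware datetime. Results are ordered oldest first.
    """
    fresh = [(key, modified) for key, modified in objects
             if modified > last_ingested_at]
    fresh.sort(key=lambda item: item[1])
    return [f"s3://{bucket}/{key}" for key, _ in fresh]


def delta_input_source(uris):
    """Build the inputSource fragment that lists explicit S3 objects."""
    return {"type": "s3", "uris": uris}


if __name__ == "__main__":
    listing = [
        ("data/part-001.json", datetime(2021, 3, 1, tzinfo=timezone.utc)),
        ("data/part-002.json", datetime(2021, 3, 3, tzinfo=timezone.utc)),
    ]
    watermark = datetime(2021, 3, 2, tzinfo=timezone.utc)
    print(delta_input_source(new_object_uris(listing, "my-bucket", watermark)))
```

This only helps when new data lands in new objects. If existing objects are rewritten in place, you are back to reindexing the affected time intervals as described in the update docs above.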

Upvotes: 1
