Varshini

Reputation: 69

Updates and Inserts

We are receiving hourly JSON data into HDFS. Each hour's batch is approximately 5-6 GB.

We have tried Hive's MERGE statement for this use case, but the merge operation takes more than an hour to complete. In total we add about 150 GB of data to Hive per day, and every day we have to scan that 150 GB to determine whether each incoming record needs an update or an insert. Is there an alternative approach that would resolve this?

What is the best way to do upserts (updates and inserts) on a large dataset in Hadoop: Hive, HBase, or NiFi? And what would the flow look like?

Upvotes: 2

Views: 142

Answers (1)

Saravanan Elumalai

Reputation: 328

We are using Uber's Hudi (Hoodie) library for a similar use case. It uses Spark, together with partitioning and a bloom-filter index, for faster merging: instead of scanning the whole table, it looks up which files contain the incoming keys and rewrites only those. The resulting tables can be queried from Hive and Presto.
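As a rough illustration, an hourly upsert with Hudi from PySpark could look like the sketch below. The table name, record key, partition column, precombine field, and HDFS paths are all hypothetical placeholders, and the `"hudi"` format string is from recent Apache Hudi releases (the original Uber library used `com.uber.hoodie`):

```python
# Minimal sketch of an hourly Hudi upsert from PySpark.
# All field names and paths below are assumptions for illustration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hourly-hudi-upsert")
         .getOrCreate())

# One hour's worth of incoming JSON from HDFS
df = spark.read.json("hdfs:///data/incoming/latest_hour/")

hudi_options = {
    "hoodie.table.name": "events",                          # target table (placeholder)
    "hoodie.datasource.write.operation": "upsert",          # update-or-insert semantics
    "hoodie.datasource.write.recordkey.field": "event_id",  # unique key (placeholder)
    "hoodie.datasource.write.partitionpath.field": "dt",    # partition column (placeholder)
    "hoodie.datasource.write.precombine.field": "ts",       # latest record wins on key collision
}

# Hudi uses its bloom-filter index to find the files containing the
# incoming keys and rewrites only those, rather than scanning the
# full 150 GB table on every run.
(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("hdfs:///data/hudi/events"))
```

The precombine field is what lets Hudi deduplicate within a batch: when two records share a key, the one with the larger value in that field is kept.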

The DeltaStreamer tool can be used for a quick setup and initial testing.
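For reference, a DeltaStreamer run is typically launched via `spark-submit`; the sketch below is a hypothetical invocation, with placeholder paths, table name, and properties file. The class and option names are from the Hudi utilities bundle and may differ slightly between releases:

```shell
# Hypothetical DeltaStreamer invocation (paths and names are placeholders).
spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  hudi-utilities-bundle.jar \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonDFSSource \
  --source-ordering-field ts \
  --target-base-path hdfs:///data/hudi/events \
  --target-table events \
  --props dfs-source.properties
```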

Upvotes: 1
