Reputation: 69
We are receiving hourly JSON data into HDFS. The data is approximately 5-6 GB per hour.
When a matching record is found in the final table, update (or delete) it.
If the record is not matched in the final dataset, insert it.
We have tried the Hive MERGE option for this use case, but it takes more than an hour to process the merge operation in Hive. Is there any alternative approach for this use case? Basically, every day we add about 150 GB of data into Hive, and every day we have to scan that 150 GB to find out whether each record needs an update or an insert (roughly the full-scan merge sketched below).
What is the best way to do upserts (updates and inserts) in Hadoop for a large dataset: Hive, HBase, or NiFi? What would the flow be?
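For illustration, the logic we need is roughly the following full-scan merge, expressed here as a Spark sketch (table, column, and path names are made up; deletes would additionally need a tombstone column, which is omitted here):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("full-scan-upsert")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical names: final_table holds the merged history,
// the JSON path is one hourly drop.
val existing = spark.table("final_table")
val hourly   = spark.read.json("/data/incoming/dt=2019-01-01/hour=09")

// Keep the newest version of each record key: incoming rows replace
// matched rows (the update), and brand-new keys are kept (the insert).
val latestFirst = Window.partitionBy("record_id").orderBy(col("event_ts").desc)
val merged = existing.unionByName(hourly)
  .withColumn("rn", row_number().over(latestFirst))
  .filter(col("rn") === 1)
  .drop("rn")

// The expensive part: the entire final table is read and rewritten on
// every run, so the cost grows with total table size, not delta size.
merged.write.mode(SaveMode.Overwrite).saveAsTable("final_table_merged")
```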
Upvotes: 2
Views: 142
Reputation: 328
We are using Uber's Hoodie library (now Apache Hudi) for a similar use case. It uses Spark with partitioning and a bloom-filter index for faster merging, and it supports Hive and Presto.
The DeltaStreamer tool can be used for quick setup and initial testing.
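As a rough sketch (table name, record key, and paths here are hypothetical), a Hudi upsert from Spark looks something like this; the format identifier was com.uber.hoodie before the project moved to Apache:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-upsert")
  .enableHiveSupport()
  .getOrCreate()

// One hourly JSON drop (hypothetical path).
val hourly = spark.read.json("/data/incoming/dt=2019-01-01/hour=09")

// Upsert into the Hudi-managed table: keys that already exist are
// updated, new keys are inserted. The bloom-filter index lets Hudi
// rewrite only the affected files instead of scanning the whole table.
hourly.write
  .format("org.apache.hudi")                                        // "com.uber.hoodie" in older releases
  .option("hoodie.table.name", "final_table")                       // hypothetical table name
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "record_id")   // unique business key
  .option("hoodie.datasource.write.precombine.field", "event_ts")   // latest record wins per key
  .option("hoodie.datasource.write.partitionpath.field", "dt")      // partition column
  .mode(SaveMode.Append)
  .save("/warehouse/hudi/final_table")
```

With Hive sync enabled (the hoodie.datasource.hive_sync.* options), the resulting table can then be queried from Hive and Presto.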
Upvotes: 1