for_stack

Reputation: 22946

How to do incremental computation with big data

I have a big data set (about 10T), and I need to do hourly updates on these key-value pairs (0.4B keys) with some incremental data (about 300G per hour), e.g. appending new events to a user's event list. I also need to remove out-of-date data to keep the size of the data set bounded. Although the incremental data is small compared to the data set, most of the keys in the data set are updated every hour.

So far I have two solutions, but neither is good enough.

  1. Hadoop. Take the data set and the incremental data as the mapper input, and do the updates in the reducer (a rough sketch is given after this list). However, since the data set is so large, it's not a good idea to distribute and shuffle the whole thing every hour. We also considered doing daily updates (i.e. only updating the data set once a full day's incremental data is ready), but then the whole MapReduce job takes more than a day to finish...

  2. NoSQL. Keep the data set in a distributed NoSQL store (in our case, RocksDB) and apply the updates against the database. Since most keys are updated, we CANNOT use the 'Get-Update-Set' pattern (there are too many keys, and random reads are slow). Luckily, keys in RocksDB are ordered, so we can sort the incremental data and merge-sort it against the data set in RocksDB; that way we only do sequential reads on RocksDB, which is fast (a sketch of this merge follows the next paragraph). Compared to the Hadoop solution, we don't need to shuffle the large data set every hour. We can also accumulate several hours of incremental data before updating the data set, so we don't scan the whole data set every hour.
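For reference, here is a minimal sketch of what the reducer-side merge in option 1 could look like, assuming the mappers read both the base data set and the hourly delta and emit (user key, serialized record) pairs; the class name, value encoding and separator are hypothetical.

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer: the mappers emit (key, record) pairs from both the
    // 10T base data set and the 300G hourly delta, so each key's old value and
    // its new events meet here -- which is exactly why the whole data set has
    // to be shuffled on every run.
    public class MergeReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder merged = new StringBuilder();
            for (Text value : values) {
                if (merged.length() > 0) {
                    merged.append(';');            // illustrative record separator
                }
                merged.append(value.toString());
            }
            context.write(key, new Text(merged.toString()));
        }
    }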

The second solution works well when the data set is not that big (say, 3T). However, when the data set grows to 10T there is too much disk read and write, and the system performs badly.
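For concreteness, here is a minimal sketch of the sorted-merge update in option 2, written against RocksJava. It assumes the hourly delta has already been sorted with the same bytewise ordering as RocksDB's default comparator; DeltaRecord, the delta iterator and the byte-concatenation of event lists are illustrative placeholders, not your actual format.

    import java.util.Arrays;
    import java.util.Iterator;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;
    import org.rocksdb.RocksIterator;

    public class SortedMergeUpdate {

        // One record of the sorted hourly delta: a key plus the serialized
        // events to append (both layouts are hypothetical).
        public static final class DeltaRecord {
            final byte[] key;
            final byte[] events;
            public DeltaRecord(byte[] key, byte[] events) {
                this.key = key;
                this.events = events;
            }
        }

        // Walks the whole DB and the sorted delta in lock-step, so every read
        // on RocksDB is sequential; only keys present in the delta are rewritten.
        public static void merge(RocksDB db, Iterator<DeltaRecord> delta) throws RocksDBException {
            try (RocksIterator it = db.newIterator()) {
                it.seekToFirst();
                DeltaRecord d = delta.hasNext() ? delta.next() : null;
                while (it.isValid() && d != null) {
                    // The default RocksDB comparator is bytewise, so an unsigned
                    // byte comparison matches the iteration order.
                    int cmp = Arrays.compareUnsigned(it.key(), d.key);
                    if (cmp < 0) {
                        it.next();                                    // no new events this hour
                    } else if (cmp == 0) {
                        db.put(d.key, append(it.value(), d.events));  // append and write back
                        it.next();
                        d = delta.hasNext() ? delta.next() : null;
                    } else {
                        db.put(d.key, d.events);                      // key seen for the first time
                        d = delta.hasNext() ? delta.next() : null;
                    }
                }
                while (d != null) {                                   // remaining deltas are new keys
                    db.put(d.key, d.events);
                    d = delta.hasNext() ? delta.next() : null;
                }
            }
        }

        private static byte[] append(byte[] oldValue, byte[] newEvents) {
            byte[] out = new byte[oldValue.length + newEvents.length];
            System.arraycopy(oldValue, 0, out, 0, oldValue.length);
            System.arraycopy(newEvents, 0, out, oldValue.length, newEvents.length);
            return out;
        }
    }

Even though the reads are sequential, nearly every key's value is still rewritten on each pass, so the write volume stays close to the size of the full data set.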

I'm wondering whether there is a good approach to incremental computation on big data, or a NoSQL store designed for batch computation. I really don't think Hadoop is a good choice for incremental computation, especially when the incremental data is much smaller than the data set.

Any suggestion? Thanks!

Upvotes: 1

Views: 1065

Answers (1)

Mufaddal Kamdar

Reputation: 249

As you mentioned, if your data is huge (in the TBs), it makes sense to use a big data platform.

I would suggest a hybrid approach:

  1. Push all the updates/deltas into a delta table in Hive.
  2. Find a window of hours for the batch process to trigger; the batch job then reads the delta table from Hive and applies the updates to the actual data (a sketch is given after this list).
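As a simplified illustration of step 2, one common Hive reconciliation pattern is to union the base table with the delta table and keep only the newest row per key. The snippet below drives that statement through the HiveServer2 JDBC driver; the host, database, table and column names (base_table, delta_table, user_key, event_list, update_ts) are hypothetical, and this is only one way to write the merge, not necessarily the exact recipe from the linked post.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Hypothetical batch job: compacts the accumulated delta into a reconciled
    // copy of the base table. Requires the Hive JDBC driver on the classpath.
    public class HiveDeltaMerge {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                // Keep only the newest row per key across the base table and
                // the delta table, and write the result into a reconciled table.
                stmt.execute(
                    "INSERT OVERWRITE TABLE base_table_reconciled "
                    + "SELECT user_key, event_list FROM ("
                    + "  SELECT user_key, event_list, "
                    + "         ROW_NUMBER() OVER (PARTITION BY user_key "
                    + "                            ORDER BY update_ts DESC) AS rn "
                    + "  FROM ("
                    + "    SELECT user_key, event_list, update_ts FROM base_table "
                    + "    UNION ALL "
                    + "    SELECT user_key, event_list, update_ts FROM delta_table"
                    + "  ) unioned"
                    + ") ranked WHERE rn = 1");
            }
        }
    }

If the goal is to append to an event list rather than replace it (as in the question), the inner query would aggregate the old list and the new events per key instead of ranking rows, but the overall shape of the job stays the same.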

For the incremental updates done by the batch job on Hive, you can find a detailed description at http://hadoopgig.blogspot.com/2015/08/incremental-updates-in-hive.html

Hope this helps

Upvotes: 1
