for_stack

Reputation: 22946

How to do incremental computation with big data

I have a big data set (about 10T), and I need to do hourly updates on these key-value pairs (0.4B keys) with some incremental data (about 300G per hour), e.g. appending new events to a user's event list. I also need to remove out-of-date data to keep the size of the data set bounded. Although the incremental data is small compared to the data set, most of the keys in the data set are updated every hour.

So far I have two solutions, but neither is good enough.

  1. Hadoop. Take the data set and the incremental data as the mapper input, and do the updates in the reducer (a rough sketch is given after this list). However, since the data set is so large, it's not a good idea to distribute and shuffle the whole thing every hour. We also considered doing daily updates (i.e. only updating the data set once a full day's incremental data is ready), but then the whole MapReduce job takes more than a day to finish...

  2. NoSQL. Keep the data set in a distributed NoSQL store (in our case, RocksDB) and apply the updates against the database. Since most keys are updated, we CANNOT use the 'Get-Update-Set' pattern (there are too many keys, and random reads are slow). Luckily, keys in RocksDB are ordered, so we can sort the incremental data and merge-sort it against the data set in RocksDB; that way we only do sequential reads on RocksDB, which is fast (a sketch of this merge follows the next paragraph). Compared to the Hadoop solution, we don't need to shuffle the large data set every hour. We can also accumulate several hours of incremental data before updating the data set, so we don't scan the whole data set every hour.
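For reference, here is a minimal sketch of what the reducer-side merge in option 1 could look like, assuming the mappers read both the base data set and the hourly delta and emit (user key, serialized record) pairs; the class name, value encoding and separator are hypothetical.

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer: the mappers emit (key, record) pairs from both the
    // 10T base data set and the 300G hourly delta, so each key's old value and
    // its new events meet here -- which is exactly why the whole data set has
    // to be shuffled on every run.
    public class MergeReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder merged = new StringBuilder();
            for (Text value : values) {
                if (merged.length() > 0) {
                    merged.append(';');            // illustrative record separator
                }
                merged.append(value.toString());
            }
            context.write(key, new Text(merged.toString()));
        }
    }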

The second solution works well when the data set is not that big (say, 3T). However, when the data set grows to 10T there is too much disk read and write, and the system performs badly.
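For concreteness, here is a minimal sketch of the sorted-merge update in option 2, written against RocksJava. It assumes the hourly delta has already been sorted with the same bytewise ordering as RocksDB's default comparator; DeltaRecord, the delta iterator and the byte-concatenation of event lists are illustrative placeholders, not your actual format.

    import java.util.Arrays;
    import java.util.Iterator;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;
    import org.rocksdb.RocksIterator;

    public class SortedMergeUpdate {

        // One record of the sorted hourly delta: a key plus the serialized
        // events to append (both layouts are hypothetical).
        public static final class DeltaRecord {
            final byte[] key;
            final byte[] events;
            public DeltaRecord(byte[] key, byte[] events) {
                this.key = key;
                this.events = events;
            }
        }

        // Walks the whole DB and the sorted delta in lock-step, so every read
        // on RocksDB is sequential; only keys present in the delta are rewritten.
        public static void merge(RocksDB db, Iterator<DeltaRecord> delta) throws RocksDBException {
            try (RocksIterator it = db.newIterator()) {
                it.seekToFirst();
                DeltaRecord d = delta.hasNext() ? delta.next() : null;
                while (it.isValid() && d != null) {
                    // The default RocksDB comparator is bytewise, so an unsigned
                    // byte comparison matches the iteration order.
                    int cmp = Arrays.compareUnsigned(it.key(), d.key);
                    if (cmp < 0) {
                        it.next();                                    // no new events this hour
                    } else if (cmp == 0) {
                        db.put(d.key, append(it.value(), d.events));  // append and write back
                        it.next();
                        d = delta.hasNext() ? delta.next() : null;
                    } else {
                        db.put(d.key, d.events);                      // key seen for the first time
                        d = delta.hasNext() ? delta.next() : null;
                    }
                }
                while (d != null) {                                   // remaining deltas are new keys
                    db.put(d.key, d.events);
                    d = delta.hasNext() ? delta.next() : null;
                }
            }
        }

        private static byte[] append(byte[] oldValue, byte[] newEvents) {
            byte[] out = new byte[oldValue.length + newEvents.length];
            System.arraycopy(oldValue, 0, out, 0, oldValue.length);
            System.arraycopy(newEvents, 0, out, oldValue.length, newEvents.length);
            return out;
        }
    }

Even though the reads are sequential, nearly every key's value is still rewritten on each pass, so the write volume stays close to the size of the full data set.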

I'm wondering whether there is a good approach to incremental computation on big data, or a NoSQL store designed for batch computation. I really don't think Hadoop is a good choice for incremental computation, especially when the incremental data is much smaller than the data set.

Any suggestion? Thanks!

Upvotes: 1

Views: 1065

Answers (1)

Mufaddal Kamdar

Reputation: 249

As you mentioned, if your data is huge (in the TBs), it makes sense to use a big data platform.

I would suggest a hybrid approach:

  1. Push all the updates/deltas into a delta table in Hive.
  2. Find a window of hours for the batch process to trigger; the batch job then reads the delta table from Hive and applies the updates to the actual data (a sketch is given after this list).
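As a simplified illustration of step 2, one common Hive reconciliation pattern is to union the base table with the delta table and keep only the newest row per key. The snippet below drives that statement through the HiveServer2 JDBC driver; the host, database, table and column names (base_table, delta_table, user_key, event_list, update_ts) are hypothetical, and this is only one way to write the merge, not necessarily the exact recipe from the linked post.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Hypothetical batch job: compacts the accumulated delta into a reconciled
    // copy of the base table. Requires the Hive JDBC driver on the classpath.
    public class HiveDeltaMerge {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive-host:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                // Keep only the newest row per key across the base table and
                // the delta table, and write the result into a reconciled table.
                stmt.execute(
                    "INSERT OVERWRITE TABLE base_table_reconciled "
                    + "SELECT user_key, event_list FROM ("
                    + "  SELECT user_key, event_list, "
                    + "         ROW_NUMBER() OVER (PARTITION BY user_key "
                    + "                            ORDER BY update_ts DESC) AS rn "
                    + "  FROM ("
                    + "    SELECT user_key, event_list, update_ts FROM base_table "
                    + "    UNION ALL "
                    + "    SELECT user_key, event_list, update_ts FROM delta_table"
                    + "  ) unioned"
                    + ") ranked WHERE rn = 1");
            }
        }
    }

If the goal is to append to an event list rather than replace it (as in the question), the inner query would aggregate the old list and the new events per key instead of ranking rows, but the overall shape of the job stays the same.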

For the incremental updates done by the batch job on Hive, you can find a detailed description at http://hadoopgig.blogspot.com/2015/08/incremental-updates-in-hive.html

Hope this helps

Upvotes: 1
