Reputation: 5
I have implemented an iterator (extending WrappingIterator) which does some simple statistical aggregation and rewriting of keys and values. In essence, I start with keys in this form:
key qualifier:family value
<id> <val1>|<val2>:<time_info> <statistic>
and I perform an aggregation over the column family and rewrite into the following format (aggregating the statistic over time, and deriving new stats)
key qualifier:family value
<id> <val1>:<val2> <derived-statistics>
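To make the rewrite concrete, here is a minimal sketch of the transformation in plain Java, outside of Accumulo. The `StatEntry` and `AggregationSketch` classes are hypothetical names used only for illustration, and the derived statistics (count, sum, mean) are assumptions, since the actual derived stats are not specified above. A real iterator would do this in a streaming fashion over sorted keys rather than collecting into a map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one source entry:
//   <id> <val1>|<val2>:<time_info> <statistic>
final class StatEntry {
    final String row;        // <id>
    final String qualifier;  // "<val1>|<val2>"
    final String family;     // <time_info>
    final double value;      // <statistic>

    StatEntry(String row, String qualifier, String family, double value) {
        this.row = row;
        this.qualifier = qualifier;
        this.family = family;
        this.value = value;
    }
}

final class AggregationSketch {
    // Collapses every time bucket for a given (id, val1, val2) into one
    // rewritten line of the form: <id> <val1>:<val2> <derived-statistics>.
    // Assumes each qualifier contains exactly one '|' separator.
    static List<String> rewrite(List<StatEntry> entries) {
        // Group key -> {count, sum}; LinkedHashMap keeps first-seen order.
        Map<String, double[]> acc = new LinkedHashMap<>();
        for (StatEntry e : entries) {
            String[] vals = e.qualifier.split("\\|", 2);
            String key = e.row + " " + vals[0] + ":" + vals[1];
            double[] a = acc.computeIfAbsent(key, k -> new double[2]);
            a[0] += 1;        // count of time buckets seen
            a[1] += e.value;  // running sum of the statistic
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, double[]> g : acc.entrySet()) {
            double[] a = g.getValue();
            out.add(g.getKey() + " count=" + (long) a[0]
                    + ",sum=" + a[1] + ",mean=" + (a[1] / a[0]));
        }
        return out;
    }
}
```

The time information is dropped from the output key on purpose: once the statistic has been aggregated over time, only `<val1>` and `<val2>` remain to distinguish the columns.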
This custom iterator is applied to the table at scan time only, and has been working well, but performance is starting to be an issue. I've thought about the following approaches to improving performance:
1) Is there a way to apply this iterator at compaction time? My thought is that the answer is 'no', because if the iterator is on the table for scans, then a scan wouldn't know which sort of data format is being read by the iterator source (i.e. original or rewritten rows). If there's a way to do this, it would be great.
2) Is there a straightforward way to just copy the table to a new table (with the custom iterator applied) such that the new table contains the aggregated data? I don't really want to kick off a map-reduce job...
3) Is there some other way of doing this that I should be looking at?
Thanks for any and all suggestions.
Upvotes: 0
Views: 254
Reputation: 2512
The short answer is yes, you can do this at compaction time. However, there are some caveats to that.
The typical way to do something like this with a new table would be to clone the table, add the iterator to the clone at the major compaction scope, then trigger a full major compaction of the clone.
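As a rough sketch of the order of operations, the steps could be driven from Java along these lines. The `TableAdmin` interface, `MaterializeAggregates`, and `RecordingAdmin` are hypothetical names introduced here so the example runs without a cluster, and the priority `20` and iterator name `agg` are arbitrary. In real code the three methods would, I believe, delegate to Accumulo's `TableOperations.clone`, `TableOperations.attachIterator` (with the `majc` scope), and `TableOperations.compact`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical thin wrapper, NOT part of Accumulo. Each method is assumed to
// delegate to the matching TableOperations call in a real implementation.
interface TableAdmin {
    void cloneTable(String source, String target);
    void attachMajcIterator(String table, int priority, String name, String iteratorClass);
    void compactAndWait(String table);
}

final class MaterializeAggregates {
    // Clone first, then configure the iterator on the clone only, then force
    // a full major compaction so every file in the clone is rewritten.
    static void run(TableAdmin admin, String source, String target, String iteratorClass) {
        admin.cloneTable(source, target);
        admin.attachMajcIterator(target, 20, "agg", iteratorClass);
        admin.compactAndWait(target);
    }
}

// Recording fake that only logs the calls, useful for a dry run.
final class RecordingAdmin implements TableAdmin {
    final List<String> calls = new ArrayList<>();

    @Override
    public void cloneTable(String source, String target) {
        calls.add("clone " + source + " -> " + target);
    }

    @Override
    public void attachMajcIterator(String table, int priority, String name, String iteratorClass) {
        calls.add("attach majc " + name + "@" + priority + " " + iteratorClass + " on " + table);
    }

    @Override
    public void compactAndWait(String table) {
        calls.add("compact " + table);
    }
}
```

Because the iterator is attached only to the clone, the original table keeps its raw format, and scans against the clone should not also apply the scan-time iterator, since the data there has already been rewritten.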
Another way to do this would be to run a MapReduce job that reads from one table and writes to another.
Upvotes: 2