Accumulo custom Iterator applied at scan and compaction time

Question

I have implemented an iterator (extending WrappingIterator) which does some simple statistical aggregation and rewriting of keys and values. In essence, I start with keys in this form:

key         qualifier:family             value
        |:

and I aggregate over the column family, rewriting the entries into the following format (aggregating the statistic over time and deriving new stats):

key         qualifier:family             value
        :

This custom iterator is applied to the table at scan time only, and has been working well, but performance is starting to be an issue. I've thought about the following approaches to improving performance:

1) Is there a way to apply this iterator at compaction time? My thought is that the answer is 'no', because if the iterator is on the table for scans, then a scan wouldn't know which sort of data format is being read by the iterator source (i.e. original or rewritten rows). If there's a way to do this, it would be great.

2) Is there a straightforward way to just copy the table to a new table (with the custom iterator applied) such that the new table contains the aggregated data? I don't really want to kick off a MapReduce job...

3) Is there some other way of doing this that I should be looking at?

Thanks for any and all suggestions.

Christopher · Accepted Answer

The short answer is yes, you can do this at compaction time. However, there are some caveats to that.

1) You should probably do it only during a full major compaction. A partial compaction may read only a subset of a tablet's files, so a delete marker stored in another file is not seen and your statistics could aggregate data which has been deleted.
2) Your iterator should distinguish between aggregated and non-aggregated data. It could do this by examining the structure of the key, or you could put the aggregated data in a separate column family.
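As a rough illustration of those two caveats, here is a minimal sketch of the checks such an iterator might make. It is plain Java with no Accumulo dependency, and every name in it (`AggregationGuard`, `Scope`, the `agg` marker family) is hypothetical. In a real iterator the scope and the full-compaction flag would come from the `IteratorEnvironment` passed to `init` (`getIteratorScope()` and `isFullMajorCompaction()`).

```java
/** Sketch of the checks an aggregating iterator could make; all names are hypothetical. */
final class AggregationGuard {

    /** Mirrors the three scopes an Accumulo iterator can run in. */
    enum Scope { SCAN, MINC, MAJC }

    /** Assumed marker column family for entries that were already rewritten. */
    static final String AGGREGATED_FAMILY = "agg";

    /**
     * Aggregate at scan time, or during a full major compaction.
     * Minor and partial major compactions see only part of the data,
     * so they pass entries through unchanged.
     */
    static boolean shouldAggregate(Scope scope, boolean fullMajorCompaction) {
        return scope == Scope.SCAN
                || (scope == Scope.MAJC && fullMajorCompaction);
    }

    /** Tells the iterator which of the two formats it is looking at. */
    static boolean isAlreadyAggregated(String columnFamily) {
        return AGGREGATED_FAMILY.equals(columnFamily);
    }
}
```

With a check like `isAlreadyAggregated`, the same iterator could stay configured for scans: it would merge entries that were rewritten by an earlier compaction with raw entries written since then, rather than treating everything as raw input.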

The typical way to do something like this with a new table would be to clone the table, add the major compaction iterator, then trigger a full major compaction.
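The order of those three steps can be sketched as follows. This is only an outline: `TableAdmin` is a hypothetical stand-in so the example runs without a cluster, and the table and class names in the usage are made up. With the real Java client, the corresponding calls are `clone`, `attachIterator` and `compact` on `TableOperations`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the table administration calls this workflow needs. */
interface TableAdmin {
    // Real client API: TableOperations.clone(...)
    void cloneTable(String source, String target);
    // Real client API: TableOperations.attachIterator(...) with only the majc scope
    void attachMajcIterator(String table, String iteratorClass);
    // Real client API: TableOperations.compact(...) with null start and end rows
    void compactFully(String table);
}

final class CloneAndAggregate {
    /** Runs the steps in order; only the clone is ever rewritten. */
    static void run(TableAdmin admin, String source, String target, String iteratorClass) {
        admin.cloneTable(source, target);
        admin.attachMajcIterator(target, iteratorClass);
        admin.compactFully(target);
    }
}

/** Recording fake that captures the call order instead of talking to Accumulo. */
final class RecordingAdmin implements TableAdmin {
    final List<String> calls = new ArrayList<>();

    @Override
    public void cloneTable(String source, String target) {
        calls.add("clone " + source + " " + target);
    }

    @Override
    public void attachMajcIterator(String table, String iteratorClass) {
        calls.add("attach " + table + " " + iteratorClass);
    }

    @Override
    public void compactFully(String table) {
        calls.add("compact " + table);
    }
}
```

For example, `CloneAndAggregate.run(new RecordingAdmin(), "stats", "stats_agg", "com.example.StatsIterator")` records the clone, then the attach, then the compaction. One thing to watch for: a clone copies the source table's configuration, so the scan-time iterator will most likely be configured on the new table as well. You would want to either remove it from the clone or make sure it can read already-aggregated entries.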

Another way to do this would be to run a MapReduce job that reads from one table and writes to another.
