I want to scan lots of data (range-based queries). What optimizations can I make when writing the data so that scans become faster?

Question

I have billions of rows in HBase and I want to scan millions of rows at a time. What are the best optimization techniques to make this scan as fast as possible?

Alexander Kuznetsov · Accepted Answer

We had a similar problem: we needed to look up millions of rows by key, and we used MapReduce for it. There is no standard solution for this, so we wrote a custom input format that extends InputFormat. Here is a short description of how we did it.

First you need to create the splits so that each key goes to the machine hosting the region that contains it:

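import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.util.Pair;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

@Override
public List<InputSplit> getSplits(JobContext context) throws IOException {
    // Read the keys to look up (all keys, across every region).
    byte[][] filterKeys = readFilterKeys(context);

    if (table == null) {
        throw new IOException("No table was provided.");
    }

    // Start keys (first) and end keys (second) of every region in the table.
    Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
    if (keys == null || keys.getFirst() == null || keys.getFirst().length == 0) {
        throw new IOException("Expecting at least one region.");
    }

    List<InputSplit> splits = new ArrayList<InputSplit>(keys.getFirst().length);
    for (int i = 0; i < keys.getFirst().length; i++) {
        // Select the keys that belong to the current region,
        // i.e. those lying between the region's start and end key.
        byte[][] regionKeys =
                getRegionKeys(keys.getFirst()[i], keys.getSecond()[i], filterKeys);
        if (regionKeys == null) {
            // No requested key falls into this region: no split needed.
            continue;
        }
        String regionLocation = table.getRegionLocation(keys.getFirst()[i])
                .getServerAddress().getHostname();
        // Create one split per region, located on the region's host.
        InputSplit split = new MultiplyValueSplit(table.getTableName(),
                regionKeys, regionLocation);
        splits.add(split);
    }
    return splits;
}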
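The getRegionKeys helper is not shown above. A minimal sketch of what it could look like, assuming the keys are simply filtered by comparing them against the region boundaries (the class name RegionKeyFilter and the linear filtering are my assumptions, not the original code). It relies on the HBase conventions that a region covers [startKey, endKey) and that an empty start or end key means "unbounded":

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; only the getRegionKeys name comes from the answer.
final class RegionKeyFilter {

    // Unsigned lexicographic comparison, the ordering HBase uses for row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Returns the keys that fall into [start, end), or null if there are none.
    // An empty start key means "no lower bound"; an empty end key means
    // "no upper bound" (first and last region of a table).
    static byte[][] getRegionKeys(byte[] start, byte[] end, byte[][] filterKeys) {
        List<byte[]> hits = new ArrayList<>();
        for (byte[] key : filterKeys) {
            boolean afterStart = start.length == 0 || compare(key, start) >= 0;
            boolean beforeEnd = end.length == 0 || compare(key, end) < 0;
            if (afterStart && beforeEnd) {
                hits.add(key);
            }
        }
        return hits.isEmpty() ? null : hits.toArray(new byte[0][]);
    }
}
```

This checks every key against every region, which is fine for a sketch. If the key list is sorted, a binary search for the region boundaries would avoid the repeated full pass.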

The 'MultiplyValueSplit' class holds the table name, the keys to read, and the region location:

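public class MultiplyValueSplit extends InputSplit
    implements Writable, Comparable<MultiplyValueSplit> {

    private byte[] tableName;       // table the keys belong to
    private byte[][] keys;          // row keys located in one region
    private String regionLocation;  // host name of the region server

    // Constructor, getters, getLength(), getLocations(), write(),
    // readFields() and compareTo() are omitted here.
}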
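Because the split is sent to the task that processes it, it has to serialize its fields through the Writable methods write and readFields. A rough sketch of how those two methods could look for the three fields above; the SplitFields class and the toBytes/fromBytes helpers are hypothetical and exist only so the example runs without Hadoop on the classpath:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the serialization part of MultiplyValueSplit.
final class SplitFields {
    byte[] tableName = new byte[0];
    byte[][] keys = new byte[0][];
    String regionLocation = "";

    // Same signature as Writable.write: length-prefix every byte array.
    void write(DataOutput out) throws IOException {
        writeBytes(out, tableName);
        out.writeInt(keys.length);
        for (byte[] key : keys) {
            writeBytes(out, key);
        }
        out.writeUTF(regionLocation);
    }

    // Same signature as Writable.readFields: read fields in write order.
    void readFields(DataInput in) throws IOException {
        tableName = readBytes(in);
        keys = new byte[in.readInt()][];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = readBytes(in);
        }
        regionLocation = in.readUTF();
    }

    private static void writeBytes(DataOutput out, byte[] b) throws IOException {
        out.writeInt(b.length);
        out.write(b);
    }

    private static byte[] readBytes(DataInput in) throws IOException {
        byte[] b = new byte[in.readInt()];
        in.readFully(b);
        return b;
    }

    // Convenience helpers for trying out the round trip in memory.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            write(new DataOutputStream(buf));
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static SplitFields fromBytes(byte[] data) {
        try {
            SplitFields s = new SplitFields();
            s.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return s;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the real class the same two method bodies would sit directly in MultiplyValueSplit, next to getLocations() returning the region host so the scheduler can place the task close to the data.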

The createRecordReader method of the input format class creates a 'MultiplyValuesReader', which contains the logic for reading the values from the table.

@Override
public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
        InputSplit split, TaskAttemptContext context) throws IOException {
    HTable table = this.getHTable();
    if (table == null) {
        throw new IOException("Cannot create a record reader because of a" +
                " previous error. Please look at the previous logs lines from" +
                " the task's full log for more details.");
    }

    MultiplyValueSplit mSplit = (MultiplyValueSplit) split;
    MultiplyValuesReader mvr = new MultiplyValuesReader();

    mvr.setKeys(mSplit.getKeys());
    mvr.setHTable(table);
    mvr.init();

    return mvr;
}

The 'MultiplyValuesReader' class contains the logic for reading the data from the HTable:

public class MultiplyValuesReader
        extends RecordReader<ImmutableBytesWritable, Result> {
    .......

    @Override
    public ImmutableBytesWritable getCurrentKey() {
        return key;
    }

    @Override
    public Result getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (this.results == null) {
            return false;
        }

        while (this.results != null) {
            if (resultCurrentKey >= results.length) {
                this.results = getNextResults();
                continue;
            }

            if (key == null) key = new ImmutableBytesWritable();
            value = results[resultCurrentKey];
            resultCurrentKey++;

            if (value != null && value.size() > 0) {
                key.set(value.getRow());
                return true;
            }

        }
        return false;
    }

    @Override
    public float getProgress() {
        // Fraction of the keys that have already been requested.
        return (keys.length > 0 ? ((float) currentKey) / keys.length : 0.0f);
    }

    private Result[] getNextResults() throws IOException {
        // All keys have already been requested: signal the end of the input.
        if (currentKey >= keys.length) {
            return null;
        }

        // Fetch the next BATCH_SIZE keys with one batched get,
        // which is much faster than one round trip per key.
        int end = Math.min(currentKey + BATCH_SIZE, keys.length);
        ArrayList<Get> batch = new ArrayList<Get>(BATCH_SIZE);
        for (int i = currentKey; i < end; i++) {
            batch.add(new Get(keys[i]));
        }

        currentKey = end;
        resultCurrentKey = 0;
        return htable.get(batch);
    }

}
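The batching in getNextResults can be tried out without a cluster. Below is a small sketch of the same index arithmetic, with a plain function standing in for htable.get(batch); the BatchedLookup class and its fetch parameter are hypothetical names introduced only for this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical, HBase-free model of the reader's batching logic.
final class BatchedLookup<K, V> {
    private final List<K> keys;
    private final int batchSize;
    private final Function<List<K>, List<V>> fetch; // stand-in for htable.get(batch)
    private int next = 0;                           // plays the role of currentKey

    BatchedLookup(List<K> keys, int batchSize, Function<List<K>, List<V>> fetch) {
        this.keys = keys;
        this.batchSize = batchSize;
        this.fetch = fetch;
    }

    // Returns the results for the next batch, or null once every key was requested.
    List<V> nextBatch() {
        if (next >= keys.size()) {
            return null;
        }
        int end = Math.min(next + batchSize, keys.size());
        List<V> results = fetch.apply(new ArrayList<>(keys.subList(next, end)));
        next = end;
        return results;
    }

    // Fraction of keys requested so far, as in getProgress().
    float progress() {
        return keys.isEmpty() ? 0.0f : (float) next / keys.size();
    }
}
```

With five keys and a batch size of two, the fetch function is called with batches of 2, 2 and 1 keys, after which nextBatch() returns null. The batch size is a trade-off: larger batches mean fewer round trips but more memory per request.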

For more details, look at the source code of the TableInputFormat, TableInputFormatBase, TableSplit and TableRecordReader classes.
