Reputation: 306
I'm trying to override an HBase method, MultiTableInputFormat.getSplits(). My implementation looks like this:
public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<Scan> scans = getScans();
    List<InputSplit> splits = new ArrayList<>();
    // The table name is taken from the first scan.
    Scan sampleScan = scans.get(0);
    byte[] tableNameBytes = sampleScan.getAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME);
    TableName tableName = TableName.valueOf(tableNameBytes);
    // try-with-resources closes the connection, locator and admin once the splits are built.
    try (Connection conn = ConnectionFactory.createConnection(context.getConfiguration());
         RegionLocator regionLocator = conn.getRegionLocator(tableName);
         Admin admin = conn.getAdmin()) {
        Pair<byte[][], byte[][]> keys = regionLocator.getStartEndKeys();
        RegionSizeCalculator sizeCalculator = new RegionSizeCalculator(regionLocator, admin);
        int regionCount = keys.getFirst().length;
        for (int i = 0; i < regionCount; i++) {
            calculateSplits(
                keys.getFirst()[i],
                keys.getSecond()[i],
                regionLocator,
                sizeCalculator,
                splits
            );
        }
    }
    return splits;
}
private void calculateSplits(
    final byte[] startKey,
    final byte[] endKey,
    RegionLocator regionLocator,
    RegionSizeCalculator sizeCalculator,
    List<InputSplit> splits
) throws IOException {
    HRegionLocation hregionLocation = regionLocator.getRegionLocation(startKey, false);
    String regionHostname = hregionLocation.getHostname();
    HRegionInfo regionInfo = hregionLocation.getRegionInfo();
    for (Scan scan : getScans()) {
        byte[] startRow = scan.getStartRow();
        byte[] stopRow = scan.getStopRow();
        // determine if the given start and stop keys fall into the range
        if (
            (startRow.length == 0 || endKey.length == 0 || Bytes.compareTo(startRow, endKey) < 0) &&
            (stopRow.length == 0 || Bytes.compareTo(stopRow, startKey) > 0)
        ) {
            byte[] splitStart = startRow.length == 0 || Bytes.compareTo(startKey, startRow) >= 0 ?
                startKey : startRow;
            byte[] splitStop =
                (stopRow.length == 0 || Bytes.compareTo(endKey, stopRow) <= 0) && endKey.length > 0 ?
                endKey : stopRow;
            long regionSize = sizeCalculator.getRegionSize(regionInfo.getRegionName());
            TableSplit split = new TableSplit(
                regionLocator.getName(), scan, splitStart, splitStop, regionHostname, regionSize
            );
            splits.add(split);
        }
    }
}
The basic idea is to get all regions together with their start and end keys. We also have a list of scans, and we check every scan against every region (scans × regions) to build the splits. This code is very slow, mostly because we have about 10,000 regions, so looking up and computing each region's information one by one takes a long time.
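To make the per-pair check concrete, here is a minimal standalone sketch of the overlap test and the clipping of a scan to a region, copied from the conditions in calculateSplits. It uses plain byte arrays instead of HBase types. The class name SplitRange and its helpers are made up for illustration, and Arrays.compareUnsigned (Java 9+) is used as a stand-in for Bytes.compareTo on the assumption that both compare bytes lexicographically as unsigned values.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Standalone sketch: plain byte[] keys, no HBase classes involved.
final class SplitRange {
    // Unsigned lexicographic comparison; stand-in for HBase's Bytes.compareTo (assumption).
    static int cmp(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    // True when the scan range [startRow, stopRow) overlaps the region [startKey, endKey).
    // An empty array means "unbounded" on that side.
    static boolean overlaps(byte[] startRow, byte[] stopRow, byte[] startKey, byte[] endKey) {
        return (startRow.length == 0 || endKey.length == 0 || cmp(startRow, endKey) < 0)
            && (stopRow.length == 0 || cmp(stopRow, startKey) > 0);
    }

    // Clips the scan range to the region and returns {splitStart, splitStop}.
    // Only meaningful when overlaps(...) returned true for the same arguments.
    static byte[][] clip(byte[] startRow, byte[] stopRow, byte[] startKey, byte[] endKey) {
        byte[] splitStart =
            (startRow.length == 0 || cmp(startKey, startRow) >= 0) ? startKey : startRow;
        byte[] splitStop =
            ((stopRow.length == 0 || cmp(endKey, stopRow) <= 0) && endKey.length > 0)
                ? endKey : stopRow;
        return new byte[][] {splitStart, splitStop};
    }

    // Hypothetical helper for building keys in examples.
    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

For a region covering "b" to "m", a scan from "d" to "z" overlaps and is clipped to "d" through "m", while a scan from "n" to "z" is skipped. None of this depends on where the region's keys come from, so it should work the same with keys taken from getStartEndKeys() or from anywhere else.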
I noticed that RegionLocator also has a method named getAllRegionLocations(). I think I could use it to fetch all regions at once and save a lot of time. The problem is that if I use this method, I don't see how to get the corresponding start and end keys, so I can't determine the range of each split. Any ideas for making this method faster?
Upvotes: 3
Views: 867
Reputation: 306
Solved! I found that the start key and end key are available from regionInfo. So first get the list of region locations with getAllRegionLocations(), loop over every HRegionLocation in it, and the second method becomes:
private void calculateSplits(
    HRegionLocation hRegionLocation,
    RegionLocator regionLocator,
    RegionSizeCalculator sizeCalculator,
    List<InputSplit> splits
) throws IOException {
    String regionHostname = hRegionLocation.getHostname();
    HRegionInfo regionInfo = hRegionLocation.getRegionInfo();
    final byte[] startKey = regionInfo.getStartKey();
    final byte[] endKey = regionInfo.getEndKey();
    ...
}
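For anyone wondering what the calling loop looks like, here is a rough sketch of its shape. It is standalone so it compiles without HBase: the nested HRegionInfo, HRegionLocation and RegionLocator types are minimal stand-ins that mirror only the methods named in this thread, and OnePassSketch and describeRegions are made-up names. It is not the real job code, only an illustration of fetching the list once and reading host and keys from each entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Standalone sketch. The nested types below are stand-ins, not the HBase classes.
final class OnePassSketch {
    static final class HRegionInfo {
        private final byte[] startKey;
        private final byte[] endKey;
        HRegionInfo(byte[] startKey, byte[] endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
        byte[] getStartKey() { return startKey; }
        byte[] getEndKey() { return endKey; }
    }

    static final class HRegionLocation {
        private final String hostname;
        private final HRegionInfo regionInfo;
        HRegionLocation(String hostname, HRegionInfo regionInfo) {
            this.hostname = hostname;
            this.regionInfo = regionInfo;
        }
        String getHostname() { return hostname; }
        HRegionInfo getRegionInfo() { return regionInfo; }
    }

    // Stand-in: the real HBase method also declares IOException.
    interface RegionLocator {
        List<HRegionLocation> getAllRegionLocations();
    }

    // Fetches all region locations with a single call, then reads the host and
    // the start/end keys from each entry. Returns one "host start end" line per region.
    static List<String> describeRegions(RegionLocator regionLocator) {
        List<String> lines = new ArrayList<>();
        for (HRegionLocation location : regionLocator.getAllRegionLocations()) {
            HRegionInfo info = location.getRegionInfo();
            // In the real job, this is where calculateSplits(location, ...) would be called.
            lines.add(location.getHostname()
                + " " + Arrays.toString(info.getStartKey())
                + " " + Arrays.toString(info.getEndKey()));
        }
        return lines;
    }
}
```

The loop body never asks the locator about an individual region, so the number of locator calls stays at one regardless of how many regions the table has.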
Upvotes: 1