Gowri Shankar

Reputation: 161

scanning for rowkeys between start and end time

I have an HBase table whose row key pattern is {id1},{id2},{millisec}. I need to get all the row keys between a start and an end millisec while keeping either id1 or id2 constant. How do I accomplish this in HBase? I am using the Java client.

Thanks

Upvotes: 1

Views: 4151

Answers (2)

Rubén Moraleda

Reputation: 3067

a. For a known {id1}

You have to perform a scan and provide the start & stop rows. Take a look at this example extracted from the HBase reference guide:

public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...

HTable htable = ...      // instantiate HTable

Scan scan = new Scan();
scan.addColumn(CF, ATTR);
scan.setStartRow(Bytes.toBytes("row")); // start key is inclusive
scan.setStopRow(Bytes.toBytes("rox"));  // stop key is exclusive
ResultScanner rs = htable.getScanner(scan);
try {
  for (Result r = rs.next(); r != null; r = rs.next()) {
    // process result...
  }
} finally {
  rs.close();  // always close the ResultScanner!
}

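Additionally, you can use setTimeRange(long minStamp, long maxStamp) to discard cells based on their timestamp. Note that this filters on the cell timestamps, not on the {millisec} part of the row key, so it only helps if the two correspond.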
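
For the row key layout in the question, the start and stop rows for a known {id1} could be derived from the key prefix. Below is a minimal sketch in plain Java, assuming the key parts are joined by commas and encoded as UTF-8 text. The class and method names (PrefixRange, startRow, stopRow) are hypothetical and not part of HBase; the resulting arrays would be passed to setStartRow and setStopRow.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the HBase API.
final class PrefixRange {

    // Start row: the shared prefix "{id1}," (start rows are inclusive).
    static byte[] startRow(String id1) {
        return (id1 + ",").getBytes(StandardCharsets.UTF_8);
    }

    // Stop row: the same prefix with its last byte incremented (stop rows are
    // exclusive), so every key beginning with "{id1}," sorts below it.
    static byte[] stopRow(String id1) {
        byte[] stop = startRow(id1);
        stop[stop.length - 1]++; // ',' (0x2C) becomes '-' (0x2D)
        return stop;
    }
}
```

Because {id2} varies within this range, the {millisec} bounds still have to be checked per row. If both ids are known, the bounds can go into the start and stop rows directly, provided the millisec values have a fixed width so that they sort lexicographically.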

b. For a known {id2}

The only way to avoid a full table scan is to implement a secondary index (I'm not up to date on this), or to go for classic data redundancy and also store the same data under {id2},{id1},{millisec} (depending on your needs you may be able to omit some of the data), which acts as a secondary index.
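
As a rough illustration of the redundant key, the sketch below reorders a primary row key into the {id2},{id1},{millisec} form. It assumes the ids themselves contain no commas; IndexKey and fromPrimary are hypothetical names, and writing the second row to HBase is left out.

```java
// Hypothetical helper for building the redundant "index" row key.
final class IndexKey {

    // Reorders "{id1},{id2},{millisec}" into "{id2},{id1},{millisec}".
    static String fromPrimary(String primaryKey) {
        String[] parts = primaryKey.split(",", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("unexpected row key: " + primaryKey);
        }
        return parts[1] + "," + parts[0] + "," + parts[2];
    }
}
```

With the data stored under both keys, a query for a known {id2} becomes a prefix scan on the second layout, just like case a.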

If you cannot afford either of the above, you'll have to scan the whole table. To speed up the scan you can:

  1. Use setTimeRange(long minStamp, long maxStamp).
  2. Use a custom filter with a filterRowKey(byte[] buffer, int offset, int length) method to skip further processing of unwanted rows (every row that doesn't have the {id2} or whose timestamp is not within the range).
  3. Use the FuzzyRowFilter proposed here.

The best approach depends on your needs and data. I'd go for implementing a custom filter, which could give you decent performance given the fixed width of your row keys. If that's not enough, I'd go for data redundancy.
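
The row key check that such a custom filter would perform might look like the sketch below. It is plain Java with no HBase dependency, and it assumes comma-separated UTF-8 keys, ids without commas, and a half-open range [startMs, endMs) mirroring setTimeRange. RowKeyMatcher and matches are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical row key check; a custom filter could apply the same logic
// inside its filterRowKey(byte[] buffer, int offset, int length) method.
final class RowKeyMatcher {

    // Returns true when the key "{id1},{id2},{millisec}" has the wanted id2
    // and a millisec value in [startMs, endMs).
    static boolean matches(byte[] buffer, int offset, int length,
                           String id2, long startMs, long endMs) {
        String key = new String(buffer, offset, length, StandardCharsets.UTF_8);
        String[] parts = key.split(",", -1);
        if (parts.length != 3 || !parts[1].equals(id2)) {
            return false;
        }
        try {
            long ms = Long.parseLong(parts[2]);
            return ms >= startMs && ms < endMs;
        } catch (NumberFormatException e) {
            return false; // malformed key: treat as non-matching
        }
    }
}
```

Since filterRowKey returns true for rows that should be dropped, a filter built on this check would return the negation of matches(...).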

Upvotes: 2

Venki

Reputation: 1459

You can use a Scan.

The setTimeRange method can be a good fit for what you're looking for.

public Scan setTimeRange(long minStamp,
                long maxStamp)
                  throws IOException
Get versions of columns only within the specified timestamp range, [minStamp, maxStamp). Note, default maximum versions to return is 1. If your time range spans more than one version and you want all versions returned, up the number of versions beyond the default.
Parameters:
minStamp - minimum timestamp value, inclusive
maxStamp - maximum timestamp value, exclusive
Returns:
this
Throws:
IOException - if invalid time range

Upvotes: 0
