Alifiya Ali

Reputation: 87

Delete all data from HBase table according to time range?

I am trying to delete all data from an HBase table that has a timestamp older than a specified timestamp. This applies to all column families and rows.

Is there a way this can be done using the shell as well as the Java API?

Upvotes: 5

Views: 6996

Answers (3)

Juvenik

Reputation: 950

If you want to remove the data from the shell and do not want to write a Java client, you can proceed as follows:

    #!/bin/bash
    # Start and end of the range as epoch timestamps in milliseconds
    start_time=1607731200000
    end_time=1607817600000

    row_key_file="/tmp/$start_time-$end_time.rowkey"
    touch "$row_key_file"
    now=$(date +'%Y-%m-%d:%H-%M-%S')

    echo "$now: scanning records from date range $start_time to $end_time"
    # Scan the time range and keep only the row keys; the length check filters out
    # the shell's banner and summary lines
    echo "scan 'YOUR_TABLE_NAME', {TIMERANGE => [$start_time, $end_time]}" | hbase shell -n | awk -F ' ' '{if(length($1) > 20){print $1}}' > "$row_key_file"

    rows_scanned=$(wc -l "$row_key_file" | cut -d' ' -f1)
    echo "Rows scanned: $rows_scanned"
    echo "deleting rows"

    # Feed the collected row keys back into the shell and deleteall each of them
    # (make sure this is the same table you scanned above)
    echo "File.foreach('$row_key_file') { |line| key=line.strip; deleteall 'YOUR_TABLE_NAME', key; }" | hbase shell -n
    now=$(date +'%Y-%m-%d:%H-%M-%S')
    echo "$now: Data truncation completed"

start_time and end_time are epoch timestamps in milliseconds for the start and end of your time range. Note that deleteall removes the entire row, so every row with at least one cell in the time range is deleted completely.
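If you need to derive those millisecond values, here is a minimal sketch using plain java.time (nothing HBase-specific; the example dates are simply the ones that match the constants in the script above):

    import java.time.Instant;

    public class EpochMillis {
        public static void main(String[] args) {
            // These instants correspond to the start_time/end_time constants above
            long startTime = Instant.parse("2020-12-12T00:00:00Z").toEpochMilli(); // 1607731200000
            long endTime = Instant.parse("2020-12-13T00:00:00Z").toEpochMilli();   // 1607817600000
            System.out.println(startTime + " " + endTime);
        }
    }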

Upvotes: 0

Ashu Pachauri

Reputation: 1403

HBase has no concept of range delete markers. If you need to delete multiple cells, you have to place a delete marker for every cell, which means scanning each row, either on the client side or the server side. That leaves you with two options:

  1. BulkDeleteProtocol: This uses a coprocessor endpoint, so the complete operation runs on the server side. The link has an example of how to use it, and a web search will easily turn up how to enable a coprocessor endpoint in HBase. (A rough sketch of the client-side call is included after the code below.)
  2. Scan and delete: This is the cleanest and easiest option. Since you said that you need to delete all column families older than a particular timestamp, the scan and delete operation can be optimized greatly by using server-side filtering to read only the first key of each row.

    // Requires java.util.{List, ArrayList}, org.apache.hadoop.hbase.client.* and
    // org.apache.hadoop.hbase.filter.*; 'table' is a Table obtained from the Connection
    Scan scan = new Scan();
    scan.setTimeRange(0, STOP_TS);  // STOP_TS: the timestamp in question
    // Crucial optimization: make sure you process multiple rows together
    scan.setCaching(1000);
    // Crucial optimization: retrieve only row keys, not the row contents
    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
        new FirstKeyOnlyFilter(), new KeyOnlyFilter());
    scan.setFilter(filters);
    ResultScanner scanner = table.getScanner(scan);
    List<Delete> deletes = new ArrayList<>(1000);
    Result[] rr;
    do {
      // We set caching to 1000 above;
      // make full use of it and get the next 1000 rows in one go
      rr = scanner.next(1000);
      if (rr.length > 0) {
        for (Result r : rr) {
          // Delete(row, ts) deletes everything in the row up to and including ts
          Delete delete = new Delete(r.getRow(), STOP_TS);
          deletes.add(delete);
        }
        table.delete(deletes);
        deletes.clear();
      }
    } while (rr.length > 0);
    scanner.close();
    

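For completeness, here is a rough sketch of what the client-side call for option 1 (BulkDeleteProtocol) can look like, assuming the BulkDeleteEndpoint coprocessor from the hbase-examples module is already loaded on the table. The class names (BulkDeleteProtos, BlockingRpcCallback, etc.) follow the HBase 1.x example code and may differ between versions, so treat this as a sketch rather than copy-paste:

    import java.io.IOException;
    import java.util.Map;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.client.coprocessor.Batch;
    import org.apache.hadoop.hbase.coprocessor.example.generated.BulkDeleteProtos.BulkDeleteRequest;
    import org.apache.hadoop.hbase.coprocessor.example.generated.BulkDeleteProtos.BulkDeleteRequest.DeleteType;
    import org.apache.hadoop.hbase.coprocessor.example.generated.BulkDeleteProtos.BulkDeleteResponse;
    import org.apache.hadoop.hbase.coprocessor.example.generated.BulkDeleteProtos.BulkDeleteService;
    import org.apache.hadoop.hbase.ipc.BlockingRpcCallback;
    import org.apache.hadoop.hbase.ipc.ServerRpcController;
    import org.apache.hadoop.hbase.protobuf.ProtobufUtil;

    public class BulkDeleteSketch {
      // Sketch: delete every row with cells older than stopTs, entirely server side
      public static long bulkDelete(Table table, long stopTs) throws Throwable {
        Scan scan = new Scan();
        scan.setTimeRange(0, stopTs);
        final BulkDeleteRequest request = BulkDeleteRequest.newBuilder()
            .setScan(ProtobufUtil.toScan(scan))
            .setDeleteType(DeleteType.ROW) // place row delete markers
            .setRowBatchSize(1000)         // rows deleted per internal batch
            .build();
        // null start/stop keys: run the endpoint on every region of the table
        Map<byte[], BulkDeleteResponse> responses = table.coprocessorService(
            BulkDeleteService.class, null, null,
            new Batch.Call<BulkDeleteService, BulkDeleteResponse>() {
              public BulkDeleteResponse call(BulkDeleteService service) throws IOException {
                ServerRpcController controller = new ServerRpcController();
                BlockingRpcCallback<BulkDeleteResponse> callback = new BlockingRpcCallback<>();
                service.delete(controller, request, callback);
                return callback.get();
              }
            });
        long rowsDeleted = 0;
        for (BulkDeleteResponse response : responses.values()) {
          rowsDeleted += response.getRowsDeleted();
        }
        return rowsDeleted;
      }
    }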
Upvotes: 6

Vikram Singh Chandel

Reputation: 633

Yes, this can be done easily by setting a time range on the scanner and then deleting the rows in the returned result set.

    // Imports needed for this to compile (HBase 1.x client API and SLF4J logging)
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Calendar;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class BulkDeleteDriver {
      private static final Logger logger = LoggerFactory.getLogger(BulkDeleteDriver.class);

      // Added column family and column to lessen the scan I/O
      private static final byte[] COL_FAM = Bytes.toBytes("<column family>");
      private static final byte[] COL = Bytes.toBytes("column");
      private static final byte[] TEST_TABLE = Bytes.toBytes("<TableName>");

      private static final String HBASE_SITE_PATH = "<path to hbase-site.xml>";
      private static final int SCAN_CACHE = 1000;        // rows fetched per scanner RPC
      private static final int BATCH_SIZE = 1000;        // give any suitable batch size here
      private static final int START_OFFSET_DAYS = -30;  // example: window starts 30 days back
      private static final int END_OFFSET_HOURS = -6;    // purpose: delete everything older than now - 6 hours

      public static void main(final String[] args) throws IOException, InterruptedException {
        // Create the connection to HBase
        Configuration conf = HBaseConfiguration.create();
        conf.addResource(new Path(HBASE_SITE_PATH));
        Connection conn;
        try {
          conn = ConnectionFactory.createConnection(conf);
          logger.info("Connection created successfully");
        } catch (Exception e) {
          logger.error("Connection unsuccessful", e);
          return; // no point in continuing without a connection
        }

        // Get the table instance
        Table table = conn.getTable(TableName.valueOf(TEST_TABLE));
        List<Delete> listOfBatchDeletes = new ArrayList<Delete>();
        long recordCount = 0;
        logger.info("Got the table: " + table.getName());

        // Get calendar instances and derive proper start and end timestamps
        Calendar calStart = Calendar.getInstance();
        calStart.add(Calendar.DAY_OF_MONTH, START_OFFSET_DAYS);
        Calendar calEnd = Calendar.getInstance();
        calEnd.add(Calendar.HOUR, END_OFFSET_HOURS);

        long startTS = calStart.getTimeInMillis();
        long endTS = calEnd.getTimeInMillis();

        // Set all scan-related properties.
        // Most important part of the code: set the time range properly!
        // Here the purpose is to delete everything older than present time - 6 hours.
        Scan scan = new Scan();
        scan.setTimeRange(startTS, endTS);
        scan.setCaching(SCAN_CACHE);
        scan.addColumn(COL_FAM, COL);

        // Scan the table for the row keys and fire the deletes in batches
        ResultScanner resultScanner = table.getScanner(scan);
        for (Result scanResult : resultScanner) {
          listOfBatchDeletes.add(new Delete(scanResult.getRow()));
          recordCount++;
          if (listOfBatchDeletes.size() == BATCH_SIZE) {
            System.out.println("Firing batch delete now......");
            table.delete(listOfBatchDeletes);
            // don't forget to clear the list
            listOfBatchDeletes.clear();
          }
        }
        resultScanner.close();

        // Fire whatever is left over after the last full batch
        if (!listOfBatchDeletes.isEmpty()) {
          System.out.println("Firing final batch of deletes.....");
          table.delete(listOfBatchDeletes);
        }
        System.out.println("Total records deleted: " + recordCount);

        try {
          table.close();
          conn.close();
        } catch (Exception e) {
          logger.error("ERROR", e);
        }
      }
    }

Upvotes: 0
