Reputation: 87
I am trying to delete all data from HBase table, which has a timestamp older than a specified timestamp. This contains all the column families and rows.
Is there a way this can be done using shell as well as Java API?
Upvotes: 5
Views: 6996
Reputation: 950
If you want to remove the data from the shell and do not want to write Java Client, then you can proceed as follows:
#!/bin/bash
start_time=1607731200000
end_time=1607817600000
row_key_file="/tmp/$start_time-$end_time.rowkey"
touch $row_key_file
now=$(date +'%Y-%m-%d:%H-%M-%S')
echo "$now: scanning records from date range $start_time to $end_time"
echo -e "scan 'YOUR_TABLE_NAME', {TIMERANGE => [$start_time, $end_time]}" | hbase shell -n | awk -F ' ' '{if(length($1) > 20){print $1}}' > $row_key_file
rows_scanned=$(wc -l $row_key_file | cut -d' ' -f1)
echo "Rows scanned: $rows_scanned"
echo "deleting rows"
echo -e "File.foreach('$row_key_file') { |line| key=line.strip; deleteall 'extract_job_results', key; }" | hbase shell -n
now=$(date +'%Y-%m-%d:%H-%M-%S')
echo "$now: Data truncation completed"
the start_time and end_time is the epoch in milliseconds for your start and end time range. This will delete all the rows in the time range.
Upvotes: 0
Reputation: 1403
HBase has no concept of range delete markers. This means that if you need to delete multiple cells, you need to place delete marker for every cell, which means you'll have to scan each row, either on the client side or server side. This means that you have two options:
Scan and delete: This is a clean and the easiest option. Since you said that you need to delete all column families older than a particular timestamp, the scan and delete operation can be optimized greatly by using server side filtering to read only the first key of each row.
Scan scan = new Scan();
scan.setTimeRange(0, STOP_TS); // STOP_TS: The timestamp in question
// Crucial optimization: Make sure you process multiple rows together
scan.setCaching(1000);
// Crucial optimization: Retrieve only row keys
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
new FirstKeyOnlyFilter(), new KeyOnlyFilter());
scan.setFilter(filters);
ResultScanner scanner = table.getScanner(scan);
List<Delete> deletes = new ArrayList<>(1000);
Result [] rr;
do {
// We set caching to 1000 above
// make full use of it and get next 1000 rows in one go
rr = scanner.next(1000);
if (rr.length > 0) {
for (Result r: rr) {
Delete delete = new Delete(r.getRow(), STOP_TS);
deletes.add(delete);
}
table.delete(deletes);
deletes.clear();
}
} while(rr.length > 0);
Upvotes: 6
Reputation: 633
Yes, this can be done easily by setting time range to scanner and then deleting the returned result set.
public class BulkDeleteDriver {
//Added colum family and column to lessen the scan I/O
private static final byte[] COL_FAM = Bytes.toBytes("<column family>");
private static final byte[] COL = Bytes.toBytes("column");
final byte[] TEST_TABLE = Bytes.toBytes("<TableName>");
public static void main(final String[] args) throws IOException,
InterruptedException {
//Create connection to Hbase
Configuration conf = null;
Connection conn = null;
try {
conf = HBaseConfiguration.create();
//Path to HBase-site.xml
conf.addResource(new Path(hbasepath));
//Get the connection
conn = ConnectionFactory.createConnection(conf);
logger.info("Connection created successfully");
}
catch (Exception e) {
logger.error(e + "Connection Unsuccessful");
}
//Get the table instance
Table table = conn.getTable(TableName.valueOf(TEST_TABLE));
List<Delete> listOfBatchDeletes = new ArrayList<Delete>();
long recordCount = 0;
// Set scanCache if required
logger.info("Got The Table : " + table.getName());
//Get calendar instance and get proper start and end timestamps
Calendar calStart = Calendar.getInstance();
calStart.add(Calendar.DAY_OF_MONTH, day);
Calendar calEnd = Calendar.getInstance();
calEnd.add(Calendar.HOUR, hour);
//Get timestamps
long starTS = calStart.getTimeInMillis();
long endTS = calEnd.getTimeInMillis();
//Set all scan related properties
Scan scan = new Scan();
//Most important part of code set it properly!
//here my purpose it to delete everthing Present Time - 6 hours
scan.setTimeRange(starTS, endTS);
scan.setCaching(scanCache);
scan.addColumn(COL_FAM, COL);
//Scan the table and get the row keys
ResultScanner resultScanner = table.getScanner(scan);
for (Result scanResult : resultScanner) {
Delete delete = new Delete(scanResult.getRow());
//Create batches of Bult Delete
listOfBatchDeletes.add(delete);
recordCount++;
if (listOfBatchDeletes.size() == //give any suitable batch size here) {
System.out.println("Firing Batch Delete Now......");
table.delete(listOfBatchDeletes);
//don't forget to clear the array list
listOfBatchDeletes.clear();
}}
System.out.println("Firing Final Batch of Deletes.....");
table.delete(listOfBatchDeletes);
System.out.println("Total Records Deleted are.... " + recordCount);
try {
table.close();
} catch (Exception e) {
e.printStackTrace();
logger.error("ERROR", e);
}}}
Upvotes: 0