Reputation: 369
I'm a newbie with hadoop & hive. I want to delete certain rows in my database - which is on hive-hadoop. I know its not supported out of the box, and that hadoop is a read only file system. I'm curious about what are the best approaches for accomplishing this. If anyone has done this before, can they share their learnings/procedures?
Thanks!
Upvotes: 4
Views: 39764
Reputation: 41
INSERT OVERWRITE TABLE mytable SELECT * FROM mytable WHERE 1=0;
The above query first deletes the data files, as OVERWRITE
option is specified in the query. Next it will execute SELECT
to read the records from the table, as condition fails, SELECT
query will get zero records and insert zero records into the table.
Upvotes: 4
Reputation: 21
Or you could insert the rest of of rows (i.e., those you want to keep) into another table and then drop the old table.
Upvotes: 1
Reputation: 1630
In Big Data there really aren't deletes. That said, you can overwrite your table or partition if it isn't too big, or isolate your deletes to a particular partition like JamCon suggests.
For datasets which are not too huge, you can do something like
INSERT OVERWRITE TABLE mytable
SELECT * FROM mytable
WHERE ID NOT IN ( 'delete1', 'delete2', 'delete3');
Upvotes: 13
Reputation: 2333
The best approach is to partition your data such that the rows you want to drop are in a partition unto themselves. You can then drop the partition without impacting the rest of your table. This is a fairly sustainable model, even if your dataset grows quite large.
Upvotes: 3