covfefe

Reputation: 2675

Can Hive table automatically update when underlying directory is changed

If I build a Hive table on top of some S3 (or HDFS) directory like so:

create external table newtable (name string) 
row format delimited 
fields terminated by ',' 
stored as textfile location 's3a://location/subdir/';

When I add files to that S3 location, the Hive table doesn't automatically update. The new data is only included if I create a new Hive table on that location. Is there a way to build a Hive table (maybe using partitions) so that whenever new files are added to the underlying directory, the Hive table automatically shows that data (without having to recreate the Hive table)?
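(For reference, a partitioned variant would look like the sketch below; the partition column and values are illustrative. With partitions, Hive only sees directories that are registered in the metastore, so new subdirectories have to be added explicitly with `ALTER TABLE ... ADD PARTITION` or discovered with `MSCK REPAIR TABLE` — both standard Hive DDL statements.)

```sql
-- Hypothetical partitioned variant of the table above
create external table newtable (name string)
partitioned by (dt string)
row format delimited
fields terminated by ','
stored as textfile location 's3a://location/subdir/';

-- After writing files to s3a://location/subdir/dt=2021-01-01/,
-- register that partition explicitly:
alter table newtable add if not exists partition (dt='2021-01-01');

-- Or discover all unregistered partition directories at once:
msck repair table newtable;
```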

Upvotes: 3

Views: 3415

Answers (2)

leftjoin

Reputation: 38290

On HDFS, each file is scanned each time the table is queried, as @Dudu Markovitz pointed out. And files in HDFS are immediately consistent.

Update: S3 is also strongly consistent now, so I have removed the part about eventual consistency.

Also, there may be a problem with stale statistics when querying the table after adding files; see here: https://stackoverflow.com/a/39914232/2700344
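The linked answer is about Hive answering aggregate queries from stored statistics instead of scanning files. A minimal sketch of the workaround (these are standard Hive settings and commands, but behavior varies by version):

```sql
-- If Hive answers count(*)/min/max from stored statistics, newly added
-- files may not be reflected. Either disable stats-based answering:
set hive.compute.query.using.stats=false;

-- or recompute statistics after adding files:
analyze table newtable compute statistics;
```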

Upvotes: 2

stevel

Reputation: 13430

Everything @leftjoin says is correct, with one extra detail: S3 doesn't offer immediate consistency on listings. A new blob can be uploaded and HEAD/GET will return it, but a list operation on the parent path may not see it yet. This means that Hive code which lists the directory may not see the data. Using unique names doesn't fix this; the only fix is a consistent store such as DynamoDB that is updated as files are added/removed. Even then, you have added a new thing to keep in sync...
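(The "consistent DB like Dynamo" here refers to S3Guard, the S3A connector feature that keeps listing metadata in DynamoDB. A configuration sketch for older Hadoop 3.x releases is below; property names are from the Hadoop S3A documentation, and note that S3Guard was later removed once S3 itself became strongly consistent.)

```xml
<!-- core-site.xml: enable S3Guard with a DynamoDB metadata store -->
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <name>fs.s3a.s3guard.ddb.table.create</name>
  <value>true</value>
</property>
```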

Upvotes: 0
