Reputation: 1487
I am using Hive bundled with Spark. My Spark Streaming job writes 250 Parquet files to HDFS per batch, in the form /hdfs/nodes/part-r-$partition_num-$job_hash.gz.parquet. This means that after one job I have 250 files in HDFS, and after two I have 500. My external Hive table, which uses Parquet, points at /hdfs/nodes for its location, but it doesn't update to include the data in the new files after I rerun the program.
Do Hive external tables pick up new files added to the location, or only updates to files that already existed when the table was created?
Also see my related question about automatically updating tables using Hive.
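For reference, the table was created along these lines (the schema below is a placeholder; the relevant parts are the Parquet format and the external location):

```sql
-- Illustrative only: the real columns differ, the location matches my setup.
CREATE EXTERNAL TABLE nodes (
  id STRING,
  value DOUBLE
)
STORED AS PARQUET
LOCATION '/hdfs/nodes';
```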
Upvotes: 0
Views: 3382
Reputation: 1487
This is a bit of a hack, but I did eventually get Hive to detect the new files by writing them into new partitions and running MSCK REPAIR TABLE tablename, which picks up the new partitions after they have been created.
This does not fix the original issue, as I have to create a new partition each time I have new files I want in Hive, but it has allowed me to move forward.
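A minimal sketch of what I ended up with, assuming a partition column such as batch_id (the column name and paths are illustrative):

```sql
-- Hypothetical partitioned version of the table; batch_id is illustrative.
CREATE EXTERNAL TABLE nodes (
  id STRING,
  value DOUBLE
)
PARTITIONED BY (batch_id STRING)
STORED AS PARQUET
LOCATION '/hdfs/nodes';

-- Each streaming batch writes its Parquet files under a new directory,
-- e.g. /hdfs/nodes/batch_id=42/part-r-....gz.parquet, and then:
MSCK REPAIR TABLE nodes;
-- MSCK REPAIR TABLE scans the table location and registers any partition
-- directories that are not yet known to the metastore.
```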
Upvotes: 1
Reputation: 238
You need to issue a REFRESH table_name or INVALIDATE METADATA [[db_name.]table_name] statement so the table metadata is updated to include the new files. These are Impala commands, so this solution assumes you have Impala running.
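For example, in impala-shell after a batch finishes (assuming the table from the question is named nodes):

```sql
-- REFRESH reloads the file and block metadata for one table; use it after
-- new data files land in an existing table's location.
REFRESH nodes;

-- INVALIDATE METADATA discards the cached metadata and reloads it on next
-- access; heavier, but needed if the table itself was just created or altered.
INVALIDATE METADATA nodes;
```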
Upvotes: 0