Reputation: 1487
I am using Hive bundled with Spark. My Spark Streaming job writes 250 Parquet files to HDFS per batch, in the form /hdfs/nodes/part-r-$partition_num-$job_hash.gz.parquet. This means that after one job I have 250 files in HDFS, and after two I have 500. My external Hive table, which uses Parquet, points at /hdfs/nodes for its location, but it doesn't update to include the data in the new files after I rerun the program.
Do Hive external tables pick up new files added to the location, or only updates to files that already existed when the table was created?
Also see my related question about automatically updating tables using Hive.
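For reference, the table was created along these lines (the schema below is a placeholder; the relevant parts are the Parquet format and the external location):

```sql
-- Illustrative only: the real columns differ, the location matches my setup.
CREATE EXTERNAL TABLE nodes (
  id STRING,
  value DOUBLE
)
STORED AS PARQUET
LOCATION '/hdfs/nodes';
```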
Upvotes: 0
Views: 3382
Reputation: 1487
This is a bit of a hack, but I did eventually get Hive to detect the new files by writing them into new partitions and running MSCK REPAIR TABLE tablename, which picks up the new partitions after they have been created.
This does not fix the original issue, as I have to create a new partition each time I have new files I want in Hive, but it has allowed me to move forward.
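A minimal sketch of what I ended up with, assuming a partition column such as batch_id (the column name and paths are illustrative):

```sql
-- Hypothetical partitioned version of the table; batch_id is illustrative.
CREATE EXTERNAL TABLE nodes (
  id STRING,
  value DOUBLE
)
PARTITIONED BY (batch_id STRING)
STORED AS PARQUET
LOCATION '/hdfs/nodes';

-- Each streaming batch writes its Parquet files under a new directory,
-- e.g. /hdfs/nodes/batch_id=42/part-r-....gz.parquet, and then:
MSCK REPAIR TABLE nodes;
-- MSCK REPAIR TABLE scans the table location and registers any partition
-- directories that are not yet known to the metastore.
```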
Upvotes: 1
Reputation: 238
You need to issue a REFRESH table_name or INVALIDATE METADATA [[db_name.]table_name] statement so the table metadata is updated to include the new files. These are Impala commands, so this solution assumes you have Impala running.
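For example, in impala-shell after a batch finishes (assuming the table from the question is named nodes):

```sql
-- REFRESH reloads the file and block metadata for one table; use it after
-- new data files land in an existing table's location.
REFRESH nodes;

-- INVALIDATE METADATA discards the cached metadata and reloads it on next
-- access; heavier, but needed if the table itself was just created or altered.
INVALIDATE METADATA nodes;
```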
Upvotes: 0