Reputation: 1060
I am working on a AWS cluster with hive and spark. I have faced a weird situation previous day while I was running some ETL pyspark script over an external table in hive.
We have a control table which is having an extract date column. And we are filtering data from a staging table (managed table in hive, but location is s3 bucket) based on the extract date and loading to a target table which is an external table with data located in s3 bucket. We are loading the table as below
spark.sql("INSERT OVERWRITE target_table select * from DF_made_from_stage_table")
Now when I have checked the count(*) over target table via spark as well as via direct hive CLI, both are giving different count
in spark:
spark.sql("select count(1) from target") -- giving 50K records
in hive:
select count(1) from target -- giving a count 50k - 100
Note: There was happening an intermittent issue with statistics over external table which was giving -1 as count in hive. This we have resolved by running
ANALYZE TABLE target COMPUTE STATISTICS
But even after doing all these still we are getting original_count-100 in hive where correct count in spark.
Upvotes: 2
Views: 2639
Reputation: 1060
There was a mistake in the DDL used for external table. "skip.header.line.count"="1" was there in the DDL and we are having 100 output files. so 1 line each file were skipped , which caused original count - 100 in hive while spark calculated it correctly. Removed "skip.header.line.count"="1" and its giving count as expected now.
Upvotes: 1