Count(*) over same External table gives different values in spark.sql() and hive

Question

I am working on a AWS cluster with hive and spark. I have faced a weird situation previous day while I was running some ETL pyspark script over an external table in hive.

We have a control table which is having an extract date column. And we are filtering data from a staging table (managed table in hive, but location is s3 bucket) based on the extract date and loading to a target table which is an external table with data located in s3 bucket. We are loading the table as below

spark.sql("INSERT OVERWRITE target_table select * from DF_made_from_stage_table")

Now when I have checked the count(*) over target table via spark as well as via direct hive CLI, both are giving different count

in spark:

 spark.sql("select count(1) from target") -- giving 50K records

in hive:

 select count(1) from target -- giving a count 50k - 100

Note: There was happening an intermittent issue with statistics over external table which was giving -1 as count in hive. This we have resolved by running

ANALYZE TABLE target COMPUTE STATISTICS

But even after doing all these still we are getting original_count-100 in hive where correct count in spark.

Count(*) over same External table gives different values in spark.sql() and hive

Answers (1)

Related Questions