akhil pathirippilly

Reputation: 1060

Count(*) over same External table gives different values in spark.sql() and hive

I am working on an AWS cluster with Hive and Spark. Yesterday I ran into a weird situation while running an ETL PySpark script over an external table in Hive.

We have a control table that has an extract date column. We filter data from a staging table (a managed table in Hive, but with its location in an S3 bucket) based on the extract date and load it into a target table, which is an external table with its data in an S3 bucket. We load the table as below:

spark.sql("INSERT OVERWRITE target_table select * from DF_made_from_stage_table")

Now when I check count(*) on the target table via Spark as well as via the Hive CLI directly, the two give different counts.

In Spark:


 spark.sql("select count(1) from target") -- returns 50K records

In Hive:

 select count(1) from target -- returns 50K - 100 records

Note: There was an intermittent issue with statistics on the external table that caused Hive to return -1 as the count. We resolved that by running

ANALYZE TABLE target COMPUTE STATISTICS

But even after doing all this, Hive still returns original_count - 100, while Spark gives the correct count.

Upvotes: 2

Views: 2639

Answers (1)

akhil pathirippilly

Reputation: 1060

There was a mistake in the DDL used for the external table: "skip.header.line.count"="1" was set, and the load produced 100 output files. Hive therefore skipped one line per file, which caused the original count - 100 in Hive, while Spark counted correctly. Removing "skip.header.line.count"="1" made the count come out as expected.
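For reference, a minimal sketch of how the property can end up in the DDL and how to drop it without recreating the table (table name, columns, and S3 path here are hypothetical, not from the actual cluster):

```sql
-- Hypothetical DDL: the header-skip property applies per data file,
-- so with 100 files Hive drops 100 rows from every count.
CREATE EXTERNAL TABLE target (
  id BIGINT,
  extract_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/target/'
TBLPROPERTIES ("skip.header.line.count"="1");

-- Drop the property in place instead of recreating the table
ALTER TABLE target UNSET TBLPROPERTIES ("skip.header.line.count");
```

Since the files written by INSERT OVERWRITE have no header rows, the property had nothing legitimate to skip and simply discarded real data rows in Hive.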

Upvotes: 1
