Bob

Reputation: 661

Spark vs Hive differences with ANALYZE TABLE command -

The ANALYZE TABLE command run from Spark on a Hive table does not give the same performance improvement as the same command issued from Hive.

For example, I have inserted a dataframe into an empty Hive table:-

output.write.insertInto("XXXXXXXX")

and then run the analyze table command:-

spark.sql("ANALYZE TABLE XXXXXXXX COMPUTE STATISTICS")

When I do a record count in Hive it is very slow:-

select count(*) from XXXXXXXX;
+-----------+
|    _c0    |
+-----------+
| 12345678  |
+-----------+
1 row selected (36.038 seconds)

But if I run the same analyze table command directly in Hive, performance improves:-

select count(*) from XXXXXXXX;
+-----------+
|    _c0    |
+-----------+
| 12345678  |
+-----------+
1 row selected (0.096 seconds)

Can anyone explain why this difference occurs? Is there a workaround?

Upvotes: 4

Views: 5673

Answers (1)

Clemens Valiente

Reputation: 867

Bluntly, it's because Spark doesn't play nicely with Hive here. Spark's ANALYZE TABLE command only writes the computed statistics, in a Spark-specific format, to table properties that Hive is unaware of, and it completely ignores the standard Hive statistics fields when doing so.

If you run

show create table XXXXXXXX

in Hive after the spark.sql("ANALYZE ...") step, you will see the following in the TBLPROPERTIES section:

TBLPROPERTIES ( 
  'numFiles'='1', 
  'numRows'='-1', 
  'rawDataSize'='-1', 
  'spark.sql.statistics.numRows'='12345678', 
  'spark.sql.statistics.totalSize'='12345678', 
  'totalSize'='12345678', 

Only after running the same command in Hive do you get the correct information:

TBLPROPERTIES ( 
  'numFiles'='1', 
  'numRows'='12345678', 
  'rawDataSize'='12345678', 
  'spark.sql.statistics.numRows'='12345678', 
  'spark.sql.statistics.totalSize'='12345678', 
  'totalSize'='12345678', 
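
You can inspect the same properties from the Spark side as well; a minimal sketch in Scala, using the placeholder table name from the question:

// SHOW TBLPROPERTIES lists the table properties, including the
// spark.sql.statistics.* keys written by Spark's ANALYZE TABLE
spark.sql("SHOW TBLPROPERTIES XXXXXXXX").show(100, false)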

Could Spark fill in the numRows and rawDataSize fields correctly when computing statistics? Probably. Why doesn't it? No idea. :( It's a pity, because other frameworks such as Apache Impala would also benefit from correct Hive statistics here, since Impala actually uses them.

The only workaround is to follow up your Spark job with a corresponding Hive job that runs the COMPUTE STATISTICS command, which is admittedly ugly. :(
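
If you want to keep it in a single job, one option is to fire the Hive-side ANALYZE from the Spark driver over JDBC. A minimal sketch in Scala, assuming HiveServer2 is reachable at a placeholder URL and the hive-jdbc driver is on the classpath (URL, user and password below are assumptions, not something from the question):

import java.sql.DriverManager

// Placeholder HiveServer2 connection details - adjust to your cluster
val url = "jdbc:hive2://hiveserver:10000/default"
val conn = DriverManager.getConnection(url, "user", "")
try {
  val stmt = conn.createStatement()
  // Running ANALYZE through Hive fills numRows and rawDataSize in
  // TBLPROPERTIES, which is what Hive's query planner actually reads
  stmt.execute("ANALYZE TABLE XXXXXXXX COMPUTE STATISTICS")
  stmt.close()
} finally {
  conn.close()
}

Alternatively, the same ANALYZE statement can simply be run from beeline or the Hive CLI as a separate step after the Spark job finishes.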

Upvotes: 11
