Reputation: 661
The ANALYZE TABLE command run from Spark on a Hive table does not give the same performance improvement as the same command issued from Hive.
For example, I have inserted a dataframe into an empty Hive table:
output.write.insertInto("XXXXXXXX")
and then run the ANALYZE TABLE command:
spark.sql("ANALYZE TABLE XXXXXXXX COMPUTE STATISTICS")
When I do a record count in Hive, it is very slow:
select count(*) from XXXXXXXX;
+-----------+
|    _c0    |
+-----------+
| 12345678  |
+-----------+
1 row selected (36.038 seconds)
But if I run the same ANALYZE TABLE command directly in Hive, performance improves:
select count(*) from XXXXXXXX;
+-----------+
|    _c0    |
+-----------+
| 12345678  |
+-----------+
1 row selected (0.096 seconds)
Can anyone explain why this difference occurs? Is there a workaround?
Upvotes: 4
Views: 5673
Reputation: 867
It's because Spark only does half the job here. Spark's ANALYZE TABLE command writes the computed statistics, in a Spark-specific format, into table properties that Hive knows nothing about, and it completely ignores the standard Hive statistics fields while doing so.
If you run
show create table XXXXXXXX
in Hive after the spark.sql("ANALYZE ...") step, you will see the following in the TBLPROPERTIES section:
TBLPROPERTIES (
'numFiles'='1',
'numRows'='-1',
'rawDataSize'='-1',
'spark.sql.statistics.numRows'='12345678',
'spark.sql.statistics.totalSize'='12345678',
'totalSize'='12345678',
Only after running the same command in Hive do you get the correct information:
TBLPROPERTIES (
'numFiles'='1',
'numRows'='12345678',
'rawDataSize'='12345678',
'spark.sql.statistics.numRows'='12345678',
'spark.sql.statistics.totalSize'='12345678',
'totalSize'='12345678',
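For what it's worth, you can dump the same properties from the Spark side too (a small sketch, same table name as above); the Hive-native numRows/rawDataSize entries only change after Hive's own ANALYZE has run:
// Show all table properties as Spark sees them
spark.sql("SHOW TBLPROPERTIES XXXXXXXX").show(100, false)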
Could Spark fill in the numRows and rawDataSize fields correctly when computing statistics? Probably. Why doesn't it? No idea. :( Probably the Spark devs consider themselves above doing optimizations for a lowly system like Hive (even though other frameworks such as Apache Impala would also benefit from correct stats here, because Impala actually uses them).
The only workaround is to follow up your Spark job with a corresponding Hive job that runs the COMPUTE STATISTICS command, which is really ugly. :( One way to script that follow-up step is sketched below.
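A rough sketch of that follow-up step: fire the statement at HiveServer2 over JDBC so that Hive, not Spark, executes it. The host, port, credentials and the assumption that the Hive JDBC driver is on the classpath are all placeholders here, not part of the original setup:
import java.sql.DriverManager

// Run ANALYZE through HiveServer2 so Hive fills its own numRows / rawDataSize
val conn = DriverManager.getConnection(
  "jdbc:hive2://hiveserver2-host:10000/default", "hive_user", "hive_password")
try {
  val stmt = conn.createStatement()
  stmt.execute("ANALYZE TABLE XXXXXXXX COMPUTE STATISTICS")
  stmt.close()
} finally {
  conn.close()
}
Anything that goes through Hive itself (beeline, the hive CLI, an Oozie Hive action) works just as well; the point is only that Hive must be the engine executing ANALYZE so its own statistics fields get populated.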
Upvotes: 11