Dipas

Reputation: 304

Pyspark HiveContext.table and HiveContext.sql performance

I would like to know how Spark treats these cases and which is faster and better to use: a native count on the DataFrame, or a count via Spark SQL? I assume this concerns more than just count.

from time import time
from pyspark.sql import HiveContext

hc = HiveContext(sc)  # sc is the SparkContext from the pyspark shell

# 1st case: native DataFrame count
t0 = time()
a = hc.table('default.testtable')
c = a.count()
tt = time() - t0
print(c)
print("Count for 1st case completed in {} seconds".format(round(tt, 3)))

# 2nd case: count via Spark SQL
t0 = time()
b = hc.sql('select count(*) from default.testtable')
c1 = b.collect()
tt = time() - t0
print(c1)
print("Count for 2nd case completed in {} seconds".format(round(tt, 3)))

If I execute both blocks in a single run, it gives:

77707920
Count for 1st case completed in 48.358 seconds
[Row(_c0=77707920)]
Count for 2nd case completed in 20.895 seconds

If I execute them separately, commenting out one block at a time (first the 2nd, then the 1st), the 1st is even faster:

77707920
Count for 1st case completed in 18.717 seconds

[Row(_c0=77707920)]
Count for 2nd case completed in 19.074 seconds

The physical plans for both are nearly identical:

For collect (the Spark SQL case):

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#9L])
+- TungstenExchange SinglePartition, None
   +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#12L])
      +- HiveTableScan MetastoreRelation default, testtable, None

For count (the DataFrame case):

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#9L])
+- TungstenExchange SinglePartition, None
   +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#12L])
      +- HiveTableScan MetastoreRelation default, testtable, None
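
For reference, plans like these can be printed with DataFrame.explain(). A minimal sketch; since count() itself is an action and returns a number, groupBy().count() is used here only to obtain a DataFrame whose plan mirrors the DataFrame count:

# Plan for the Spark SQL variant
hc.sql('select count(*) from default.testtable').explain()

# Plan for the DataFrame count variant (via an equivalent aggregation)
hc.table('default.testtable').groupBy().count().explain()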

How does Spark treat this internally? Why do the performance results differ? Which is better to use: the native DataFrame count or the Spark SQL count?

Upvotes: 1

Views: 1097

Answers (1)

Alper t. Turker

Reputation: 35229

There is no difference. Your measurements in the first case are just wrong.

  • The first timed block includes context initialization time, executor allocation time, and a number of other secondary tasks.
  • The second timed block runs only after everything is already set up.

If you reverse the order, you should see that the timings are reversed as well, with table being faster than sql.

You may also see some performance improvements with repeated execution as a result of some JVM and system optimizations.

If you want reliable estimates, run the program for each scenario separately and repeat the timing multiple times with timeit, for example as sketched below.
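
A minimal sketch of that approach, assuming hc is the HiveContext from the question; the function names are illustrative:

import timeit

def count_with_table():
    # Native DataFrame count (1st case from the question)
    return hc.table('default.testtable').count()

def count_with_sql():
    # Spark SQL count (2nd case from the question)
    return hc.sql('select count(*) from default.testtable').collect()

# Warm up once so context initialization and executor allocation
# are excluded from the measurement
count_with_table()
count_with_sql()

# Repeat each measurement several times and compare totals
for name, fn in [('table', count_with_table), ('sql', count_with_sql)]:
    total = timeit.timeit(fn, number=10)
    print("{}: {:.3f} s total over 10 runs".format(name, total))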

Upvotes: 2
