Reputation: 304
I would like to know how Spark treats these cases and which will be faster and better to use: a native count on the DataFrame, or a count within Spark SQL? I assume this applies not only to count.
from time import time
from pyspark.sql import HiveContext

hc = HiveContext(sc)

# 1st case: count via the DataFrame API
t0 = time()
a = hc.table('default.testtable')
c = a.count()
tt = time() - t0
print(c)
print("Count for 1st case completed in {} seconds".format(round(tt, 3)))

# 2nd case: count via Spark SQL
t0 = time()
b = hc.sql('select count(*) from default.testtable')
c1 = b.collect()
tt = time() - t0
print(c1)
print("Count for 2nd case completed in {} seconds".format(round(tt, 3)))
If I execute both in a single run, it gives:
77707920
Count for 1st case completed in 48.358 seconds
[Row(_c0=77707920)]
Count for 2nd case completed in 20.895 seconds
If I execute them separately, first with the 2nd block commented out and then with the 1st commented out, the 1st case is even faster:
77707920
Count for 1st case completed in 18.717 seconds
[Row(_c0=77707920)]
Count for 2nd case completed in 19.074 seconds
Physical Plans for both are similar:
For collect:
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#9L])
+- TungstenExchange SinglePartition, None
+- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#12L])
+- HiveTableScan MetastoreRelation default, testtable, None
For count:
== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#9L])
+- TungstenExchange SinglePartition, None
+- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#12L])
+- HiveTableScan MetastoreRelation default, testtable, None
How does Spark treat this internally? Why are the performance results different? Which is better to use: the native DataFrame count or the Spark SQL count?
Upvotes: 1
Views: 1097
Reputation: 35229
There is no difference. Your measurements in the first case are just wrong.
If you reverse the order, you should see the timings flip as well, with table now being faster than sql, because whichever query runs first pays the one-off startup cost.
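If you want to double-check the equivalence yourself, here is a minimal sketch reusing the hc and table name from the question. DataFrame.count() is effectively an empty groupBy() followed by count(), so its physical plan can be inspected the same way as the SQL variant:

# Sketch: both should print the same TungstenAggregate / HiveTableScan plan
# shown in the question, modulo output column names.
hc.table('default.testtable').groupBy().count().explain()
hc.sql('select count(*) from default.testtable').explain()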
You may also see some performance improvements with repeated execution as a result of some JVM and system optimizations.
If you want to get reliable estimates, run the program for each scenario separately and repeat the timings multiple times with timeit.
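For example, a minimal benchmarking sketch with timeit (assuming the same hc and default.testtable as in the question; ideally each variant would also live in its own program run):

from timeit import repeat

df_times = repeat(lambda: hc.table('default.testtable').count(),
                  number=1, repeat=5)
sql_times = repeat(lambda: hc.sql('select count(*) from default.testtable').collect(),
                   number=1, repeat=5)

# The first measurement in each list typically absorbs warm-up costs;
# compare the remaining ones.
print("DataFrame count:", [round(t, 3) for t in df_times])
print("SQL count:", [round(t, 3) for t in sql_times])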
Upvotes: 2