Reputation: 115
I'm wondering whether there are differences in performance (when reading) between these two commands:
df.write.format('parquet').partitionBy(xx).save('/.../xx.parquet')
df.write.format('parquet').partitionBy(xx).saveAsTable('...')
I understand that the question doesn't arise for bucketing, since bucketing is only supported for managed tables (saveAsTable()); however, for partitioning I'm unsure whether one method should be preferred.
Upvotes: 4
Views: 1052
Reputation: 115
I've tried to find an answer experimentally on a small DataFrame, and here are the results:
ENV = Databricks Community edition
[Attached to cluster: test, 15.25 GB | 2 Cores | DBR 7.4 | Spark 3.0.1 | Scala 2.12]
spark.conf.set("spark.sql.shuffle.partitions", "2")
spark.conf.set("spark.sql.adaptive.enabled", "true")
df.count() = 693243
RESULTS:
As expected, writing with .saveAsTable() takes a bit longer, because it has to execute a dedicated "CreateDataSourceTableAsSelectCommand" to actually create the table. However, it is interesting to observe the difference when reading, in favor of .saveAsTable() by nearly a factor of 10 in this simple example. I'd be very interested to compare the results at a much larger scale if someone ever has the ability to do it, and to understand what happens under the hood.
Upvotes: 3