Rinaz Belhaj

Reputation: 825

Performance of Spark while reading from Hive vs Parquet

Assuming I have an external Hive table on top of Parquet/ORC files partitioned on date, what would be the performance impact of using

spark.read.parquet("s3a://....").filter("date_col='2021-06-20'")

vs.

spark.sql("select * from table").filter("date_col='2021-06-20'")

After the data is read into a dataframe, it goes through a series of transformations and aggregations.

Spark version: 2.3.0 or 3.0.2

Hive version: 1.2.1000

Number of records per day: 300-700 million

My hunch is that there won't be any performance difference between the two queries, since Parquet natively provides most of the optimizations that a Hive metastore would add and Spark is able to use them, e.g. predicate push-down and the benefits of columnar storage.
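One way to sanity-check that hunch (a sketch of my own, not part of the original post): the physical plan printed by explain() shows whether the partition filter and any pushed-down predicates are applied at scan time, for either way of reading.

// Hypothetical check: inspect the scan node for PartitionFilters / PushedFilters.
val direct = spark.read.parquet("s3a://....").filter("date_col='2021-06-20'")
direct.explain()

val viaHive = spark.sql("select * from table").filter("date_col='2021-06-20'")
viaHive.explain()
// Both plans should show the date_col predicate as a partition filter,
// meaning only the matching partition directories are scanned.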

As a follow-up question, what happens if

  1. The underlying data was CSV instead of Parquet. Does having a Hive table on top improve performance?
  2. The Hive table was bucketed. Does it make sense to read the underlying file system in this case instead of reading from the table?

Also, are there any situations where reading directly from Parquet is a better option compared to Hive?

Upvotes: 1

Views: 2905

Answers (1)

kanielc

Reputation: 1322

Hive should actually be faster here: while both paths benefit from predicate pushdown, Hive already has the schema stored in the metastore, whereas the Parquet read as you have it will need to infer (and possibly merge) the schema from the files. You can make them roughly equivalent by providing the schema yourself.
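For example, something along these lines would supply the schema up front (the column names and types below are made up for illustration; use the real ones from your table):

import org.apache.spark.sql.types._

// Hypothetical schema; replace with the actual columns of the table.
val tableSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", DoubleType),
  StructField("date_col", StringType)   // partition column
))

val df = spark.read
  .schema(tableSchema)                  // no schema inference or merging needed
  .parquet("s3a://....")
  .filter("date_col='2021-06-20'")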

You can make the Parquet version even faster by pointing the reader directly at the partition directory. This avoids the initial filtering step over all the available partitions.

So something like this would do it:

spark.read.option("basePath", "s3a://....").parquet("s3a://..../date_col=2021-06-20")

Note this works best if you already have a schema, because this also skips schema merging.
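Putting the two together, a minimal sketch (reusing the hypothetical tableSchema from above) could look like:

val day = spark.read
  .option("basePath", "s3a://....")          // keeps date_col as a column in the result
  .schema(tableSchema)                       // skips inference and schema merging
  .parquet("s3a://..../date_col=2021-06-20")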

As to your follow-ups:

  1. It would make a huge difference if the underlying data were CSV, because Spark would then have to parse every row in full and only afterwards discard the rows and columns it doesn't need; there is no columnar pruning or predicate pushdown. CSV is really bad for large datasets (see the sketch after this list).
  2. Reading the file system directly shouldn't gain you much and may get you into trouble. The metadata that Hive stores (schema, partitions, bucketing) lets Spark navigate the data more efficiently than you can by walking the paths yourself.
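As a rough sketch of the CSV point (column names hypothetical): with a Hive table the metastore supplies the schema for free, while reading the files directly means either paying for schema inference (an extra pass over the data) or declaring the schema yourself; either way every row still has to be parsed in full.

import org.apache.spark.sql.types._

// Hypothetical schema for the CSV files.
val csvSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", DoubleType),
  StructField("date_col", StringType)
))

val csvDf = spark.read
  .schema(csvSchema)                    // avoids the schema-inference pass
  .option("header", "true")             // assumes the files have a header row
  .csv("s3a://....")
  .filter("date_col='2021-06-20'")
// Partition pruning on date_col still works if the files are laid out by partition,
// but within each file there is no column pruning or predicate pushdown.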

Upvotes: 2
