boger

Reputation: 623

Spark Dataset on Hive vs Parquet file

I have two copies of the same data:

  1. A Hive table called myData, stored in Parquet format
  2. A Parquet file (not managed by Hive), also in Parquet format

Consider the following code:

val myCoolDataSet = spark
    .sql("select * from myData")
    .select("col1", "col2")
    .as[MyDataSet]
    .filter(x => x.col1 == "Dummy")

And this one:

val myCoolDataSet = spark
    .read
    .parquet("path_to_file")
    .select("col1", "col2")
    .as[MyDataSet]
    .filter(x => x.col1 == "Dummy")

My question is: which one is better in terms of performance and the amount of scanned data? How does Spark compute it for the two different approaches?

Upvotes: 5

Views: 4209

Answers (1)

stefanobaghino

Reputation: 12804

Hive serves as storage for metadata about the Parquet file. Spark can leverage the information contained therein to perform interesting optimizations. Since the backing storage is the same, you'll probably not see much of a difference, but the optimizations based on the metadata in Hive can give it an edge.
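
One way to see this concretely is to compare the plans Spark produces for each variant with explain. A minimal sketch, reusing the table name and path from the question:

val fromHive = spark
    .sql("select * from myData")
    .select("col1", "col2")

val fromFile = spark
    .read
    .parquet("path_to_file")
    .select("col1", "col2")

// The Hive-backed plan is resolved through the metastore entry; the
// file-based plan infers the schema from the Parquet file footers.
fromHive.explain(true)
fromFile.explain(true)

In both cases the physical plan should show a Parquet file scan with only the two selected columns read. The Hive-backed variant can additionally draw on catalog statistics (for example, those gathered by ANALYZE TABLE) when the optimizer estimates sizes.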

Upvotes: 10
