Reputation: 623
I have two ways of reading the same data.
Consider the following code:
val myCoolDataSet = spark
  .sql("select * from myData")
  .select("col1", "col2")
  .as[MyDataSet]
  .filter(x => x.col1 == "Dummy")
And this one:
val myCoolDataSet = spark
  .read
  .parquet("path_to_file")
  .select("col1", "col2")
  .as[MyDataSet]
  .filter(x => x.col1 == "Dummy")
My question is: which is better in terms of performance and the amount of scanned data? How does Spark compute this for the two different approaches?
Upvotes: 5
Views: 4209
Reputation: 12804
Hive serves as a store for metadata about the Parquet files (schema, partitions, and possibly table statistics). Spark can leverage the information contained therein to perform interesting optimizations, such as partition pruning. Since the backing storage is the same, you'll probably not see much difference, but the optimizations based on the metadata in Hive can give that approach an edge.
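One way to check this for yourself is to compare the physical plans Spark generates for both reads. The sketch below assumes a SparkSession with Hive support and that the hypothetical Hive table `myData` is backed by the same Parquet files as `path_to_file`; it is illustrative, not a definitive benchmark.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("plan-comparison")
  .enableHiveSupport()    // needed so spark.sql can resolve Hive tables
  .getOrCreate()

import spark.implicits._
case class MyDataSet(col1: String, col2: String)

// Read via the Hive metastore
val viaHive = spark.sql("select * from myData")
  .select("col1", "col2")
  .as[MyDataSet]
  .filter(x => x.col1 == "Dummy")

// Read the Parquet files directly
val viaParquet = spark.read.parquet("path_to_file")
  .select("col1", "col2")
  .as[MyDataSet]
  .filter(x => x.col1 == "Dummy")

// Inspect the FileScan node in each plan: ReadSchema shows column
// pruning, and PushedFilters / PartitionFilters show how much data
// Spark can skip at the source.
viaHive.explain()
viaParquet.explain()
```

Note that a typed lambda filter like `x => x.col1 == "Dummy"` is opaque to the optimizer; a column expression such as `filter($"col1" === "Dummy")` is more likely to show up as a pushed-down filter in the plan.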
Upvotes: 10