Reputation: 61
I am a beginner in Spark and trying to understand the mechanics of Spark DataFrames. I am comparing the performance of SQL queries on a Spark SQL DataFrame when loading data from CSV versus Parquet. My understanding is that once the data is loaded into a Spark DataFrame, it shouldn't matter where the data was sourced from (CSV or Parquet). However, I see a significant performance difference between the two. I am loading the data using the following commands and then writing queries against it.
dataframe_csv = sqlcontext.read.format("csv").load()
dataframe_parquet = sqlcontext.read.parquet()
Please explain the reason for the difference.
Upvotes: 3
Views: 7725
Reputation: 77
The reason you see different performance between CSV and Parquet is that Parquet uses a columnar storage format while CSV is a plain-text, row-oriented format. Columnar storage achieves a smaller storage size and lets Spark read only the columns a query actually touches, whereas a plain-text file must be parsed in full, row by row, every time it is read.
Upvotes: 1
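The row-oriented vs. columnar distinction can be illustrated outside Spark with a toy sketch in plain Python (the data and column names here are made up for illustration): to aggregate one column from a CSV-like row layout, every line must be read and every field parsed as a string, while a columnar layout keeps each column separate and already typed, so a query can touch just the one column it needs.

```python
import csv
import io

# Hypothetical dataset: 3 columns, 3 rows.
rows = [("a", 1, 2.0), ("b", 3, 4.0), ("c", 5, 6.0)]

# Row-oriented (CSV-like): to sum one column we must scan and parse
# every complete line, then convert the field from text to int.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "count", "score"])
writer.writerows(rows)
buf.seek(0)

reader = csv.reader(buf)
header = next(reader)
idx = header.index("count")
total_row_oriented = sum(int(r[idx]) for r in reader)  # parses all fields

# Column-oriented (Parquet-like): each column is stored separately
# and already typed, so we read only the "count" column, no parsing.
columns = {
    "name":  [r[0] for r in rows],
    "count": [r[1] for r in rows],
    "score": [r[2] for r in rows],
}
total_columnar = sum(columns["count"])  # touches a single column

print(total_row_oriented, total_columnar)  # 9 9
```

This is only a sketch of the storage-layout idea; real Parquet adds more on top (embedded schema, compression, column statistics) that Spark's optimizer can also exploit, which is why queries over Parquet are typically faster even though both sources end up as a DataFrame.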