Reputation: 61
I am a beginner in Spark and trying to understand the mechanics of Spark DataFrames. I am comparing the performance of SQL queries on a Spark SQL DataFrame when loading data from CSV versus Parquet. My understanding is that once the data is loaded into a Spark DataFrame, it shouldn't matter where the data was sourced from (CSV or Parquet). However, I see a significant performance difference between the two. I am loading the data using the following commands and then writing queries against it.
dataframe_csv = sqlcontext.read.format("csv").load()
dataframe_parquet = sqlcontext.read.parquet()
Please explain the reason for the difference.
Upvotes: 3
Views: 7725
Reputation: 77
The reason you see different performance between CSV and Parquet is that Parquet uses a columnar storage format while CSV is a plain-text, row-oriented format. Columnar storage achieves a smaller storage size and lets Spark read only the columns a query actually touches, whereas a plain-text file must be parsed in full, row by row, every time it is read.
Upvotes: 1
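The row-oriented vs. columnar distinction can be illustrated outside Spark with a toy sketch in plain Python (the data and column names here are made up for illustration): to aggregate one column from a CSV-like row layout, every line must be read and every field parsed as a string, while a columnar layout keeps each column separate and already typed, so a query can touch just the one column it needs.

```python
import csv
import io

# Hypothetical dataset: 3 columns, 3 rows.
rows = [("a", 1, 2.0), ("b", 3, 4.0), ("c", 5, 6.0)]

# Row-oriented (CSV-like): to sum one column we must scan and parse
# every complete line, then convert the field from text to int.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "count", "score"])
writer.writerows(rows)
buf.seek(0)

reader = csv.reader(buf)
header = next(reader)
idx = header.index("count")
total_row_oriented = sum(int(r[idx]) for r in reader)  # parses all fields

# Column-oriented (Parquet-like): each column is stored separately
# and already typed, so we read only the "count" column, no parsing.
columns = {
    "name":  [r[0] for r in rows],
    "count": [r[1] for r in rows],
    "score": [r[2] for r in rows],
}
total_columnar = sum(columns["count"])  # touches a single column

print(total_row_oriented, total_columnar)  # 9 9
```

This is only a sketch of the storage-layout idea; real Parquet adds more on top (embedded schema, compression, column statistics) that Spark's optimizer can also exploit, which is why queries over Parquet are typically faster even though both sources end up as a DataFrame.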