dataapp
Reputation: 61

Spark dataframe CSV vs Parquet

I am a beginner in Spark and trying to understand the mechanics of Spark DataFrames. I am comparing the performance of SQL queries on a Spark SQL DataFrame when loading data from CSV versus Parquet. My understanding is that once the data is loaded into a Spark DataFrame, it shouldn't matter where the data was sourced from (CSV or Parquet). However, I see a significant performance difference between the two. I am loading the data using the following commands and then writing queries against it.

dataframe_csv = sqlcontext.read.format("csv").load()

dataframe_parquet = sqlcontext.read.parquet()

Please explain the reason for the difference.

Upvotes: 3

Views: 7725

Answers (1)

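The difference comes from how the two formats are stored and from how Spark evaluates DataFrames. A DataFrame is lazy: `read` only defines a plan, and unless you call `cache()` or `persist()`, every query goes back to the source files, so the source format matters on each query. Parquet is a columnar, binary format with an embedded schema, so Spark can read only the columns a query needs and can often skip chunks of the file using the statistics stored in it. CSV is plain text: Spark has to scan and parse every line and convert strings to typed values, even when the query touches a single column. Parquet files are also usually smaller on disk because columnar data compresses well. As a result, queries on Parquet are typically faster than the same queries on CSV.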
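To make the columnar versus plain-text distinction concrete, here is a toy sketch in plain Python. It is not Spark and not the real Parquet format: the sample table, the column names (`id`, `name`, `amount`), and the JSON-per-column "store" are all made up for illustration. It only shows the general idea that a row-oriented text file must be scanned in full to answer a one-column query, while a column-oriented layout lets the reader touch just that column.

```python
import csv
import io
import json

# Hypothetical sample table: 1000 rows, three columns.
rows = [{"id": i, "name": f"user{i}", "amount": i * 2} for i in range(1000)]
columns = ["id", "name", "amount"]

# Row-oriented text layout (CSV): all fields of a row are stored together.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# Column-oriented layout (toy stand-in for Parquet): one typed blob per column.
columnar = {col: json.dumps([r[col] for r in rows]) for col in columns}


def sum_amount_from_csv(text):
    """Parse every line and convert the text field to int to reach 'amount'."""
    reader = csv.DictReader(io.StringIO(text))
    total = sum(int(rec["amount"]) for rec in reader)
    return total, len(text)  # total, characters scanned


def sum_amount_from_columnar(store):
    """Decode only the 'amount' blob; the other columns are never read."""
    blob = store["amount"]
    return sum(json.loads(blob)), len(blob)  # total, characters scanned


csv_total, csv_scanned = sum_amount_from_csv(csv_text)
col_total, col_scanned = sum_amount_from_columnar(columnar)

print("same result:", csv_total == col_total)
print("characters scanned (CSV):", csv_scanned)
print("characters scanned (columnar):", col_scanned)
```

Both paths return the same sum, but the columnar path scans far fewer characters because it never looks at `id` or `name`. In Spark the same effect shows up on every query that is not served from a cached DataFrame.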

Upvotes: 1
