Shengjie

Reputation: 12786

Spark on Parquet vs Spark on Hive (Parquet format)

Our use case is a narrow table (15 fields) but heavy processing against the whole dataset (billions of rows). I am wondering which of the following gives better performance (a rough code sketch of both options follows the list):

Environment: CDH 5.8 / Spark 2.0

  1. Spark on Hive tables (stored as Parquet)
  2. Spark on raw files (Parquet)
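
To be concrete, here is roughly what I mean by the two options; the table name, path and field name below are placeholders:

    import org.apache.spark.sql.SparkSession

    // Placeholder table name and HDFS path; swap in your own.
    val spark = SparkSession.builder()
      .appName("narrow-table-scan")
      .enableHiveSupport()                 // only needed for the Hive-table route
      .getOrCreate()

    // Option 1: Spark on a Hive table stored as Parquet
    val viaHive = spark.table("mydb.events")

    // Option 2: Spark directly on the Parquet files
    val viaFiles = spark.read.parquet("hdfs:///data/events/")

    // The downstream processing is the same DataFrame/Dataset plan either way:
    viaHive.groupBy("field1").count().show()
    viaFiles.groupBy("field1").count().show()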

Upvotes: 5

Views: 3850

Answers (2)

OneCricketeer

Reputation: 191728

There are only two options here: Spark on files, or Spark on Hive. SparkSQL works on both, and you should prefer the Dataset API over RDDs.

If you can define the Dataset schema yourself, Spark reading the raw HDFS files will be faster because you're bypassing the extra hop to the Hive Metastore.
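
For illustration, a minimal sketch of the "define the schema yourself and read the files" route; the path and field names are hypothetical, and only a few of the 15 fields are shown:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("raw-parquet").getOrCreate()

    // Declaring the schema up front: no metastore lookup and no
    // schema-inference pass over the Parquet footers.
    val schema = StructType(Seq(
      StructField("id",       LongType,      nullable = false),
      StructField("event_ts", TimestampType, nullable = true),
      StructField("value",    DoubleType,    nullable = true)
      // ... the remaining fields of the 15-column table ...
    ))

    val df = spark.read.schema(schema).parquet("hdfs:///data/events/")
    df.filter(col("value") > 0).count()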

When I ran a simple test myself years ago (with Spark 1.3), I noticed that extracting 100,000 rows to a CSV file was orders of magnitude faster than a SparkSQL Hive query with the same LIMIT.

Upvotes: 4

Igor Berman

Reputation: 1532

Without additional context about your specific product and use case, I'd vote for SparkSQL on Hive tables, for two reasons:

  1. SparkSQL is usually better than core Spark, because Databricks has put a lot of optimization work into it: it is a higher-level abstraction, which gives the engine room to optimize your code (read about Project Tungsten). In some cases hand-written Spark core code will be faster, but that demands a deep understanding of the internals from the programmer. SparkSQL is also sometimes limited and doesn't let you control low-level mechanisms, but you can always fall back to working with the core RDD API (see the first sketch after this list).

  2. Hive rather than plain files: I'm assuming Hive with an external metastore. The metastore stores the partition definitions of your "tables" (with plain files, a partition is just some directory), and this is one of the most important factors for good performance. When working with files, Spark first has to discover this information itself, which can be time-consuming (the S3 list operation, for example, is very slow). The metastore lets Spark fetch it in a simple and fast way (see the second sketch after this list).
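
A sketch of point 1, assuming a hypothetical table `mydb.events` with a string `key` column: the Dataset/SparkSQL route gets Catalyst and Tungsten optimizations (column pruning, filter pushdown, whole-stage codegen), and you can still drop down to the RDD API when you need manual control.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    import spark.implicits._

    // SparkSQL / Dataset route: the optimizer plans the scan and aggregation.
    val perKey = spark.table("mydb.events")
      .where($"value" > 0)
      .groupBy($"key")
      .count()

    // Fallback to the low-level RDD API for manual control when SparkSQL
    // gets in the way (you lose the optimizer from this point on).
    val manual = perKey.rdd
      .map(row => (row.getString(0), row.getLong(1)))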
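
And a sketch of point 2, assuming the same hypothetical table is partitioned by a `dt` column: with the metastore, Spark gets the partition locations directly; reading the bare path instead forces it to discover the partitions by listing the directory tree first.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Hive table: partition locations come from the metastore, so only the
    // matching partition directories are listed and read.
    spark.table("mydb.events")
      .where("dt = '2017-01-01'")
      .count()

    // Bare path: Spark must walk the directory tree to discover the dt=...
    // partitions before it can prune them (very slow on S3).
    spark.read.parquet("hdfs:///data/events/")
      .where("dt = '2017-01-01'")
      .count()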

Upvotes: 3
