user7337271

Reputation: 1712

Support for Parquet as an input / output format when working with S3

I've seen a number of questions describing problems when working with S3 in Spark, many of them specifically describing issues with Parquet files, as well as some external sources referring to other issues with Spark - S3 - Parquet combinations. It makes me think that either S3 with Spark, or this complete combination, may not be the best choice.

Am I onto something here? Can anyone provide an authoritative answer explaining the actual state of support for this combination?

Upvotes: 5

Views: 895

Answers (1)

stevel

Reputation: 13480

A lot of the issues aren't Parquet-specific; they stem from the fact that S3 is not a filesystem, despite the APIs trying to make it look like one. Many nominally low-cost operations take multiple HTTPS requests, with the consequent delays.
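To illustrate the cost of one such "nominally low-cost" operation: S3 has no rename primitive, so clients emulate renaming an object as a COPY followed by a DELETE, and renaming a "directory" means doing that for every object under the prefix. The request counts below are a toy model of that emulation, not measurements of any particular client:

```python
# Sketch: why rename is expensive on S3. There is no rename call in the
# S3 API; clients emulate it per object as COPY + DELETE, after a LIST
# to discover what lives under the source prefix. So a "directory"
# rename of N objects costs on the order of 2N + 1 HTTPS requests.
# (Toy model for illustration; real clients may page LIST calls, retry,
# and batch deletes, changing the constants but not the O(N) shape.)

def emulated_rename_requests(num_objects: int) -> int:
    list_calls = 1              # enumerate objects under the source prefix
    copy_calls = num_objects    # one server-side COPY per object
    delete_calls = num_objects  # one DELETE per original object
    return list_calls + copy_calls + delete_calls
```

Compare that with a POSIX filesystem, where the same rename is a single metadata operation regardless of how much data sits under the directory.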

Regarding JIRAs

  • HADOOP-11694: S3A phase II, everything you will get in Hadoop 2.8. Much of this is already in HDP 2.5, and yes, it has significant benefits.
  • HADOOP-13204: the todo list to follow.
  • Regarding Spark (and Hive), the use of rename() to commit work is a killer. It's used at the end of tasks and jobs, and in checkpointing. The more output you generate, the longer things take to complete. The S3Guard work will include a zero-rename committer, but it will take care and time to move things to it.
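Until a zero-rename committer lands, the common mitigations are configuration-level. A sketch of the settings people typically tune (assumed values; verify each key against your own Hadoop and Spark versions before relying on it):

```
# spark-defaults.conf sketch (illustrative, not a recommendation for all workloads)

# v2 of the FileOutputCommitter commits task output directly to the
# destination, cutting one of the rename passes (at the cost of weaker
# failure semantics on an object store).
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2

# Speculative execution plus a rename-based committer on S3 risks
# duplicate/conflicting output; keep it off.
spark.speculation  false
```

None of these make rename() cheap; they only reduce how often and how expensively it is invoked.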

Parquet? Predicate pushdown works, but there are a few other options to speed things up. I list them and others in: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
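For readers unfamiliar with why pushdown helps here: Parquet stores per-column min/max statistics for each row group, so a reader can skip whole row groups whose statistics rule out the predicate, which on S3 means skipping whole HTTPS range requests. This is a toy model of that pruning, not the real parquet-mr reader:

```python
# Sketch of Parquet row-group pruning: each row group carries min/max
# stats per column; a reader evaluates the predicate against the stats
# and skips any group that cannot match. Toy data and logic for
# illustration only.

row_groups = [
    {"min": 0,   "max": 99,  "rows": 100},
    {"min": 100, "max": 199, "rows": 100},
    {"min": 200, "max": 299, "rows": 100},
]

def groups_to_read(predicate_min: int) -> list:
    # For a predicate "col >= predicate_min", any group whose max is
    # below the threshold can be skipped without being fetched.
    return [g for g in row_groups if g["max"] >= predicate_min]
```

On a local filesystem a skipped row group saves a disk read; on S3 it saves a round trip, which is where the pushdown pays off most.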

Upvotes: 3
