user7337271

Reputation: 1712

Support for Parquet as an input / output format when working with S3

I've seen a number of questions describing problems when working with S3 in Spark, many of them specifically describing issues with Parquet files, as well as some external sources referring to other issues with Spark - S3 - Parquet combinations. It makes me think that either S3 with Spark, or this complete combination, may not be the best choice.

Am I onto something here? Can anyone provide an authoritative answer explaining the actual state of support for this combination?

Upvotes: 5

Views: 895

Answers (1)

stevel

Reputation: 13480

A lot of the issues aren't Parquet-specific; they stem from the fact that S3 is not a filesystem, despite the APIs trying to make it look like one. Many nominally low-cost operations take multiple HTTPS requests, with the consequent delays.
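To illustrate the cost of one such "nominally low-cost" operation: S3 has no rename primitive, so clients emulate renaming an object as a COPY followed by a DELETE, and renaming a "directory" means doing that for every object under the prefix. The request counts below are a toy model of that emulation, not measurements of any particular client:

```python
# Sketch: why rename is expensive on S3. There is no rename call in the
# S3 API; clients emulate it per object as COPY + DELETE, after a LIST
# to discover what lives under the source prefix. So a "directory"
# rename of N objects costs on the order of 2N + 1 HTTPS requests.
# (Toy model for illustration; real clients may page LIST calls, retry,
# and batch deletes, changing the constants but not the O(N) shape.)

def emulated_rename_requests(num_objects: int) -> int:
    list_calls = 1              # enumerate objects under the source prefix
    copy_calls = num_objects    # one server-side COPY per object
    delete_calls = num_objects  # one DELETE per original object
    return list_calls + copy_calls + delete_calls
```

Compare that with a POSIX filesystem, where the same rename is a single metadata operation regardless of how much data sits under the directory.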

Regarding JIRAs

  • HADOOP-11694: S3A phase II, everything you will get in Hadoop 2.8. Much of this is already in HDP 2.5, and yes, it has significant benefits.
  • HADOOP-13204: the todo list to follow.
  • Regarding Spark (and Hive), the use of rename() to commit work is a killer. It's used at the end of tasks and jobs, and in checkpointing. The more output you generate, the longer things take to complete. The S3Guard work will include a zero-rename committer, but it will take care and time to move things to it.
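Until a zero-rename committer lands, the common mitigations are configuration-level. A sketch of the settings people typically tune (assumed values; verify each key against your own Hadoop and Spark versions before relying on it):

```
# spark-defaults.conf sketch (illustrative, not a recommendation for all workloads)

# v2 of the FileOutputCommitter commits task output directly to the
# destination, cutting one of the rename passes (at the cost of weaker
# failure semantics on an object store).
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version  2

# Speculative execution plus a rename-based committer on S3 risks
# duplicate/conflicting output; keep it off.
spark.speculation  false
```

None of these make rename() cheap; they only reduce how often and how expensively it is invoked.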

Parquet? Predicate pushdown works, but there are a few other options to speed things up. I list them and others in: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
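For readers unfamiliar with why pushdown helps here: Parquet stores per-column min/max statistics for each row group, so a reader can skip whole row groups whose statistics rule out the predicate, which on S3 means skipping whole HTTPS range requests. This is a toy model of that pruning, not the real parquet-mr reader:

```python
# Sketch of Parquet row-group pruning: each row group carries min/max
# stats per column; a reader evaluates the predicate against the stats
# and skips any group that cannot match. Toy data and logic for
# illustration only.

row_groups = [
    {"min": 0,   "max": 99,  "rows": 100},
    {"min": 100, "max": 199, "rows": 100},
    {"min": 200, "max": 299, "rows": 100},
]

def groups_to_read(predicate_min: int) -> list:
    # For a predicate "col >= predicate_min", any group whose max is
    # below the threshold can be skipped without being fetched.
    return [g for g in row_groups if g["max"] >= predicate_min]
```

On a local filesystem a skipped row group saves a disk read; on S3 it saves a round trip, which is where the pushdown pays off most.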

Upvotes: 3
