Dnaiel

Reputation: 7832

Why does Spark read from and write to S3 so fast?

I understand the advantage of Spark in terms of processing large-scale data in parallel and in memory.

But how does it avoid hitting a bottleneck when reading and writing data from/to S3? Is that handled in some efficient form by the S3 storage service? Is S3 distributed storage? Please provide some explanation and, if possible, links on how to learn more about this.

Upvotes: 2

Views: 3246

Answers (2)

stevel

Reputation: 13430

Apache Spark talks to S3 via a client library: Amazon's on EMR, or the Apache Hadoop team's everywhere else. If you use s3a:// URLs, you are using the most recent ASF client.
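
As a minimal sketch (made-up bucket and path; assumes Spark 2.x with the hadoop-aws module and AWS credentials already configured), picking the client is just a matter of the URL scheme:

    // Reading through the ASF s3a connector; bucket/path are hypothetical.
    val logs = spark.read.textFile("s3a://example-bucket/logs/2016/")
    println(logs.count())   // the s3a:// scheme selects the ASF client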

We've been doing a lot of work there on speeding things up, see HADOOP-11694.

The performance killers have turned out to be:

  1. Excessive numbers of HEAD requests when working out whether files exist (too many existence checks in the code). Fix: cut down on these.

  2. Closing and reopening connections on seeks. Fix: (a) lazy seek (only do the seek on the read() call, not the seek() call); (b) forward seek by reading and discarding data, which is efficient even up to a few hundred KB (YMMV, etc.).

  3. For binary ORC/Parquet files, adding a special fadvise=random mode, which doesn't attempt a full GET of the source file but instead reads it in blocks. If we need to seek backwards or a long way forward, the rest of the block is discarded and the HTTP 1.1 connection reused: no need to abort the connection and negotiate a new one. A sketch of enabling this follows the list.
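
If your build has it, the random-IO policy from item 3 is switched on with a single Hadoop property. A minimal sketch, assuming a Hadoop 2.8+ s3a client and a made-up path:

    // fs.s3a.experimental.input.fadvise is the Hadoop 2.8+ name for the
    // input policy; whether it is available depends on your Hadoop/EMR build.
    spark.sparkContext.hadoopConfiguration
      .set("fs.s3a.experimental.input.fadvise", "random")
    val orders = spark.read.parquet("s3a://example-bucket/warehouse/orders/")

The same property can also be set cluster-wide via the spark.hadoop. prefix (spark.hadoop.fs.s3a.experimental.input.fadvise) in spark-defaults.conf.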

Some detail is in this talk from last month, Spark and Object Stores, though it doesn't go into the new stuff (in Hadoop 2.8 (forthcoming), HDP 2.5 (shipping), maybe in CDH some time) in depth. It does recommend various settings for performance, though, which are valid today.

Also make sure that any compression you use is splittable (LZO, snappy, ...), and that your files are not so small that there's too much overhead in listing the directory and opening them; a sketch of both points follows.
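
As a rough sketch (paths are made up): writing Parquet with snappy keeps the data splittable at the row-group level, and coalescing first avoids emitting thousands of tiny objects:

    // Fewer, larger, splittable-compressed files; both paths are hypothetical.
    val raw = spark.read.parquet("s3a://example-bucket/raw/")
    raw.coalesce(64)   // tune so each output file is tens to hundreds of MB
      .write
      .option("compression", "snappy")
      .parquet("s3a://example-bucket/curated/")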

Upvotes: 3

John Rotenstein

Reputation: 269161

The only bottleneck within AWS is the network bandwidth of each Amazon EC2 instance, which depends on its instance type.

Throughput within a Region, such as between Amazon EC2 and Amazon S3, is extremely high and is unlikely to limit your ability to transfer data (aside from the EC2 network bandwidth limitation mentioned above).

Amazon S3 is distributed over many servers across multiple Availability Zones within a Region. At very high request rates, Amazon S3 does have some recommended Request Rate and Performance Considerations, but these only apply when making more than 300 PUT/LIST/DELETE requests per second or more than 800 GET requests per second to a particular bucket.

Apache Spark is typically deployed across multiple nodes. Each node has network bandwidth available based on its Instance Type. The parallel nature of Spark means that it can transfer data to/from Amazon S3 much faster than could be done by a single instance.
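
A minimal sketch of what that parallelism looks like from the Spark side (bucket name is made up): each input partition becomes a separate task, so transfers are spread across every node's network link rather than funnelled through one instance's NIC:

    // Each partition (roughly, each S3 object or block of one) is read by
    // a different task, so aggregate bandwidth scales with executor count.
    val events = spark.read.textFile("s3a://example-bucket/events/")
    println(s"parallel S3 readers: ${events.rdd.getNumPartitions}")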

Upvotes: 5
