Reputation: 7832
I understand the advantage of Spark in terms of processing large-scale data in parallel and in-memory.
But how does it avoid a bottleneck on reads and writes when the data lives in S3? Is that handled in some efficient way by the S3 storage service? Is S3 distributed storage? Please provide some explanation and, if possible, links on where to learn more about this.
Upvotes: 2
Views: 3246
Reputation: 13430
Apache Spark talks to S3 via a client library: Amazon's on EMR, the Apache Hadoop team's everywhere else. If you use s3a:// URLs, you are using the most recent ASF client.
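In Spark code, the choice of client is just the URL scheme on the path. A minimal sketch, assuming hadoop-aws and credentials are already configured; the bucket name and paths are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-read-write")
  .getOrCreate()

// Each executor task opens its own HTTP connection to S3 and reads its split.
val df = spark.read.parquet("s3a://my-bucket/input/")

// Writes go through the same s3a client.
df.write.parquet("s3a://my-bucket/output/")
```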
We've been doing a lot of work there on speeding things up; see HADOOP-11694.
The performance killers have turned out to be:
Excessive numbers of HEAD requests when working out whether files exist (too many checks in the code). Fix: cut down on these.
Closing and reopening connections on seeks. Fix: (a) lazy seek (only do the seek on the read() call, not the seek() call); (b) forward seek by reading and discarding data, which is efficient even up to a few hundred KB (YMMV, etc.).
For binary ORC/Parquet files, adding a special fadvise=random mode, which doesn't attempt a full GET of the source file but instead reads it in blocks. If we need to seek back or a long way forward, the rest of the block is discarded and the HTTP 1.1 connection is reused: no need to abort the connection and renegotiate a new one. (See the configuration sketch just after this list.)
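If you're on a version with this feature (Hadoop 2.8+), the random-IO policy is switched on through s3a configuration keys; a sketch of how you'd set it from Spark. The readahead value is illustrative only, and the fadvise key is marked experimental, so check the s3a docs for your release:

```scala
val hc = spark.sparkContext.hadoopConfiguration

// Random-IO input policy for columnar formats (ORC/Parquet) on s3a.
hc.set("fs.s3a.experimental.input.fadvise", "random")

// How far a forward seek will read-and-discard before reopening the stream.
// 256 KB here is illustrative, not a tuned recommendation.
hc.set("fs.s3a.readahead.range", "262144")
```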
Some detail is in this talk from last month, Spark and Object Stores, though it doesn't go in depth into the new stuff (in Hadoop 2.8 (forthcoming), HDP 2.5 (shipping), maybe in CDH some time). It does recommend various settings for performance, though, which are valid today.
Also make sure any compression you use is splittable (LZO, snappy, ...), and that your files are not so small that there's too much overhead in listing the directory and opening them.
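One way to avoid the many-small-files problem is to reduce the number of output partitions before writing. A sketch, reusing the df from the earlier snippet; the partition count of 64 is purely illustrative and workload-dependent:

```scala
// Fewer, larger files mean less per-file overhead on listing and opening.
df.coalesce(64)
  .write
  .mode("overwrite")
  .parquet("s3a://my-bucket/compacted/")
```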
Upvotes: 3
Reputation: 269161
The only bottleneck within AWS is the network bandwidth of each Amazon EC2 instance, which varies by Instance Type.
Throughput within a Region, such as between Amazon EC2 and Amazon S3, is extremely high and is unlikely to limit your ability to transfer data (aside from that per-instance network bandwidth limit).
Amazon S3 is distributed over many servers across multiple Availability Zones within a Region. At very high request rates, Amazon S3 does have some recommended Request Rate and Performance Considerations, but these only come into play when making more than 300 PUT/LIST/DELETE requests per second or more than 800 GET requests per second to a particular bucket.
Apache Spark is typically deployed across multiple nodes. Each node has network bandwidth available based on its Instance Type. The parallel nature of Spark means that it can transfer data to/from Amazon S3 much faster than a single instance could.
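To make that parallelism concrete, here's a minimal sketch: each input split becomes a task, and every task opens its own connection to S3, so aggregate transfer rate grows with the number of executors (up to the per-instance network limits). The bucket path is a hypothetical placeholder:

```scala
// Each partition becomes a task, potentially on a different node, so the
// S3 reads run concurrently across the cluster.
val lines = spark.sparkContext.textFile("s3a://my-bucket/logs/")
println(s"partitions (concurrent S3 readers): ${lines.getNumPartitions}")
```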
Upvotes: 5