Reputation: 2320
I am loading a csv text file from s3 into spark, filtering and mapping the records and writing the result to s3.
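In essence the job looks like this (a simplified sketch; the bucket names, paths and the actual filter/map logic are placeholders, not my real code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the job; paths and transformation logic are placeholders.
val conf = new SparkConf().setAppName("csv-filter-map")
val sc   = new SparkContext(conf)

val lines = sc.textFile("s3n://my-bucket/input/data.csv")   // read the CSV as plain text lines

val result = lines
  .filter(line => line.nonEmpty)                            // placeholder filter
  .map(line => line.split(",")(0))                          // placeholder mapping: keep first column

result.saveAsTextFile("s3n://my-bucket/output/")            // write the result back to S3
```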
I have tried several input sizes: 100k rows, 1M rows & 3.5M rows.
The former two finish successfully, while the latter (3.5M rows) hangs in some weird state: the job stages monitoring web app (the one on port 4040) stops, and the command-line console gets stuck and does not even respond to Ctrl-C. The master's web monitoring app still responds and shows the state as FINISHED.
In S3, I see an empty directory with a single zero-sized entry _temporary_$folder$. The S3 URL is given using the s3n:// protocol.
I did not see any errors in the logs in the web console. I also tried several cluster sizes (1 master + 1 worker, 1 master + 5 workers) and ended up in the same state.
Has anyone encountered such an issue? Any idea what's going on?
Upvotes: 5
Views: 2330
Reputation: 111
It's possible you are running up against the 5GB object limitation of the s3n FileSystem. You may be able to get around this by using the s3 FileSystem (not s3n), or by partitioning your output.
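For the partitioning route, a rough sketch (assuming your job ends with saveAsTextFile on an RDD, here called result; the partition count and paths are only illustrative):

```scala
// Option 1: spread the output over more part files so each one stays well under 5GB.
// 200 partitions is an illustrative number; choose it based on your total output size.
result.repartition(200).saveAsTextFile("s3n://my-bucket/output/")

// Option 2: write through the S3 block filesystem instead of s3n.
// This requires a bucket dedicated to the block filesystem (see the wiki excerpt below).
result.saveAsTextFile("s3://my-dedicated-bucket/output/")
```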
Here's what the AmazonS3 - Hadoop Wiki says:
S3 Native FileSystem (URI scheme: s3n) A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. [...] The disadvantage is the 5GB limit on file size imposed by S3.
...
S3 Block FileSystem (URI scheme: s3) A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem [...] The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.
...
AmazonS3 (last edited 2014-07-01 13:27:49 by SteveLoughran)
Upvotes: 2