kassnl

Reputation: 259

PySpark randomly fails to write to S3

I am writing my word2vec model to S3 as follows:

model.save(sc, "s3://output/folder")

It usually works without problems, so it is not an AWS credentials issue, but I randomly get the following error.

17/01/30 20:35:21 WARN ConfigurationUtils: Cannot create temp dir with proper permission: /mnt2/s3
java.nio.file.AccessDeniedException: /mnt2
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384)
    at java.nio.file.Files.createDirectory(Files.java:674)
    at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781)
    at java.nio.file.Files.createDirectories(Files.java:767)
    at com.amazon.ws.emr.hadoop.fs.util.ConfigurationUtils.getTestedTempPaths(ConfigurationUtils.java:216)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.initialize(S3NativeFileSystem.java:447)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:111)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2717)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2751)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2733)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:377)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:113)
    at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:88)
    at org.apache.parquet.hadoop.ParquetOutputCommitter.<init>(ParquetOutputCommitter.java:41)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getOutputCommitter(ParquetOutputFormat.java:339)

I have tried this on various clusters and haven't managed to figure it out. Is this a known problem with PySpark?

Upvotes: 1

Views: 1301

Answers (1)

zero323

Reputation: 330063

This is probably related to SPARK-19247. As of today (Spark 2.1.0), ML writers repartition all data to a single partition, which can result in failures for large models. If this is indeed the source of the problem, you can try to patch your distribution manually using code from the corresponding PR.
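If patching is not an option, a minimal sketch of an alternative workaround is to export the learned vectors yourself and write them with an explicit number of partitions, instead of going through model.save() and its single-partition write path. This assumes the mllib Word2VecModel API (getVectors()); the output path and the partition count are placeholders you would adjust for your model size:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# getVectors() returns a word -> vector mapping for mllib's Word2VecModel
vectors = model.getVectors()

# Build one row per word; vectors are converted to plain Python lists
rows = [Row(word=w, vector=list(v)) for w, v in vectors.items()]
df = spark.createDataFrame(rows)

# Write with several partitions so no single task has to hold the whole model
df.repartition(16).write.mode("overwrite").parquet("s3://output/folder/vectors")

Note that this only persists the vectors, not the full model object, so you would need to rebuild or reload the model from those vectors yourself if you need it later.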

Upvotes: 1
