Reputation: 557
Using Spark 2.0.2 on EC2 machines, I have been trying to write tables to S3 in Parquet format with partitions, but the application never seems to finish. I can see that Spark writes files into the S3 bucket/folder under _temporary, and that once the saveAsTable job finishes, the application hangs.
Looking at S3 shows that the partitions were generated, with data inside the partition folders (spot-checked), but the _temporary folder is still there, and show tables does not list the new table.
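Roughly, the write is of this shape (the paths, table name, and partition column below are placeholders, simplified from the real job):

```scala
import org.apache.spark.sql.SparkSession

// Placeholder sketch of the write described above; paths and names are not the real ones.
val spark = SparkSession.builder()
  .appName("write-parquet-to-s3")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.read.parquet("s3a://some-bucket/input/")

df.write
  .mode("overwrite")
  .partitionBy("date")                          // partitioned output
  .format("parquet")
  .option("path", "s3a://some-bucket/output/")  // external table location on S3
  .saveAsTable("my_table")                      // the job finishes, then the app appears to hang
```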
Is anyone else experiencing this or has a solution?
Does anyone know what goes on underneath the saveAsTable command?
Upvotes: 1
Views: 874
Reputation: 13480
It's not hanging; it's just having to copy the data from the temporary store to the destination, which takes roughly data size / (10 MB/s), i.e. about 500 seconds for 5 GB of output. Spark is calling Hadoop's FileOutputCommitter to do this, and that committer thinks it's talking to a filesystem where rename() is an instantaneous transaction. On S3 there is no real rename: each file is copied and then the original deleted, so the job commit takes time proportional to the amount of data written.
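A hedged sketch of one common partial mitigation (not a full fix, since the renames are still S3 copies under the hood): switch to FileOutputCommitter algorithm version 2, which renames each task's output into place at task commit rather than in one serial pass at job commit:

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: enable FileOutputCommitter algorithm version 2. Task output
// is then moved to the final destination as each task commits, spreading the
// copy work across the job instead of doing it all serially at job commit.
// This shortens the post-job pause but does not make S3 rename() atomic or
// fast; each rename is still a copy followed by a delete.
val spark = SparkSession.builder().appName("committer-v2").getOrCreate()
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")
```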
Upvotes: 1