user3542930

Reputation: 557

Writing parquet data into S3 using saveAsTable does not complete

Using Spark 2.0.2 on EC2 machines, I have been trying to write tables into S3 in parquet format with partitions, but the application never seems to finish. I can see that Spark has written files into the S3 bucket/folder under _temporary, and that once the saveAsTable job finishes, the application hangs.
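For reference, this is a minimal sketch of the kind of write I am doing (the table name, partition column, and bucket path here are placeholders, not my real ones):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object WriteParquetToS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("saveAsTable-to-s3")
      .enableHiveSupport() // saveAsTable registers the table in the Hive metastore
      .getOrCreate()
    import spark.implicits._

    // Toy DataFrame standing in for the real data.
    val df = Seq((1, "2017-01-01"), (2, "2017-01-02")).toDF("id", "event_date")

    df.write
      .mode(SaveMode.Overwrite)
      .format("parquet")
      .partitionBy("event_date")
      .option("path", "s3a://my-bucket/warehouse/events") // placeholder bucket/path
      .saveAsTable("events")                              // placeholder table name

    spark.stop()
  }
}
```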

Taking a look at S3 shows that the partitions were generated with data inside them (spot-checked), but the _temporary folder is still there, and show tables does not include the new table.

Is anyone else experiencing this or has a solution?

Does anyone know what goes on underneath the saveAsTable command?

Upvotes: 1

Views: 874

Answers (1)

stevel

Reputation: 13480

It's not hanging; it's copying the data from the temporary store to the destination, which takes time on the order of data size / (10 MB/s). Spark is calling Hadoop's FileOutputCommitter to do this, and that committer assumes it is talking to a filesystem where rename() is an instantaneous transaction. On S3 it is not: a rename is a copy followed by a delete, so the final commit pass takes time proportional to the amount of data written.
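This is not part of the answer above, but assuming Hadoop 2.7+ is on the classpath, one commonly suggested mitigation is to switch the FileOutputCommitter to algorithm version 2, which renames each task's output as that task commits instead of in one serial pass at job commit:

```scala
import org.apache.spark.sql.SparkSession

// Configuration sketch: reduce the serial job-commit rename cost on S3.
// mapreduce.fileoutputcommitter.algorithm.version is a standard Hadoop
// MapReduce property, passed through via Spark's "spark.hadoop." prefix.
val spark = SparkSession.builder()
  .appName("s3-committer-tuning")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()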

Upvotes: 1
