kalyan chakravarthy

Reputation: 653

Does Spark lock the File while writing to HDFS or S3

I have an S3 location with the below directory structure with a Hive table created on top of it:

s3://<Mybucket>/<Table Name>/<day Partition>

Let's say I have a Spark program that writes data into the above table location, spanning multiple partitions, using the below line of code:

df.write.partitionBy("orderdate").parquet("s3://<Mybucket>/<Table Name>/")

If another program, such as a Hive SQL query or an AWS Athena query, starts reading data from the table at the same time:

Do they pick up the temporary files that are still being written?

Does Spark lock the data files while writing to the S3 location?

How can we handle such concurrency situations using Spark as an ETL tool?

Upvotes: 11

Views: 2577

Answers (2)

stevel

Reputation: 13480

  1. No locks. Not implemented in S3 or HDFS.
  2. The process of committing work in HDFS is not atomic; there is some renaming going on at job commit, which is fast but not instantaneous.
  3. With S3, things are pathologically slow with the classic output committers, which assume rename is atomic and fast.
  4. The Apache S3A committers avoid the renames and only make the output visible at job commit, which is fast but not atomic (see the configuration sketch after this list).
  5. Amazon EMR now has its own S3 committer, but it makes files visible when each task commits, so it exposes readers to incomplete output for longer.
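
The S3A committers are enabled through Spark/Hadoop properties. Below is a minimal sketch of switching on the "directory" variant from a Scala application; it assumes a Spark build that includes the optional spark-hadoop-cloud module and Hadoop 3.1+, and the application name is just a placeholder.

    import org.apache.spark.sql.SparkSession

    // Sketch: route Parquet writes on s3a:// paths through an S3A committer, so task
    // output is staged and only made visible at job commit instead of relying on
    // slow, non-atomic renames. Property values follow the Spark cloud-integration docs.
    val spark = SparkSession.builder()
      .appName("s3a-committer-sketch") // placeholder name
      .config("spark.hadoop.fs.s3a.committer.name", "directory")
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .getOrCreate()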

Upvotes: 3

wypul

Reputation: 837

Spark writes the output in a two-step process. First, it writes the data to a _temporary directory; then, once the write operation is complete and successful, it moves the files to the output directory.
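
A rough sketch of that behaviour for the write in the question (Scala; the DataFrame is called df here, and the temporary and final paths in the comments are illustrative, since exact attempt IDs and file names depend on the committer and the job):

    // The write from the question: partitioned Parquet output to the table location.
    df.write
      .partitionBy("orderdate")
      .parquet("s3://<Mybucket>/<Table Name>/")

    // While tasks run, output is staged under a hidden directory, roughly:
    //   s3://<Mybucket>/<Table Name>/_temporary/0/_temporary/attempt_.../orderdate=2020-01-01/part-....parquet
    // Only at job commit are the files moved into their final locations, e.g.:
    //   s3://<Mybucket>/<Table Name>/orderdate=2020-01-01/part-....snappy.parquet
    //   s3://<Mybucket>/<Table Name>/_SUCCESS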

Do they pick up the temporary files that are still being written?

Because files and directories whose names start with _ are treated as hidden, Hive and AWS Athena do not read them.

Does Spark lock the data files while writing to the S3 location?

Locking or any other concurrency mechanism is not required because of Spark's simple two-step write process.

How can we handle such concurrency situations using Spark as an ETL tool?

Again, by relying on the write-to-a-temporary-location mechanism described above.

One more thing to note: in your example above, after writing the output to the output directory you need to add the new partition to the Hive external table using an Alter table <tbl_name> add partition (...) command or an msck repair table <tbl_name> command; otherwise the data won't be available in Hive.
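
A small sketch of registering the new partition from Spark itself (the database, table name and partition value are placeholders, and it assumes a Hive-enabled SparkSession):

    // Option 1: add a specific partition to the Hive external table.
    spark.sql(
      """ALTER TABLE my_db.my_table
        |ADD IF NOT EXISTS PARTITION (orderdate = '2020-01-01')
        |LOCATION 's3://<Mybucket>/<Table Name>/orderdate=2020-01-01'""".stripMargin)

    // Option 2: let the metastore discover all partitions under the table location.
    spark.sql("MSCK REPAIR TABLE my_db.my_table")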

Upvotes: 1
