Reputation: 653
I have an S3 location with the directory structure below, and a Hive table created on top of it:
s3://<Mybucket>/<Table Name>/<day Partition>
Let's say I have a Spark program that writes data into the above table location, spanning multiple partitions, using the following line of code:
df.write.partitionBy("orderdate").parquet("s3://<Mybucket>/<Table Name>/")
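For completeness, here is a minimal, self-contained sketch of that write, assuming PySpark; the sample data is illustrative and the bucket and table names remain placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-write").getOrCreate()

# Illustrative sample data; each distinct orderdate value becomes
# its own day-partition directory under the table location.
df = spark.createDataFrame(
    [("2023-01-01", 1), ("2023-01-02", 2)],
    ["orderdate", "order_id"],
)

df.write.mode("append").partitionBy("orderdate").parquet(
    "s3://<Mybucket>/<Table Name>/"
)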
If another program, such as a Hive SQL query or an AWS Athena query, starts reading data from the table at the same time:
Do they pick up the temporary files that are still being written?
Does Spark lock the data files while writing to the S3 location?
How can we handle such concurrency situations when using Spark as an ETL tool?
Upvotes: 11
Views: 2577
Reputation: 837
Spark writes its output in a two-step process: it first writes the data to a _temporary directory, and once the write operation completes successfully, it moves the files to the output directory.
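Here is a minimal sketch of what the committed output looks like, assuming PySpark and using a local path so it can be run directly; the same _temporary-then-move commit applies when the destination is an s3:// location:

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("commit-demo").getOrCreate()

df = spark.createDataFrame(
    [("2023-01-01", 1), ("2023-01-02", 2)],
    ["orderdate", "order_id"],
)

out = "/tmp/orders_demo"
df.write.mode("overwrite").partitionBy("orderdate").parquet(out)

# After a successful commit the _temporary directory is gone and a
# _SUCCESS marker sits next to the partition directories.
print(sorted(os.listdir(out)))
# e.g. ['_SUCCESS', 'orderdate=2023-01-01', 'orderdate=2023-01-02']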
Do they pick up the temporary files that are still being written?
No. Since files and directories whose names start with _ are treated as hidden, Hive and AWS Athena do not read them.
Does Spark lock the data files while writing to the S3 location?
No. Locking, or any other concurrency mechanism, is not required because of Spark's simple two-step write process.
How can we handle such concurrency situations when using Spark as an ETL tool?
Again, by relying on the same write-to-a-temporary-location-then-move mechanism.
One more thing to note: in your example above, after the output has been written to the output directory, you need to add the new partitions to the Hive external table using an ALTER TABLE <tbl_name> ADD PARTITION (...) command or an MSCK REPAIR TABLE <tbl_name> command; otherwise the data won't be visible in Hive.
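A sketch of both options issued from Spark itself, assuming the external table already exists and Hive support is enabled; the database, table, and partition value (mydb.orders, orderdate='2023-01-01') are placeholders:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partition-repair")
    .enableHiveSupport()
    .getOrCreate()
)

# Option 1: register one new partition explicitly.
spark.sql(
    "ALTER TABLE mydb.orders "
    "ADD IF NOT EXISTS PARTITION (orderdate='2023-01-01')"
)

# Option 2: let the metastore discover all partitions present on disk.
spark.sql("MSCK REPAIR TABLE mydb.orders")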
Upvotes: 1