Faizal

Reputation: 363

Spark reading files from ADLS gen1

I have a process that overwrites existing files in an ADLS Gen1 directory, and another process that then initiates a Spark job to read the latest overwritten files.

Most of the time, Spark does not seem to read the latest updated files. After adding a delay (30–60 s) before the second process reads the files, it seems to work.

What would be the best approach to resolve this issue without introducing any delays?

Appreciate the feedback.

Upvotes: 0

Views: 356

Answers (1)

Jay Gong

Reputation: 23782

Based on the HDFS documentation I found, the writer lock does not prevent other clients from reading. Please see the statement below:

writers must obtain an exclusive lock for a file before they’d be allowed to write / append / truncate data in those files. Notably, this exclusive lock does NOT prevent other clients from reading the file, (so a client could be writing a file, and at the same time another could be reading the same file).

So, in my opinion, if a file is currently being written, which takes some period of time, some delay in read consistency is inevitable.


If you do want to ensure strong read consistency (i.e., you don't want clients to read stale data), I can offer a workaround for your reference: put a Redis database in front of your writes and reads!

Whenever you perform a read or write operation, first check whether a specific key exists in the Redis database. If it does not, write that key-value pair into Redis, then do your business logic processing, and finally don't forget to delete the key.

Although this may be a little cumbersome and may affect performance, I think it can meet your needs. BTW, considering that the business logic may fail or crash so that the key is never released, you can set a TTL when creating the key to avoid this situation.
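As a minimal sketch of this pattern, assuming Python with the redis-py client (the key name, TTL, and the two placeholder functions are illustrative assumptions, not from the question), both processes would acquire the same key before touching the directory:

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379)

LOCK_KEY = "adls:mydir:lock"  # hypothetical key guarding the ADLS directory
LOCK_TTL = 300                # seconds; key auto-expires if a job crashes

def with_directory_lock(do_work):
    """Run do_work() only while holding the Redis key; wait until it is free."""
    while True:
        # SET ... NX EX: succeeds only if the key does not already exist, and
        # expires it after LOCK_TTL seconds so a crash never holds it forever.
        if r.set(LOCK_KEY, "locked", nx=True, ex=LOCK_TTL):
            try:
                do_work()
            finally:
                r.delete(LOCK_KEY)  # release the key once the work is done
            return
        time.sleep(1)  # key exists: the other process is busy; check again

def overwrite_adls_files():
    """Placeholder for the process that overwrites the ADLS Gen1 files."""
    pass

def spark_read_latest_files():
    """Placeholder for the Spark job that reads the latest files."""
    pass

# Writer process: overwrite the files only while holding the key.
with_directory_lock(overwrite_adls_files)

# Reader process: start the Spark read only after acquiring the same key.
with_directory_lock(spark_read_latest_files)
```

This way the Spark job never starts reading while an overwrite is in flight, and the TTL guarantees the key is released even if a process crashes mid-write.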

Upvotes: 1
