Haden Hooyeon Lee

Reputation: 305

temp files remain in GCS after a Dataflow job "succeeded"

My team runs several hourly/daily Dataflow jobs, most of which read from and write to GCS (that is, we have dozens of recurring Dataflow jobs scheduled to run within a day). Some jobs read data from GCS that was produced by previous jobs. Roughly once or twice a week, we face the following issue:

The reason, as far as we were able to debug, is the following:

We are wondering about the following:

I searched using GCS and Dataflow as keywords, but found nothing close to the issue we are having -- I might have missed something, though, so any help will be really appreciated!

Upvotes: 0

Views: 1579

Answers (1)

jkff

Reputation: 17913

Sorry for the trouble. This is a bug in TextIO.Write: when deleting temporary files, it suffers from GCS eventual consistency, so the listing it performs may not return all of the temporary files, and it fails to find and delete some of them.

Fortunately, it still sees all the correct files when copying temporary files to their final location, so there is no data loss.

We will look into providing a fix.

Note that, again due to GCS eventual consistency, job B can also fail to read some of the outputs produced by job A. This will remain true even with the potential fix, and Dataflow has no easy way of addressing it right now. However, the chances of this decrease as you increase the interval between A finishing and B starting.

I would recommend, if possible, joining A and B into a single pipeline, representing this data as an intermediate PCollection. If you need the data materialized as text on GCS for other purposes (e.g. manual inspection, serving, processing by a different tool, etc.), you can still do that from the joint pipeline; just do not use GCS for passing data from one pipeline to another.
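A minimal sketch of that structure, assuming the Dataflow Java SDK (the transform names `StepA` and `StepB` and the `gs://...` paths are hypothetical placeholders for your own logic and buckets):

```java
// One pipeline replacing jobs A and B; the hand-off happens in an
// in-memory PCollection rather than through GCS files.
Pipeline p = Pipeline.create(options);

// Formerly job A: read the raw input and transform it.
PCollection<String> intermediate = p
    .apply(TextIO.Read.from("gs://my-bucket/input/*"))   // hypothetical path
    .apply(ParDo.of(new StepA()));                       // hypothetical DoFn

// Optional: still materialize the intermediate data on GCS for
// inspection or other tools -- but nothing downstream reads it back.
intermediate.apply(TextIO.Write.to("gs://my-bucket/intermediate/part"));

// Formerly job B: consume the PCollection directly, no GCS round-trip.
intermediate
    .apply(ParDo.of(new StepB()))                        // hypothetical DoFn
    .apply(TextIO.Write.to("gs://my-bucket/output/part"));

p.run();
```

Because `StepB` consumes the PCollection directly, the eventual-consistency window between A's write and B's read disappears; the GCS write of the intermediate data becomes a side branch that nothing depends on.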

Upvotes: 3
