ZachB

Reputation: 15366

Temp files in Google Cloud Dataflow

I'm trying to write temporary files on the workers executing Dataflow jobs, but the files seem to get deleted while the job is still running. If I SSH into the running VM, I can execute the exact same file-generating command and the files are not destroyed -- perhaps the cleanup happens only for the Dataflow runner user. Is it possible to use temp files, or is this a platform limitation?

Specifically, I'm attempting to write to the location returned by Guava's Files.createTempDir(), which is /tmp/someidentifier.

Edit: Not sure what was happening when I posted, but java.nio.file.Files.createTempDirectory() works...
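For reference, roughly what the working NIO call looks like (the prefix and file name here are illustrative):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Creates a fresh directory under java.io.tmpdir, e.g. /tmp/scratch1234567890
Path tmpDir = Files.createTempDirectory("scratch");
Path tmpFile = tmpDir.resolve("data.txt");
Files.write(tmpFile, "hello".getBytes(StandardCharsets.UTF_8));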

Upvotes: 3

Views: 4115

Answers (2)

Jeremy Lewi

Reputation: 6776

We make no explicit guarantee about the lifetime of files you write to the local disk.

That said, writing to a temporary file inside processElement will work. You can write to and read from it within the same processElement call. Similarly, any files created in DoFn.startBundle will be visible in processElement and finishBundle.
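For example, a minimal sketch in the Dataflow 1.x SDK style (the class name, file prefix, and element type are illustrative, not from the original answer):

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.common.io.Files;
import java.io.File;
import java.nio.charset.StandardCharsets;

static class TempFileFn extends DoFn<String, String> {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    // The file is created, written, and read back within a single
    // processElement call, so its lifetime is not an issue.
    File tmp = File.createTempFile("scratch", ".txt");
    try {
      Files.write(c.element(), tmp, StandardCharsets.UTF_8);
      c.output(Files.toString(tmp, StandardCharsets.UTF_8));
    } finally {
      tmp.delete();
    }
  }
}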

You should avoid writing to /dataflow/logs/taskrunner/harness. Writing files there might conflict with Dataflow's logging. We encourage you to use the standard Java APIs File.createTempFile() and Files.createTempDirectory() instead.

If you want to preserve data beyond finishBundle, you should write it to durable storage such as GCS. You can do this by emitting the data as a sideOutput and then using TextIO or one of the other writers, as in the sketch below. Alternatively, you could write to GCS directly from inside your DoFn.
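A sketch of the sideOutput route, again in the Dataflow 1.x SDK style; the tags, element types, input PCollection, and GCS path are all illustrative:

import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollectionTuple;
import com.google.cloud.dataflow.sdk.values.TupleTag;
import com.google.cloud.dataflow.sdk.values.TupleTagList;

final TupleTag<String> mainTag = new TupleTag<String>() {};
final TupleTag<String> durableTag = new TupleTag<String>() {};

// Assumes input is a PCollection<String>.
PCollectionTuple results = input.apply(
    ParDo.withOutputTags(mainTag, TupleTagList.of(durableTag))
         .of(new DoFn<String, String>() {
           @Override
           public void processElement(ProcessContext c) {
             c.output(c.element());                 // main output
             c.sideOutput(durableTag, c.element()); // data to persist
           }
         }));

// Write the side output to durable storage on GCS.
results.get(durableTag).apply(TextIO.Write.to("gs://my-bucket/output/part"));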

Since Dataflow runs inside containers, you won't be able to see the files by SSHing into the VM. The container has some of the host VM's directories mounted, but /tmp is not one of them. You would need to attach to the appropriate container, e.g. by running

docker exec -t -i <CONTAINER ID> /bin/bash

That command would start a shell inside a running container.
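To find the container ID, you can first list the running containers on the VM:

docker ps

and pick the worker harness container from the output.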

Upvotes: 8

jkff

Reputation: 17913

Dataflow workers run in a Docker container on the VM, which has some of the directories of the host VM mounted, but apparently /tmp is not one of them.

Try writing your temp files, e.g., to /dataflow/logs/taskrunner/harness, which will be mapped to /var/log/dataflow/taskrunner/harness on the host VM.
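For illustration, a minimal sketch of that approach (the file prefix is hypothetical, and note that the other answer advises against this location because it can conflict with Dataflow's logging):

import java.io.File;

// Create a scratch file under the container path that is mapped to the host VM.
File dir = new File("/dataflow/logs/taskrunner/harness");
File tmp = File.createTempFile("scratch", ".txt", dir);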

Upvotes: 2
