Reputation: 525
Google Dataflow - How to specify TextIO when writing to an on-prem server from Dataflow? (Given that the on-prem server is connected to GCP with Cloud VPN)
pipeline.apply("Writer",TextIO.write().to("XXX.XXX.XXX.XXX://tmp/somedirectory/somefilename");
Does providing the on-prem IP and directory as above work when running the Dataflow job? I tried it, and the job completed successfully with elements added in the step summary, but I don't see any files written on the on-prem server. [Not sure if it has anything to do with authentication against the on-prem server.]
Upvotes: 0
Views: 971
Reputation: 2324
Apache Beam TextIO requires the file system to be specified with a scheme prefix, e.g. file://, gs://, or hdfs://. Without one of these, I believe it defaults to the local file system.
So, given that the 'filename' you specified has no scheme, I suspect it is being written to the local disk of the workers, which is not very useful!
So, as @ajp suggests, you need to write to, e.g., GCS and then have your on-prem server read from GCS. You can perhaps use a Pub/Sub message as a signal to the on-prem server that the results are ready.
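As a rough sketch of that pattern (the bucket, project, and topic names below are placeholders I made up, not anything from your pipeline): write with an explicit gs:// scheme, wait in the driver program for the batch job to finish, then publish a Pub/Sub message that your on-prem server subscribes to.
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WriteToGcsThenNotify {
    public static void main(String[] args) throws Exception {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            // Placeholder source; substitute your real input.
            .apply("Read", TextIO.read().from("gs://myPipelineBucket/input/*"))
            // Explicit gs:// scheme so Beam resolves the GCS filesystem instead of local disk.
            .apply("Writer", TextIO.write().to("gs://myPipelineBucket/output/result"));

        // Block until the batch job is done, so the signal is only sent once the files exist.
        pipeline.run().waitUntilFinish();

        // Signal the on-prem server (subscribed to this topic) that the results are ready.
        Publisher publisher = Publisher.newBuilder(TopicName.of("my-project", "results-ready")).build();
        publisher.publish(PubsubMessage.newBuilder()
            .setData(ByteString.copyFromUtf8("gs://myPipelineBucket/output/"))
            .build());
        publisher.shutdown();
    }
}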
Upvotes: 2
Reputation: 712
Using the IP address and path this way will not work with TextIO; a plain file path like that would only work if you run your pipeline locally.
For remote file transfer to an on-premise server from Cloud Dataflow, the best approach is to write the files to a Cloud Storage bucket first, like so:
pipeline.apply(TextIO.Write.named("WriteFilesOnPremise")
    .to("gs://myPipelineBucket/onPremiseFiles"));
Then either download the files from the bucket to your on-premise filesystem with the gsutil command from your local console, fetch them programmatically with the Cloud Storage client library, or mount the bucket as a filesystem on your on-premise system with Cloud Storage FUSE.
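For example, a minimal sketch of the client-library option, assuming Application Default Credentials are set up on the on-prem machine (the bucket and prefix follow the example above, and /tmp/somedirectory is just the path from the question):
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DownloadPipelineOutput {
    public static void main(String[] args) {
        // Authenticates with Application Default Credentials on the on-prem machine.
        Storage storage = StorageOptions.getDefaultInstance().getService();

        // List every output shard the pipeline wrote under the prefix and copy it locally.
        // Roughly equivalent to: gsutil cp gs://myPipelineBucket/onPremiseFiles* /tmp/somedirectory/
        for (Blob blob : storage.list("myPipelineBucket",
                Storage.BlobListOption.prefix("onPremiseFiles")).iterateAll()) {
            Path target = Paths.get("/tmp/somedirectory", Paths.get(blob.getName()).getFileName().toString());
            blob.downloadTo(target);
        }
    }
}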
Upvotes: 1