Roshan Fernando

Reputation: 525

Google Dataflow - How to specify the TextIO in Java if writing to an on-prem server?

Google Dataflow - How to specify the TextIO if writing to an On-prem server from Dataflow? (Provided that the On-prem server is connected to GCP with Cloud VPN)

pipeline.apply("Writer",TextIO.write().to("XXX.XXX.XXX.XXX://tmp/somedirectory/somefilename");

Does providing the on-prem IP and directory like in the above work when running the Dataflow job? I tried it, and the job completed successfully with elements added in the step summary, but I don't see any files written on the on-prem server. [Not sure if it has anything to do with authentication against the on-prem server]

Upvotes: 0

Views: 971

Answers (2)

RedPandaCurios

Reputation: 2324

Apache Beam's TextIO requires the file system to be specified with a scheme prefix, e.g. file://, gs://, or hdfs://. Without one of these I believe it defaults to the local file system.

https://cloud.google.com/blog/products/data-analytics/review-of-input-streaming-connectors-for-apache-beam-and-apache-spark

So given that the 'filename' you specified does not have a scheme, I suspect it will be written to the local disk of the workers, which is not very useful!
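As a rough sketch of the difference (assuming lines is a PCollection&lt;String&gt;; the bucket name is a placeholder):

// Explicit gs:// scheme: output lands in Cloud Storage.
lines.apply("WriteToGcs", TextIO.write().to("gs://my-bucket/output/results"));

// No scheme: the path is resolved against each worker's local disk,
// which is why nothing appears on the on-prem server.
lines.apply("WriteToWorkerDisk", TextIO.write().to("/tmp/somedirectory/somefilename"));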

So, as @alp suggests, you need to write to, e.g., GCS and then get your on-prem server to read from GCS. You can perhaps use a Pub/Sub message as a signal to the on-prem server that the results are ready.
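One possible shape for that signal (just a sketch, with made-up project and topic names; imports and error handling omitted) is to publish the Pub/Sub message from the launcher program once the pipeline has finished, using the google-cloud-pubsub client library:

// Block until the Dataflow job has written the files to GCS.
pipeline.run().waitUntilFinish();

// Then tell the on-prem consumer that the results are ready.
TopicName topic = TopicName.of("my-project", "dataflow-results-ready");
Publisher publisher = Publisher.newBuilder(topic).build();
publisher.publish(PubsubMessage.newBuilder()
    .setData(ByteString.copyFromUtf8("gs://my-bucket/output/"))
    .build());
publisher.shutdown();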

Upvotes: 2

alp

Reputation: 712

Using the IP address and path this way will not work with TextIO; a plain file path like that only works when you run your pipeline locally.

For remote file transfer to an on-premise server from Cloud Dataflow, the best way is to write the files to a Cloud Storage bucket first, like so:

pipeline.apply("WriteFilesOnPremise",
   TextIO.write().to("gs://myPipelineBucket/onPremiseFiles"));

Then either download the files from the bucket to your on-premise filesystem from your local console with the gsutil command, download them programmatically with the Cloud Storage client library (see the sketch below), or mount the bucket as a filesystem on your on-premise system with Cloud Storage FUSE.
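For the programmatic option, a minimal sketch using the google-cloud-storage Java client (bucket, object and local path are placeholders; the actual object names depend on how the write was sharded):

import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.file.Paths;

public class DownloadPipelineOutput {
  public static void main(String[] args) {
    Storage storage = StorageOptions.getDefaultInstance().getService();
    // TextIO writes sharded files, e.g. onPremiseFiles-00000-of-00003.
    Blob blob = storage.get(BlobId.of("myPipelineBucket", "onPremiseFiles-00000-of-00001"));
    blob.downloadTo(Paths.get("/data/onPremiseFiles-00000-of-00001"));
  }
}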

Upvotes: 1
