XY6

Reputation: 3650

Is it possible to read non-text files into a Google Dataflow pipeline?

I would like to read PDF files into the pipeline. However, I haven't found any Apache Beam examples for file formats other than plain text or XML.

Upvotes: 0

Views: 597

Answers (1)

Andrea

Reputation: 191

Neither Dataflow nor the Apache Beam libraries ship a ready-made PDF reader. However, you could use the reader for TensorFlow records linked below as a model and write your own on top of the PDF parsing library of your choice.

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TFRecordIO.java
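
If a full custom source modeled on TFRecordIO is more than you need, a lighter alternative is to match the files and read each one as raw bytes. The sketch below uses the Beam Python SDK's `fileio.MatchFiles` and `fileio.ReadMatches` rather than the Java class linked above, so treat it as a different technique, not as the approach that class implements. `summarize_pdf` and `run` are hypothetical names, and the header check is only a placeholder for a real PDF parsing library.

```python
def summarize_pdf(path, data):
    """Placeholder for real PDF parsing (hypothetical helper).

    Only inspects the "%PDF-x.y" header; swap in the PDF library of
    your choice to extract text or metadata from the bytes.
    """
    is_pdf = data.startswith(b"%PDF-")
    # The version digits (e.g. "1.7") directly follow the "%PDF-" marker.
    version = data[5:8].decode("ascii", "replace") if is_pdf else None
    return {"path": path, "bytes": len(data), "is_pdf": is_pdf, "version": version}


def run(pattern):
    """Read every file matching `pattern` as bytes and summarize it."""
    # Imported here so summarize_pdf can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.io import fileio

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Match" >> fileio.MatchFiles(pattern)
            | "Read" >> fileio.ReadMatches()
            # Each element is a ReadableFile; read() returns the whole file as bytes.
            | "Parse" >> beam.Map(lambda f: summarize_pdf(f.metadata.path, f.read()))
            | "Print" >> beam.Map(print)
        )
```

Calling `run("gs://my-bucket/docs/*.pdf")` (a made-up bucket path) would print one summary dict per file. Because `read()` loads the whole file into memory, this suits ordinary documents better than very large files.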

Upvotes: 1
