Reputation: 2980
I'm using beam.io.ReadFromText to process data from text files.
Parsing the files is more complex than reading by lines (there is some state that needs to be carried and changed from line to line).
Can I make Beam read each file on a single worker, without parallelizing within the file? Is there any other best practice for cases like this?
Upvotes: 2
Views: 877
Reputation: 17913
Yes, you are free to do arbitrary processing of files yourself, using the FileSystems API. This is what ReadFromText and all other file-based built-in transforms do under the hood.
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

def ParseFile(name):
    with FileSystems.open(name) as f:
        ... Parse the file and yield elements ...

(p | beam.Create(['/path/to/file'])
   | beam.FlatMap(ParseFile))
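To make the "state carried from line to line" part concrete, here is a minimal sketch of what the parsing function might look like. The file format is made up for illustration (a line like "[name]" starts a section, every other non-blank line is a record in that section), and parse_lines and parse_file are hypothetical names, not part of Beam. It also assumes the handle returned by FileSystems.open is binary and the file is UTF-8.

```python
import io


def parse_lines(lines):
    """Yield (section, record) pairs; `section` is state carried between lines.

    Hypothetical format: a line like "[name]" starts a new section, and every
    other non-blank line is a record belonging to the current section.
    """
    section = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # State change: later records belong to this section.
            section = line[1:-1]
        else:
            yield (section, line)


def parse_file(name):
    """Open one file via Beam's FileSystems and parse it sequentially."""
    # Imported lazily so parse_lines can be used and tested without Beam.
    from apache_beam.io.filesystems import FileSystems

    with FileSystems.open(name) as f:
        # Assumption: the handle is binary, so decode it as UTF-8 text.
        yield from parse_lines(io.TextIOWrapper(f, encoding="utf-8"))
```

In a pipeline this would be plugged in as beam.Create([...file names...]) | beam.FlatMap(parse_file). Each file name is a single element, so one file is read start to finish by one call and the state stays consistent; if you pass several file names, different files can still be processed in parallel.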
Upvotes: 4