I am trying to use the apache_beam.io.fileio module to read a file, lines.txt, and incorporate its contents into my pipeline.
lines.txt has the following contents:
line1
line2
line3
When I run the following pipeline code:
with beam.Pipeline(options=pipeline_options) as p:
    lines = (
        p
        | beam.io.fileio.MatchFiles(file_pattern="lines.txt")
        | beam.io.fileio.ReadMatches()
    )
    # print file contents to screen
    lines | 'print to screen' >> beam.Map(print)
I get the following output:
<apache_beam.io.fileio.ReadableFile object at 0x000001A8C6C55F08>
I expected
line1
line2
line3
How can I yield my expected result?
Upvotes: 1
The PCollection produced by
    p
    | beam.io.fileio.MatchFiles(file_pattern="lines.txt")
    | beam.io.fileio.ReadMatches()
contains ReadableFile objects, one per matched file, rather than the file contents. That is why print shows the object's repr. To get at the contents, call one of the ReadableFile methods documented in the Apache Beam pydoc.
Below we use read_utf8(), which returns the whole file as a single string:
with beam.Pipeline(options=pipeline_options) as p:
    lines = (
        p
        | beam.io.fileio.MatchFiles(file_pattern="lines.txt")
        | beam.io.fileio.ReadMatches()
        | beam.Map(lambda file: file.read_utf8())
    )
    # print file contents to screen
    lines | 'print to screen' >> beam.Map(print)
and we get our expected result:
line1
line2
line3
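Note that read_utf8() hands back the entire file as one string, so the PCollection above holds a single element per file; it only looks like three lines because the string contains newlines. If one element per line is wanted, a minimal sketch is to split the text with beam.FlatMap. The names file_to_lines and run are hypothetical, and the sketch assumes default pipeline options rather than the pipeline_options used above:

```python
def file_to_lines(readable_file):
    """Return the lines of one matched file as a list of strings."""
    # read_utf8() returns the whole file as a single str;
    # splitlines() breaks it up and drops the line terminators.
    return readable_file.read_utf8().splitlines()


def run():
    # Imported inside the function so the helper above can be
    # exercised on its own, without Apache Beam installed.
    import apache_beam as beam
    from apache_beam.io import fileio

    with beam.Pipeline() as p:
        (
            p
            | fileio.MatchFiles("lines.txt")
            | fileio.ReadMatches()
            # FlatMap emits each item of the returned list as its own element.
            | "split into lines" >> beam.FlatMap(file_to_lines)
            | "print to screen" >> beam.Map(print)
        )
```

Calling run() from a directory containing lines.txt should then print each line as a separate element. Because this reads the whole file into memory before splitting, it is best suited to small files.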
Upvotes: 6