DZvig

Reputation: 139

How does one access an apache_beam.io.fileio.ReadableFile() object?

I am trying to use the apache_beam.io.fileio module to read a file, lines.txt, and feed its contents into my pipeline.

lines.txt has the following contents:

line1
line2
line3

When I run the following pipeline code:

with beam.Pipeline(options=pipeline_options) as p:

     lines = (
         p
         | beam.io.fileio.MatchFiles(file_pattern="lines.txt")
         | beam.io.fileio.ReadMatches()
     )
     # print file contents to screen
     lines | 'print to screen' >> beam.Map(print)

I get the following output:

<apache_beam.io.fileio.ReadableFile object at 0x000001A8C6C55F08>

I expected

line1
line2
line3

How can I get the expected output?

Upvotes: 1

Views: 1699

Answers (1)

DZvig

Reputation: 139

The resulting PCollection from

p
| beam.io.fileio.MatchFiles(file_pattern="lines.txt")
| beam.io.fileio.ReadMatches()

is a PCollection of ReadableFile objects, one per matched file. To get at a file's contents, call one of the methods documented in the Apache Beam pydoc: open(), read(), or read_utf8().

Below we call read_utf8(), which returns the whole file's contents as a single UTF-8 string:

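import apache_beam as beam
import apache_beam.io.fileio  # make sure the fileio submodule is loaded

# pipeline_options is assumed to be defined earlier, as in the question
with beam.Pipeline(options=pipeline_options) as p:

    lines = (
        p
        | beam.io.fileio.MatchFiles(file_pattern="lines.txt")
        | beam.io.fileio.ReadMatches()
        # each element is a ReadableFile; read its contents as one str
        | beam.Map(lambda file: file.read_utf8())
    )
    # print file contents to screen
    lines | 'print to screen' >> beam.Map(print)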

and we get our expected result:

line1
line2
line3
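
Note that read_utf8() hands back the whole file as one string, so the PCollection above holds a single element per file. If you want one element per line instead, a minimal sketch (not the only way to do it) is to split that string in a helper. The names split_lines and FakeReadableFile are my own and not part of Beam; FakeReadableFile is only a stand-in so the helper can be tried without running a pipeline, and it assumes nothing about ReadableFile beyond the read_utf8() method.

```python
def split_lines(readable_file):
    """Return the file's text as a list of lines, without trailing newlines."""
    # read_utf8() loads the entire file into memory as a single str,
    # so this is only suitable for files that fit in memory.
    return readable_file.read_utf8().splitlines()


class FakeReadableFile:
    """Hypothetical stand-in for fileio.ReadableFile, for local testing only."""

    def __init__(self, text):
        self._text = text

    def read_utf8(self):
        return self._text


if __name__ == "__main__":
    # Try the helper on in-memory text that mirrors lines.txt.
    for line in split_lines(FakeReadableFile("line1\nline2\nline3\n")):
        print(line)
```

In the pipeline, this would replace the beam.Map step with beam.FlatMap(split_lines), since FlatMap emits each item of the returned list as its own element.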

Upvotes: 6
