SSG

Reputation: 40

Apache Beam I/O Transforms

The Apache Beam documentation Authoring I/O Transforms - Overview states:

Reading and writing data in Beam is a parallel task, and using ParDos, GroupByKeys, etc… is usually sufficient. Rarely, you will need the more specialized Source and Sink classes for specific features.

Could someone please provide a very basic example of how to do this in Python?

For example, if I had a local folder containing 100 jpeg images, how would I:

  1. Use ParDos to read/open the files.
  2. Run some arbitrary code on the images (maybe convert them to grey-scale).
  3. Use ParDos to write the modified images to a different local folder.

Thanks,

Upvotes: 2

Views: 351

Answers (1)

Ankur

Reputation: 779

Here is an example pipeline: https://github.com/apache/beam/blob/fc738ab9ac7fdbc8ac561e580b1a557b919437d0/sdks/python/apache_beam/examples/wordcount.py#L37

In your case, get the names of the files first, then read each file one at a time and write the output. You might also want to push the filenames through a GroupByKey so you can take advantage of the parallelization provided by the runner. In total, your pipeline might look something like: read the list of filenames -> send the filenames through a shuffle using GroupByKey -> get one filename at a time in a ParDo -> read the single file, process it, and write it out in a ParDo.
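Here is a minimal sketch of that flow, assuming the Beam Python SDK with the default (local) runner, the Pillow library for the grayscale conversion, and hypothetical local folder paths:

    import os

    import apache_beam as beam
    from PIL import Image

    # Hypothetical paths; adjust to your folders.
    INPUT_DIR = '/path/to/input'
    OUTPUT_DIR = '/path/to/output'


    def process_image(filename):
        """Read one JPEG, convert it to grayscale, and write it to OUTPUT_DIR."""
        img = Image.open(filename)
        grey = img.convert('L')  # 'L' is Pillow's 8-bit grayscale mode
        grey.save(os.path.join(OUTPUT_DIR, os.path.basename(filename)))
        return filename


    # Collect the filenames up front; Create() turns them into a PCollection.
    filenames = [os.path.join(INPUT_DIR, f)
                 for f in os.listdir(INPUT_DIR)
                 if f.lower().endswith(('.jpg', '.jpeg'))]

    with beam.Pipeline() as p:
        (p
         | 'ListFilenames' >> beam.Create(filenames)
         | 'PairWithKey' >> beam.Map(lambda fn: (fn, None))  # key by filename
         | 'Shuffle' >> beam.GroupByKey()                    # redistribute across workers
         | 'ExtractFilename' >> beam.Map(lambda kv: kv[0])   # one filename at a time
         | 'ProcessAndWrite' >> beam.Map(process_image))

Note that Beam's built-in beam.Reshuffle() transform wraps the pair/group/extract steps into a single transform, so those three lines can be replaced with one 'Shuffle' >> beam.Reshuffle() if you prefer.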

Upvotes: 1
