Reputation: 40
The Apache Beam documentation "Authoring I/O Transforms - Overview" states:
Reading and writing data in Beam is a parallel task, and using ParDos, GroupByKeys, etc… is usually sufficient. Rarely, you will need the more specialized Source and Sink classes for specific features.
Could someone please provide a very basic example of how to do this in Python?
For example, if I had a local folder containing 100 jpeg images, how would I read them into a pipeline, process them, and write the results out?
Thanks,
Upvotes: 2
Views: 351
Reputation: 779
Here is an example of a pipeline: https://github.com/apache/beam/blob/fc738ab9ac7fdbc8ac561e580b1a557b919437d0/sdks/python/apache_beam/examples/wordcount.py#L37
In your case, get the names of the files first, then read each file one at a time and write the output. You might also want to push the file names through a GroupByKey shuffle so the runner can parallelize the per-file work. In total, your pipeline might look something like: Read list of filenames -> Send filenames through a GroupByKey shuffle -> Get one filename at a time in a ParDo -> Read, process, and write that single file in a ParDo.
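A minimal sketch of that pattern, assuming a hypothetical input folder /tmp/images and a placeholder "processing" step that just records each file's size in bytes; swap those in for your real paths and logic:

```python
import glob
import apache_beam as beam

def read_and_process(filename):
    """Read one image and emit 'filename,size'; the processing here is a stand-in."""
    with open(filename, 'rb') as f:
        data = f.read()
    # ... real image processing would go here ...
    yield '%s,%d' % (filename, len(data))

with beam.Pipeline() as p:
    (p
     | 'ListFiles' >> beam.Create(glob.glob('/tmp/images/*.jpg'))  # hypothetical folder
     | 'PairWithKey' >> beam.Map(lambda name: (name, None))        # key each name so it can be grouped
     | 'Shuffle' >> beam.GroupByKey()                              # redistribute filenames across workers
     | 'ExtractName' >> beam.Map(lambda kv: kv[0])                 # recover the bare filename
     | 'ReadProcess' >> beam.FlatMap(read_and_process)             # one file per ParDo invocation
     | 'Write' >> beam.io.WriteToText('/tmp/output/sizes'))        # hypothetical output prefix
```

On recent SDK versions, beam.Reshuffle() expresses the same key/group/unkey shuffle step as a single transform.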
Upvotes: 1