Reputation: 113
I have some working knowledge of Python but am pretty new to Apache Beam. I came across an Apache Beam example of a simple word count program. The snippet I'm confused by looks like this:
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
with beam.Pipeline(options=pipeline_options) as p:

  # Read the text file[pattern] into a PCollection.
  lines = p | ReadFromText(known_args.input)

  # Count the occurrences of each word.
  counts = (
      lines
      | 'Split' >> (
          beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x)).
          with_output_types(unicode))
      | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
      | 'GroupAndSum' >> beam.CombinePerKey(sum))

  # Format the counts into a PCollection of strings.
  def format_result(word_count):
    (word, count) = word_count
    return '%s: %s' % (word, count)

  output = counts | 'Format' >> beam.Map(format_result)

  # Write the output using a "Write" transform that has side effects.
  # pylint: disable=expression-not-assigned
  output | WriteToText(known_args.output)
The full version of the code is here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_minimal.py
I'm very confused by the "|" and ">>" operators used here. What do they mean here? Are they natively supported in Python?
Upvotes: 3
Views: 958
Reputation: 1204
Since this code is written in Beam, the symbols you are asking about get their meaning from Beam. Both | and >> are ordinary Python operators, but Beam overloads them for its pipeline objects.
| is the pipe operator: it applies the transform on its right to the pipeline or PCollection on its left. In your example, p is the input for lines = p | ReadFromText(known_args.input), and lines is the input for
counts = (
    lines
    | 'Split' >> (
        beam.FlatMap(lambda x: re.findall(r'[A-Za-z\']+', x)).
        with_output_types(unicode))
    | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
    | 'GroupAndSum' >> beam.CombinePerKey(sum))
>> gives a name (label) to a transform, which makes the step easier to identify in the UI. In your example, 'GroupAndSum' >> beam.CombinePerKey(sum), GroupAndSum is the name of the combine operation, and so on.
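To see why this is valid Python at all, here is a minimal sketch that does not use Beam. The Transform and Collection classes and the two helper functions are hypothetical stand-ins I made up for illustration, not Beam's API; they only show how a library can give | and >> its own meaning through the special methods __or__ and __rrshift__. Beam's real classes do considerably more, so treat this as an analogy rather than a description of its internals.

```python
# Toy stand-ins for illustration only; these names are hypothetical, not Beam's API.

class Transform:
    """A step that maps a list of elements to a new list."""

    def __init__(self, fn, label=None):
        self.fn = fn
        self.label = label or fn.__name__

    def __rrshift__(self, label):
        # Called for `'SomeLabel' >> transform`: str does not implement >>,
        # so Python falls back to the right-hand operand's __rrshift__.
        return Transform(self.fn, label)


class Collection:
    """A container of elements that remembers which steps produced it."""

    def __init__(self, items, applied=()):
        self.items = list(items)
        self.applied = tuple(applied)

    def __or__(self, transform):
        # Called for `collection | transform`: run the step and record its label.
        return Collection(transform.fn(self.items),
                          self.applied + (transform.label,))


def split_words(lines):
    return [word for line in lines for word in line.split()]


def count_words(words):
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return sorted(counts.items())


lines = Collection(["to be or", "not to be"])

# >> binds more tightly than |, so each label attaches to its transform first.
counts = lines | 'Split' >> Transform(split_words) | 'Count' >> Transform(count_words)

print(counts.items)    # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
print(counts.applied)  # ('Split', 'Count')
```

Because >> has higher precedence than | in Python, no parentheses are needed around 'Split' >> Transform(...), which is also why the Beam snippet reads cleanly as label, transform, pipe, repeated.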
Read the documentation given by @Klaus D. in the comments for more clarity.
Upvotes: 3