dobbysock1002

Reputation: 967

Explain Apache Beam Python syntax

I have read through the Beam documentation and also looked through Python documentation but haven't found a good explanation of the syntax being used in most of the example Apache Beam code.

Can anyone explain what the _, |, and >> are doing in the code below? Also, is the text in quotes (e.g. 'ReadTrainingData') meaningful, or could it be exchanged for any other label? In other words, how is that label being used?

train_data = pipeline | 'ReadTrainingData' >> _ReadData(training_data)
evaluate_data = pipeline | 'ReadEvalData' >> _ReadData(eval_data)

input_metadata = dataset_metadata.DatasetMetadata(schema=input_schema)

_ = (input_metadata
     | 'WriteInputMetadata' >> tft_beam_io.WriteMetadata(
         os.path.join(output_dir, path_constants.RAW_METADATA_DIR),
         pipeline=pipeline))

preprocessing_fn = reddit.make_preprocessing_fn(frequency_threshold)
(train_dataset, train_metadata), transform_fn = (
  (train_data, input_metadata)
  | 'AnalyzeAndTransform' >> tft.AnalyzeAndTransformDataset(
      preprocessing_fn))

Upvotes: 68

Views: 9238

Answers (2)

Mike Williamson

Reputation: 3240

No one mentioned the _, so just for completeness:

  • There is nothing officially special about the _, but it is considered good practice to assign a returned value that you do not care about to _. This makes it obvious to readers of your code that you plan to throw it away.
    • It can also free memory: each time you re-assign (overwrite) _, the previous value loses that reference and can be garbage-collected if nothing else still refers to it.
  • There is an unofficial role the _ has: because it is the "throwaway" variable, most linters and other code clarity helpers treat it differently.
    • For instance, if you assign a variable use_me and never actually use it, a linter will warn that you have an unused variable. And if you have rigorous code quality restrictions, maybe you cannot even merge your code into production with an unused variable.
    • _ is not caught by the linter (and could be merged into a strict code base) because it is understood to be a throwaway variable, and therefore there is no mistake in your code (at least not in this regard).

Upvotes: 2

rf-

Reputation: 1493

Operators in Python can be overloaded. In Beam, | is a synonym for apply, which applies a PTransform to a PCollection to produce a new PCollection. >> allows you to name a step for easier display in various UIs -- the string between the | and the >> is only used for these display purposes and identifying that particular application.
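To make the overloading concrete, here is a small self-contained sketch of how `|` and `>>` can be given these meanings in plain Python. The classes Transform and Collection are hypothetical stand-ins written for this illustration; they only mirror the idea and are not Beam's actual implementation.

```python
class Transform:
    """Hypothetical stand-in for a PTransform (not Beam's real class)."""

    def __init__(self, fn, label=None):
        self.fn = fn
        # Fall back to the function name when no label is given.
        self.label = label or fn.__name__

    def __rrshift__(self, label):
        # Runs for: 'SomeLabel' >> transform. A str has no >> of its own,
        # so Python falls back to the right operand's __rrshift__.
        return Transform(self.fn, label)


class Collection:
    """Hypothetical stand-in for a PCollection."""

    def __init__(self, items, steps=()):
        self.items = list(items)
        self.steps = tuple(steps)  # labels of the steps applied so far

    def apply(self, transform):
        new_items = [transform.fn(item) for item in self.items]
        return Collection(new_items, self.steps + (transform.label,))

    def __or__(self, transform):
        # collection | transform is just another spelling of apply().
        return self.apply(transform)


def double(x):
    return x * 2


numbers = Collection([1, 2, 3])

# >> binds more tightly than |, so this parses as
# numbers | ('DoubleIt' >> Transform(double)).
labeled = numbers | 'DoubleIt' >> Transform(double)
unlabeled = numbers | Transform(double)

print(labeled.items, labeled.steps)
print(unlabeled.items, unlabeled.steps)
```

Changing 'DoubleIt' to any other string changes only the recorded step name, not the computed items, which matches the point that the label is there for display and identification.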

See https://beam.apache.org/documentation/programming-guide/#transforms

Upvotes: 88
