kaxil

Reputation: 18884

Apache Beam: What is the difference between DoFn and SimpleFunction?

While reading about processing streaming elements in Apache Beam using Java, I came across DoFn<InputT, OutputT> and then SimpleFunction<InputT, OutputT>.

Both of these look similar to me and I find it difficult to understand the difference.

Can someone explain the difference in layman's terms?

Upvotes: 12

Views: 9411

Answers (1)

Anton

Reputation: 2539

Conceptually, you can think of SimpleFunction as a simple case of DoFn:

  • SimpleFunction<InputT, OutputT>:

    • simple input to output mapping function;
    • single input produces single output;
    • statically typed: you have to @Override the apply() method, whose signature is fixed by the type parameters;
    • doesn't depend on computation context;
    • can't use Beam state APIs;
    • example use case: MapElements.via(simpleFunction) to convert/modify elements one by one, producing one output for each element;
  • DoFn<InputT, OutputT>:

    • executed with ParDo;
    • exposed to the context (timestamp, window, pane, etc.);
    • can consume side inputs;
    • can produce multiple outputs or no outputs at all;
    • can produce side outputs (called additional, or tagged, outputs in newer Beam versions);
    • can use Beam's persistent state APIs;
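    • dynamically typed: the processing method is marked with @ProcessElement and its parameters are resolved by the runner rather than fixed by an overridden signature;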
    • example use case: read objects from a stream, filter, accumulate them, perform aggregations, convert them, and dispatch to different outputs;
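
To make the one-to-one versus zero-or-more distinction concrete, here is a minimal plain-Java sketch. It does not use the Beam SDK: `MappingShapes`, `ElementProcessor`, `applyOneToOne`, and `applyEmitting` are hypothetical names invented for illustration, and the sketch only imitates the shape of the two abstractions. In Beam the first shape corresponds roughly to MapElements.via(simpleFunction) and the second to a DoFn executed with ParDo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Plain-Java analogy only; none of these names are Beam APIs.
class MappingShapes {

    // SimpleFunction-like shape: exactly one output for every input.
    static final Function<String, Integer> WORD_LENGTH = String::length;

    // DoFn-like shape: the function gets the element plus an emitter,
    // so it may emit zero, one, or many outputs per input.
    interface ElementProcessor<I, O> {
        void process(I element, Consumer<O> out);
    }

    // Splits a line into words and drops empty tokens (a blank line emits nothing).
    static final ElementProcessor<String, String> SPLIT_NON_EMPTY = (line, out) -> {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.accept(word);
            }
        }
    };

    // Applies a one-to-one function: output size always equals input size.
    static <I, O> List<O> applyOneToOne(List<I> input, Function<I, O> fn) {
        List<O> result = new ArrayList<>();
        for (I element : input) {
            result.add(fn.apply(element));
        }
        return result;
    }

    // Applies an emitting function: output size is independent of input size.
    static <I, O> List<O> applyEmitting(List<I> input, ElementProcessor<I, O> fn) {
        List<O> result = new ArrayList<>();
        for (I element : input) {
            fn.process(element, result::add);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hello beam", "", "one two three");
        System.out.println(applyOneToOne(lines, WORD_LENGTH));     // [10, 0, 13]
        System.out.println(applyEmitting(lines, SPLIT_NON_EMPTY)); // [hello, beam, one, two, three]
    }
}
```

Note that the one-to-one version has to return something even for the empty line, while the emitting version can simply skip it. The sketch leaves out everything else a real DoFn offers, such as timestamps, windows, side inputs, and state.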

You can find more specific examples and use cases for ParDos in the dev guide.

This part mentions MapElements, which is the use case for SimpleFunction.

Upvotes: 15
