CODEWITHSUNDEEP

java apache-beam

kaxil

Reputation: 18884

Apache Beam: What is the difference between DoFn and SimpleFunction?

While reading about processing streaming elements in Apache Beam using Java, I came across DoFn<InputT, OutputT> and then SimpleFunction<InputT, OutputT>.

Both of these look similar to me and I find it difficult to understand the difference.

Can someone explain the difference in layman's terms?

Upvotes: 12

Views: 9411

Answers (1)

Reputation: 2539

Conceptually, you can think of SimpleFunction as a simple case of DoFn:

SimpleFunction<InputT, OutputT>:
- simple input to output mapping function;
- single input produces single output;
- statically typed, you have to @Override the apply() method;
- doesn't depend on computation context;
- can't use Beam state APIs;
- example use case: MapElements.via(simpleFunction) to convert/modify elements one by one, producing one output for each element.

DoFn<InputT, OutputT>:
- executed with ParDo;
- has access to the processing context (element timestamp, window, pane info, etc.);
- can consume side inputs;
- can produce multiple outputs or no outputs at all;
- can produce side outputs (called additional outputs in current Beam releases and emitted via a TupleTag);
- can use Beam's persistent state APIs;
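- dynamically typed: instead of overriding a method, you annotate one with @ProcessElement, and Beam resolves its parameters by reflection;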
- example use case: read objects from a stream, filter, accumulate them, perform aggregations, convert them, and dispatch to different outputs;
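
As a minimal sketch of the DoFn side of this comparison, the example below splits each line into words with ParDo, so a single input can yield zero, one, or several outputs. The class names DoFnExample and SplitWordsFn and the helper splitWords are hypothetical, and the code assumes the Beam Java SDK and a runner (for example the direct runner) are on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical example class; the names are illustrative only.
public class DoFnExample {

  // Plain helper so the splitting logic can be checked without a pipeline.
  static List<String> splitWords(String line) {
    List<String> words = new ArrayList<>();
    for (String word : line.trim().split("\\s+")) {
      if (!word.isEmpty()) {
        words.add(word);
      }
    }
    return words;
  }

  // A DoFn may emit any number of outputs per input element.
  public static class SplitWordsFn extends DoFn<String, String> {
    // Nothing is overridden here: Beam locates this method via the annotation.
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<String> out) {
      for (String word : splitWords(line)) {
        out.output(word);
      }
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // The empty string produces no output at all; the other lines produce several.
    PCollection<String> words =
        pipeline
            .apply(Create.of("hello beam", "", "one two three"))
            .apply(ParDo.of(new SplitWordsFn()));

    pipeline.run().waitUntilFinish();
  }
}
```

The same processElement method could also declare extra parameters, such as the element timestamp or the window, which is the context access that SimpleFunction does not offer.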

You can find more specific examples and use cases for ParDo in the Beam programming guide.

The guide also mentions MapElements, which is the typical use case for SimpleFunction.
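
For comparison, here is a minimal sketch of the SimpleFunction case used with MapElements.via, where each input maps to exactly one output. The class names SimpleFunctionExample and WordLengthFn are hypothetical, and the code again assumes the Beam Java SDK and a runner are on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical example class; the names are illustrative only.
public class SimpleFunctionExample {

  // One input in, one output out: the contract is an ordinary overridden method.
  public static class WordLengthFn extends SimpleFunction<String, Integer> {
    @Override
    public Integer apply(String word) {
      return word.length();
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // MapElements applies the function to every element, one by one.
    PCollection<Integer> lengths =
        pipeline
            .apply(Create.of("hello", "beam"))
            .apply(MapElements.via(new WordLengthFn()));

    pipeline.run().waitUntilFinish();
  }
}
```

Because apply() is a regular method with no Beam context, it can be called and unit-tested directly, for example new WordLengthFn().apply("beam").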

Upvotes: 15
