what is the entry point of the workers in apache beam? (what methods are called?)

Question

Apache beam has a ton of great documentation but I don't see the code that is run to create a pipeline vs. what code the worker runs. I guess I see this code will be run once but will it also be run by every worker that starts up..

    public static void main(String[] args) {
        // Create the pipeline.
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // Create the PCollection 'lines' by applying a 'Read' transform.
        PCollection lines = p.apply(
          "ReadMyFile", TextIO.read().from("gs://some/inputData.txt"));
    }

Cubez · Accepted Answer

This is a great question and is at the heart of what Apache Beam is.

tl;dr There is no user-defined entry-point that is called when a worker starts up.

Long answer

When you code using the Apache Beam SDK (using applies, etc.) what you are really doing is creating a graph in the background containing all the applied transformations, see the documentation here. Thus, once p.run() is called, the graph is sent to the worker to be executed. The transformations on the graph are then broken down into component pieces and executed in order.

As for the code you wrote in your question, that will only be run once. That code is run exactly once when you execute the jar. The transformations in the graph, however, are run for each element in your data (or more, or less depending on your graph).

If you are curious about the implementation of how your transforms and user-defined functions (ParDos) are executed, then the entry-point lives in the Apache Beam SDK here.

what is the entry point of the workers in apache beam? (what methods are called?)

Answers (1)

Related Questions