Reputation: 20200
Apache beam has a ton of great documentation but I don't see the code that is run to create a pipeline vs. what code the worker runs. I guess I see this code will be run once but will it also be run by every worker that starts up..
public static void main(String[] args) {
// Create the pipeline.
PipelineOptions options =
PipelineOptionsFactory.fromArgs(args).create();
Pipeline p = Pipeline.create(options);
// Create the PCollection 'lines' by applying a 'Read' transform.
PCollection<String> lines = p.apply(
"ReadMyFile", TextIO.read().from("gs://some/inputData.txt"));
}
Upvotes: 1
Views: 442
Reputation: 918
This is a great question and is at the heart of what Apache Beam is.
tl;dr There is no user-defined entry-point that is called when a worker starts up.
Long answer
When you code using the Apache Beam SDK (using applies, etc.) what you are really doing is creating a graph in the background containing all the applied transformations, see the documentation here. Thus, once p.run()
is called, the graph is sent to the worker to be executed. The transformations on the graph are then broken down into component pieces and executed in order.
As for the code you wrote in your question, that will only be run once. That code is run exactly once when you execute the jar. The transformations in the graph, however, are run for each element in your data (or more, or less depending on your graph).
If you are curious about the implementation of how your transforms and user-defined functions (ParDos) are executed, then the entry-point lives in the Apache Beam SDK here.
Upvotes: 1