Reputation: 809
I have the following code in Apache Flink. When I execute it, some parts of my code are run twice. Could some one let me know why is that happening?
DataSet input1 = ...
DataSet input2 = ...
List mappedInput1 = input1
.map(...)
.collect();
DataSet data = input1
.union(input1.filter(...))
.mapPartition(...);
data = data.union(data2).distinct();
data.flatMap(new MapFunc1(data.collect()));
data
.flatMap(new MapFunc2(input2.collect()))
.groupBy(0)
.sum(1)
.print();
Upvotes: 3
Views: 399
Reputation: 18987
Every collect()
and print()
statement eagerly triggers an execution and fetches the result to the client code. Each such call tracks back the whole program to the data sources.
Your code contains three collect()
and one print()
statement. Hence, four individual programs are submitted and executed. Instead of using collect()
, you should have a look at broadcast variables. Broadcast variables distribute a DataSet to each parallel instance of an operator. The computation and distribution happens in the same program and is not routed over the client program. Instead the data is directly exchanged between the workers running the operators.
Upvotes: 4