Reputation: 7623
I'm trying to understand what "Bytes received / sent" as displayed by the Apache Flink dashboard means. For some context, CSV files are hosted on HDFS servers and I am writing the result to a TXT file locally on my machine. Flink is also running locally on my machine. With this in mind, "Bytes sent" seems to mean "Bytes sent from HDFS server to my machine" and "Bytes received" seems to mean "Bytes sent from my machine to HDFS server". Is this the correct interpretation?
I'm also a little confused by the overlapping tasks shown by the timeline. It seems strange that the join begins before filtering of the first two datasets has completed. Is this the expected behaviour and if so why?
Below is my execution plan for some context on what is happening.
Upvotes: 2
Views: 1548
Reputation: 43439
"Bytes received" for a Flink operator refers to the incoming data, and "bytes sent" refers to the outgoing data. In other words, you've got it backwards: bytes received by the data sources are the bytes received from HDFS, and bytes sent from the sink are bytes written to the TXT file.
However, as explained in this answer, Flink does not provide bytes received statistics for sources, or bytes sent for sinks, which is why these figures are zero. BTW, there are plans to improve on this for a future release.
As for the overlapping, concurrent computation in the dataflow pipeline -- well, yes, this is an important feature of Flink's design, which can support continuous, streaming dataflows. When executing a batch workload this isn't necessary, but doesn't hurt.
Upvotes: 5