Smith Cruise

Reputation: 444

Does a Spark task read the entire HDFS block before computing?

I originally thought that a Spark task reads the entire HDFS block before computing, but I found that the executor's HDFS read speed differs between applications. In principle, the HDFS download speed should be bounded only by the full network bandwidth, but in practice it is not; it depends on how expensive the task's computation is.

For example, my network limit is 100 MB/s, but in LogisticRegression, one executor (single-core, meaning only one task can run at a time) downloads from HDFS at only 30 MB/s. When I increase the number of cores in the executor, the HDFS download speed increases accordingly.

So I think Spark reads HDFS files in something like a streaming model: it computes while reading.

Upvotes: 0

Views: 210

Answers (1)

OneCricketeer

Reputation: 191743

The Namenode returns the block locations (which it tracks from Datanode reports) to the client, yes. The client (Spark, in this case) then processes the blocks as streams while fetching the next blocks at the same time, assuming the file is splittable. As tasks complete, their results are operated on according to your application logic.
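A minimal sketch of this compute-while-reading behavior, using a plain Python generator to stand in for Spark's per-record iterator (the chunk layout and per-record work are hypothetical; this is not Spark's actual code, just an illustration of why slow computation throttles the observed read rate):

```python
def read_records(chunks):
    # Stand-in for a streamed HDFS block reader: yields one record at
    # a time instead of materializing the whole block in memory first.
    for chunk in chunks:
        for record in chunk.split("\n"):
            if record:
                yield record

def run_task(chunks):
    # The "task": computation is interleaved with reading. If the work
    # per record is slow, the reader is consumed (and thus the network
    # is used) correspondingly slowly.
    total = 0
    for record in read_records(chunks):
        total += len(record)  # placeholder for real per-record work
    return total

# Two simulated HDFS blocks arriving as chunks of newline-delimited records
blocks = ["a\nbb\n", "ccc\n"]
print(run_task(blocks))  # → 6
```

Because the reader is a lazy iterator, adding more cores (more concurrent tasks) lets more of these read-compute loops run in parallel, which matches the question's observation that HDFS throughput scales with executor cores.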

Upvotes: 2
