minzchickenflavor

Reputation: 516

Which node of an Apache Spark cluster reads a file from the disk?

I am quite new to Apache Spark, but I think I have got the basic concept. What I don't really understand is which node of the Apache Spark cluster reads in the resources.

For example, when I read a file from disk: I found some documentation and answers on Stack Overflow indicating that every worker reads in the whole file.

If that's the case and I want to do some sort of line processing on multiple workers, every worker would hold lines in its memory that it doesn't need, because another worker is processing them.

Furthermore, what happens when I use Spark SQL and query a big table? Does every worker query the database, or does one worker run the query and the database's answer then get shuffled onto the other workers?

An answer or a link to the part of the documentation that describes this behaviour would be very helpful.

Upvotes: 0

Views: 468

Answers (1)

Rick Moritz

Reputation: 1518

What happens depends on how you read the file:

If you use the SparkSession-provided tools to read a DataFrame (see the DataFrameReader documentation), then an execution graph is created which will try to read data node-locally. I.e., each Spark executor reads the part of the distributed storage that resides locally to that executor: for example, local HDFS blocks. This requires that partitioning information is available on the data store and is used when creating the DataFrameReader. This is the proper way to use Spark with big data, since it allows near-arbitrary scaling.
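A minimal sketch of this case, assuming a Spark 2.x SparkSession and a hypothetical CSV file on HDFS (the path and options are placeholders):

    import org.apache.spark.sql.SparkSession

    object NodeLocalRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("node-local-read")
          .getOrCreate()

        // The DataFrameReader splits the file into partitions (e.g. HDFS blocks),
        // and Spark prefers to schedule each task on an executor holding that block,
        // so no single node has to read the whole file.
        val df = spark.read
          .option("header", "true")
          .csv("hdfs:///data/big_input.csv")   // hypothetical path

        println(df.rdd.getNumPartitions)       // roughly one partition per split
        spark.stop()
      }
    }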

If you use Java or Scala File-IO in your Spark code, then one of two things can happen:

  1. If the code is executed on the driver, you will have to parallelize the data you read in, by calling sparkContext.parallelize (on the SparkContext, reachable through the SparkSession) on a collection you generated from the data you have read (see the sketch after this list). This is useful for some tests, but will not scale to most cases where Spark makes sense in production.
  2. If the code is executed on an executor (i.e. inside an RDD.map closure), then the file will be read on each executor where that code runs, and will be available on each executor in its entirety. This is usually not desirable, unless you have very specific requirements - it also requires that the file is available on each node.
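A minimal sketch of case 1, assuming a small file on the driver machine; the path and the line-counting logic are hypothetical:

    import scala.io.Source
    import org.apache.spark.sql.SparkSession

    object DriverSideRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("driver-side-read").getOrCreate()

        // Plain Scala IO: this runs on the driver only, so the whole file
        // must fit into driver memory.
        val lines = Source.fromFile("/local/path/small_input.txt").getLines().toList

        // Distribute the in-memory collection across the executors.
        val rdd = spark.sparkContext.parallelize(lines)
        println(rdd.map(_.split("\\s+").length).reduce(_ + _))  // total word count

        spark.stop()
      }
    }

This only works when the file fits on (and is reachable from) the driver, which is why it does not scale to production-sized inputs.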

Regarding SparkSQL and querying a table: the query is interpreted on the driver, and an execution plan corresponding to the query is generated. This execution plan is then used to distribute the generated stages to those executors which hold the data required to process each stage, and to make sure the data is redistributed in such a way that the following stages can be executed. Since SparkSQL usually does not run against a database, but rather against a columnar or row-based file structure, each executor again ideally only loads the file data that is local to it. If the data is not local, each executor tries to load a number of partitions from the external data store, potentially using some push-down of filter logic. In that case, yes, every worker queries the "database", but only ever for a part of the data, and usually only to read records.
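A minimal sketch of the SparkSQL case, assuming a hypothetical Parquet dataset registered as a temporary view; the path, table name, and columns are placeholders:

    import org.apache.spark.sql.SparkSession

    object SqlPushdown {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sql-pushdown").getOrCreate()

        // The "table" is just a columnar file layout on distributed storage.
        spark.read.parquet("hdfs:///warehouse/events")   // hypothetical path
          .createOrReplaceTempView("events")

        // The driver turns this query into an execution plan; each executor
        // reads only its assigned partitions, and the WHERE clause can be
        // pushed down into the Parquet reader.
        val result = spark.sql(
          """SELECT user_id, COUNT(*) AS cnt
            |FROM events
            |WHERE day = '2017-01-01'
            |GROUP BY user_id""".stripMargin)

        result.explain()   // shows pushed filters and the shuffle between stages
        result.show()
        spark.stop()
      }
    }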

Upvotes: 2
