mingzhao.pro

Reputation: 759

Apache Spark: result file created on worker node instead of master node

I configured one master on my local PC and a worker node inside VirtualBox, but the result file is created on the worker node instead of being sent back to the master node. I wonder why that is.

Is it because my worker node cannot send the result back to the master node? How can I verify that?

I use Spark 2.2 and the same username on both the master and worker nodes. I also configured passwordless SSH.
I tried both --deploy-mode client and --deploy-mode cluster.
I also swapped the master and worker nodes and got the same result.

val result = joined.distinct()
result.write.mode("overwrite").format("csv")
      .option("header", "true").option("delimiter", ";")
      .save("file:///home/data/KPI/KpiDensite.csv")

Also, I load the input file like this:

val commerce = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true")
  .option("delimiter", "|").load("file:///home/data/equip-serv-commerce-infra-2016.csv").distinct()

But why must I place the input file at the same path on both the master and worker nodes beforehand? I'm not using YARN or Mesos right now.

Upvotes: 0

Views: 1424

Answers (1)

ernest_k

Reputation: 45329

You are exporting to a local file system path, which tells Spark to write to the file system of the machine executing the code. On a worker, that will be the worker machine's file system.

If you want the data stored on the file system of the driver (note: the driver, not the master; you'll need to know where the driver is running on your cluster), then you need to collect the RDD or DataFrame and use normal IO code to write the data to a file.
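A minimal sketch of the collect-then-write approach, assuming `spark` and the `result` DataFrame from the question exist (the output path and the `;` delimiter are taken from the question; everything else is illustrative). Note that `collect()` is only safe when the result fits in driver memory:

```scala
import java.io.PrintWriter

// Bring all rows back to the driver JVM. This is what makes the
// output land on the driver's local file system rather than a worker's.
val header = result.columns.mkString(";")
val rows   = result.collect().map(_.mkString(";"))

// Write with plain Java IO on the driver machine.
val out = new PrintWriter("/home/data/KPI/KpiDensite.csv")
try {
  out.println(header)
  rows.foreach(out.println)
} finally {
  out.close()
}
```

This produces a single file on the driver, unlike `DataFrame.write`, which produces a directory of part files on whichever machines run the tasks.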

The easiest option, however, especially if you're running your application in cluster mode, is to use a distributed storage system such as HDFS (.save("hdfs://master:port/data/KPI/KpiDensite.csv")) or to export to a database (writing over JDBC or using a NoSQL database).

Upvotes: 1
