Reputation: 2661
Trying to understand why Spark needs space on the local machine! Is there a way around it? I keep running into 'No space left on device'. I understand that I can set 'spark.local.dir' to a comma-separated list of directories, but is there a way to use HDFS instead?
I am trying to merge two HUGE datasets. On smaller datasets Spark is kicking MapReduce's butt, but I can't claim victory until I prove it on these huge datasets. I am not using YARN. Also, our Gateway nodes (aka Edge nodes) won't have a lot of free space.
Is there a way around this?
Upvotes: 2
Views: 1462
Reputation: 521
During a groupByKey operation, Spark just writes serialized partitions into the local tmp dir as plain files (see the ShuffledRDD internals, the serializer, and so on); routing those writes through HDFS would be complicated enough that it isn't supported.
Just set 'spark.local.dir' to a volume with free space. This data is only needed by the local machine itself, not as distributed data (like HDFS).
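For example, assuming each worker has spare space mounted at `/data1` and `/data2` (hypothetical paths -- substitute your own mounts), you could point the shuffle scratch space there either in `conf/spark-defaults.conf` or per job with `spark-submit`:

```shell
# conf/spark-defaults.conf -- comma-separated list of local scratch dirs.
# Paths are assumptions; they must exist with free space on EVERY worker,
# not just the driver/gateway node.
#   spark.local.dir  /data1/spark-tmp,/data2/spark-tmp

# Or override per job on the command line:
spark-submit \
  --conf "spark.local.dir=/data1/spark-tmp,/data2/spark-tmp" \
  my-merge-job.jar
```

Note that in a standalone cluster the `SPARK_LOCAL_DIRS` environment variable on a worker, if set, takes precedence over `spark.local.dir`, so check the workers' `spark-env.sh` too.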
Upvotes: 2