Reputation: 2661
Trying to understand why Spark needs space on the local machine! Is there a way around it? I keep running into 'No space left on device'. I understand that I can set 'spark.local.dir' to a comma-separated list of directories, but is there a way to use HDFS instead?
I am trying to merge two HUGE datasets. On smaller datasets Spark is kicking MapReduce's butt, but I can't claim victory until I prove it on these huge datasets. I am not using YARN. Also, our Gateway nodes (aka Edge nodes) won't have a lot of free space.
Is there a way around this?
Upvotes: 2
Views: 1462
Reputation: 521
During a groupByKey operation, Spark just writes serialized partitions into the local tmp dir as plain files (see the ShuffledRDD internals, the serializer, and so on); routing those writes through HDFS would be complicated enough that it isn't supported.
Just set 'spark.local.dir' to a volume with free space. This data is only needed by the local machine itself, not as distributed data (like HDFS).
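For example, assuming each worker has spare space mounted at `/data1` and `/data2` (hypothetical paths -- substitute your own mounts), you could point the shuffle scratch space there either in `conf/spark-defaults.conf` or per job with `spark-submit`:

```shell
# conf/spark-defaults.conf -- comma-separated list of local scratch dirs.
# Paths are assumptions; they must exist with free space on EVERY worker,
# not just the driver/gateway node.
#   spark.local.dir  /data1/spark-tmp,/data2/spark-tmp

# Or override per job on the command line:
spark-submit \
  --conf "spark.local.dir=/data1/spark-tmp,/data2/spark-tmp" \
  my-merge-job.jar
```

Note that in a standalone cluster the `SPARK_LOCAL_DIRS` environment variable on a worker, if set, takes precedence over `spark.local.dir`, so check the workers' `spark-env.sh` too.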
Upvotes: 2