sparkly

Reputation: 71

spark "java.io.IOException: No space left on device"

I am running a PySpark job on an EC2 cluster with 4 workers, and I get this error:

2018-07-05 08:20:44 WARN  TaskSetManager:66 - Lost task 1923.0 in stage 18.0 (TID 21385, 10.0.5.97, executor 3): java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at net.jpountz.lz4.LZ4BlockOutputStream.finish(LZ4BlockOutputStream.java:260)
at net.jpountz.lz4.LZ4BlockOutputStream.close(LZ4BlockOutputStream.java:190)
at org.apache.spark.serializer.DummySerializerInstance$1.close(DummySerializerInstance.java:65)
at org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:173)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:194)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

I looked at https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html

I tried increasing the number of shuffle partitions - same issue. My data looks fairly evenly partitioned across the executors. I want to try the workaround of assigning None to the DataFrames; the question is whether it will indeed remove the intermediate shuffle files, and whether the lineage will be dropped.
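What I tried for the shuffle partitions (a sketch; 2000 is a hypothetical value, the default is 200):

sqlContext.setConf("spark.sql.shuffle.partitions", "2000")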

For the None workaround - for instance, if my code looks like this:

df1 = sqlContext.read.parquet(...)
df2 = df1.filter(...)
df3 = df2.groupBy(*groupList).agg(...)

and then I set

df1 = None

after line 1 - will it save shuffle space? Isn't df1 still needed, and won't it be re-computed for df2 and df3?
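To make the workaround concrete, here is a self-contained sketch; the path, filter condition and aggregation are hypothetical stand-ins for the elided arguments above, and SparkSession plays the role of sqlContext:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-space-sketch").getOrCreate()

df1 = spark.read.parquet("s3://bucket/raw/")                   # hypothetical path
df2 = df1.filter(F.col("event_date") >= "2018-01-01")          # hypothetical condition
df3 = df2.groupBy("customer_id").agg(F.count("*").alias("n"))  # hypothetical aggregation

df1 = None  # the workaround in question: drops only the Python reference
df3.write.parquet("s3://bucket/aggregated/")                   # hypothetical output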

Second question - will checkpointing df1 or df2 help by breaking the lineage?
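What I mean by checkpointing (a sketch reusing the hypothetical names above; the checkpoint directory is a hypothetical HDFS path):

# checkpoint() materializes the DataFrame to reliable storage and truncates
# its logical plan, so downstream stages no longer carry the full lineage
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  # hypothetical dir
df2_cp = df2.checkpoint()  # eager by default in Spark 2.x
df3 = df2_cp.groupBy("customer_id").agg(F.count("*").alias("n"))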

What is a feasible solution when dealing with data larger than my storage (around 400 GB of raw data processed)?

UPDATE: removing the cache of a DataFrame between two phases that need it helped, and I got no errors. I wonder how that helps with the intermediate shuffle files.
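For reference, what "removing the cache" looked like (a sketch with the same hypothetical names; blocking=True just waits until the cached blocks are actually freed):

df2.cache()                                # cached for reuse in phase 1
df3.write.parquet("s3://bucket/phase1/")   # hypothetical phase-1 output
df2.unpersist(blocking=True)               # drop the cache between the phases
df2.join(df3, "customer_id").write.parquet("s3://bucket/phase2/")  # hypothetical phase 2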

Upvotes: 1

Views: 13475

Answers (1)

User12345

Reputation: 5480

I did face a similar situation. The reason is that groupBy operations and joins cause data to be shuffled. Since this shuffle data is temporary data produced while the Spark application runs, it is stored in the directory that spark.local.dir in the spark-defaults.conf file points to, which is normally a tmp directory with little space.

In general, to avoid this error, update spark.local.dir in the spark-defaults.conf file to point to a location that has more disk space.
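For example (a sketch; /mnt/spark-local and /mnt2/spark-local are hypothetical directories on larger volumes, and several comma-separated disks can be listed to spread the shuffle I/O):

# spark-defaults.conf
spark.local.dir    /mnt/spark-local,/mnt2/spark-local

Note that on YARN this property is overridden by the node manager's local directories, so it mainly applies to deployments such as the standalone EC2 cluster in the question.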

Upvotes: 5
