Reputation: 71
I am running a PySpark job on an EC2 cluster with 4 workers, and I get this error:
2018-07-05 08:20:44 WARN TaskSetManager:66 - Lost task 1923.0 in stage 18.0 (TID 21385, 10.0.5.97, executor 3): java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at net.jpountz.lz4.LZ4BlockOutputStream.finish(LZ4BlockOutputStream.java:260)
at net.jpountz.lz4.LZ4BlockOutputStream.close(LZ4BlockOutputStream.java:190)
at org.apache.spark.serializer.DummySerializerInstance$1.close(DummySerializerInstance.java:65)
at org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:173)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:194)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
I looked at https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
and tried increasing the shuffle partitions, but I get the same issue. My data looks fairly evenly partitioned across executors. I want to try the workaround of assigning None to DataFrames; the question is whether it will indeed remove the intermediate shuffle files, and whether the lineage will be dropped.
For instance, if my code looks like this:
df1 = sqlContext.read.parquet(...)
df2 = df1.filter(...)
df3 = df2.groupBy(*groupList).agg(....)
and I put
df1 = None
after line 1, will it save shuffle space? Isn't df1 still needed, so that it will be re-computed for df2 and df3?
Second question: will checkpointing df1 or df2 help by breaking the lineage?
What is a feasible solution when dealing with data larger than my storage (around 400 GB of raw data processed)?
UPDATE: removing the cache of a DataFrame between two phases that need that DataFrame helped, and I got no errors. I wonder how it helps with the intermediate shuffle files.
Upvotes: 1
Views: 13475
Reputation: 5480
I did face a similar situation. The reason is that group by operations and joins cause the data to be shuffled. This shuffle data is temporary while the Spark application executes, and it is stored in the directory that spark.local.dir in the spark-defaults.conf file points to, which is normally a tmp directory with little space.

In general, to avoid this error, update spark.local.dir in the spark-defaults.conf file to a location that has more disk space.
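As a sketch, assuming the cluster nodes have a larger volume mounted (the paths below are hypothetical examples), the relevant line in spark-defaults.conf could look like:

```
# spark-defaults.conf
# Scratch space for shuffle and spill files; accepts a comma-separated
# list of directories, ideally on separate large disks.
spark.local.dir  /mnt/spark-tmp,/mnt2/spark-tmp
```

Note that when running on YARN, this setting is overridden by the node manager's local directories, so it would need to be configured there instead.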
Upvotes: 5