Jason

Reputation: 1130

Spark: Reduce memory usage for wholeTextFile

I have many small text files to process with sc.wholeTextFiles and I'm running into out-of-memory errors. There are 100 text files in gz format; each compressed file is about 10 MB, with an uncompressed size of around 100 MB. The total compressed size is 1.2 G and the total uncompressed size is around 150 G. My machine is a 16-core, 64 G Linux box, and I'm using Spark 3.3.0 in local mode.

Here's my code. I first use wholeTextFiles to read each file as a single string, then split it on runs of "-" characters and save the split strings to a parquet file. Note that I've set minPartitions in wholeTextFiles to 400, so the final number of partitions should be 100 (one per file), the largest possible, in order to reduce the memory usage of each partition. While the Spark query was running, I monitored memory usage with the top command, and the JVM's memory just kept going up as tasks finished. My expectation was that once a task/partition finishes its work, its memory should be released, so the total should never exceed what the 16 concurrent tasks (16 files) need. How should I revise the code and tune the settings to use less memory?

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, functions as f

conf = SparkConf()

conf.set('spark.driver.memory', '50g')

spark = SparkSession.builder.enableHiveSupport().config(conf = conf).getOrCreate()

sc = spark.sparkContext

df1 = sc.wholeTextFiles('folder1', 400).toDF()

df3 = df1.select(f.explode(f.split('_2', r'-+\n')).alias('value'))

df3.write.mode('overwrite').parquet('file1')  

Upvotes: 0

Views: 139

Answers (1)

Andrew

Reputation: 11360

You don't need to load the full file into memory to work with it:

  1. Uncompress the archive.
  2. Open the single file as a stream.
  3. In a loop, read 1000 lines (or a few megabytes of data) at a time and do whatever you need with that chunk (for example, append it to the end of another file).
  4. Keep going until the end of the file.
  5. Repeat for all archives.

A minimal sketch of this idea is shown below.
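Here is a rough sketch of that pattern in plain Python, reading each gz file as a stream with the standard gzip module instead of through Spark. The input folder ('folder1') and the dash-line record separator are taken from the question; the output file name and the record-detection logic are just assumptions for illustration:

import glob
import gzip

# Process each compressed file as a stream instead of loading it whole.
for path in glob.glob('folder1/*.gz'):
    buffer = []
    # 'rt' decompresses and decodes line by line; nothing is loaded all at once.
    with gzip.open(path, 'rt') as src, open('records.txt', 'a') as dst:
        for line in src:
            # Treat a line made up only of dashes as the end of a record
            # (roughly what the question's split on r'-+\n' does).
            if line.strip() != '' and line.strip('-\n') == '':
                dst.write(''.join(buffer) + '\n')
                buffer = []
            else:
                buffer.append(line)
        # Write whatever is left after the last separator.
        if buffer:
            dst.write(''.join(buffer) + '\n')

Because each file is handled one record at a time, peak memory stays around the size of a single record rather than a whole uncompressed file.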

Upvotes: 0
