Jason

Reputation: 1130

Spark: Reduce memory usage for wholeTextFile

I have many small text files to process with sc.wholeTextFiles and I'm running into out-of-memory errors. There are 100 text files in gz format; each compressed file is about 10 MB, with an uncompressed size of around 100 MB. The total compressed size is 1.2 G and the total uncompressed size is around 150 G. My machine is a 16-core, 64 G Linux box, and I'm using Spark 3.3.0 in local mode.

Here's my code. I first use wholeTextFiles to read each file as a single string, then split it on runs of "-" characters and save the split strings to a parquet file. Note that I've set minPartitions in wholeTextFiles to 400, so the final number of partitions should be 100 (one per file), the largest possible, in order to reduce the memory usage of each partition. While the Spark query was running, I monitored memory usage with the top command, and the JVM's memory just kept going up as tasks finished. My expectation was that once a task/partition finishes its work, its memory should be released, so the total should never exceed what the 16 concurrent tasks (16 files) need. How should I revise the code and tune the settings to use less memory?

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, functions as f

conf = SparkConf()

conf.set('spark.driver.memory', '50g')

spark = SparkSession.builder.enableHiveSupport().config(conf = conf).getOrCreate()

sc = spark.sparkContext

df1 = sc.wholeTextFiles('folder1', 400).toDF()

df3 = df1.select(f.explode(f.split('_2', r'-+\n')).alias('value'))

df3.write.mode('overwrite').parquet('file1')  

Upvotes: 0

Views: 139

Answers (1)

Andrew

Reputation: 11360

You don't need to load the full file into memory to work with it:

  1. Uncompress the archive.
  2. Open the single file as a stream.
  3. In a loop, read 1000 lines (or a few megabytes of data) at a time and do whatever you need with that chunk (for example, append it to the end of another file).
  4. Keep going until the end of the file.
  5. Repeat for all archives.

A minimal sketch of this idea is shown below.
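Here is a rough sketch of that pattern in plain Python, reading each gz file as a stream with the standard gzip module instead of through Spark. The input folder ('folder1') and the dash-line record separator are taken from the question; the output file name and the record-detection logic are just assumptions for illustration:

import glob
import gzip

# Process each compressed file as a stream instead of loading it whole.
for path in glob.glob('folder1/*.gz'):
    buffer = []
    # 'rt' decompresses and decodes line by line; nothing is loaded all at once.
    with gzip.open(path, 'rt') as src, open('records.txt', 'a') as dst:
        for line in src:
            # Treat a line made up only of dashes as the end of a record
            # (roughly what the question's split on r'-+\n' does).
            if line.strip() != '' and line.strip('-\n') == '':
                dst.write(''.join(buffer) + '\n')
                buffer = []
            else:
                buffer.append(line)
        # Write whatever is left after the last separator.
        if buffer:
            dst.write(''.join(buffer) + '\n')

Because each file is handled one record at a time, peak memory stays around the size of a single record rather than a whole uncompressed file.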

Upvotes: 0
