Daan

Reputation: 31

Huge multiline JSON file is being processed by a single executor

I have a huge JSON file, 35-40 GB in size. It is a MULTILINE JSON on HDFS. I have made use of spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50) with PySpark.

I have bumped up to 60 executors with 16 cores and 16 GB executor memory each, and set the memory overhead parameters. On every run the executors were being lost.

It works perfectly for smaller files, but not for files larger than 15 GB, even though I have enough cluster resources.

From the Spark UI, what I have seen is that every time the data is being processed by a single executor while all the other executors are idle.

In the UI I have seen Stages (0/2) and Tasks (0/51).

I have re-partitioned the data as well.

Code:

df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
df.count()
df.rdd.glom().map(len).collect()
df.write....(HDFSLOCATION, format='csv')

Goal: My goal is to apply a UDF to each of the columns to clean the data and then write the result out in CSV format. The dataframe has 8 million rows and 210 columns.
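For reference, this is roughly what the cleaning step looks like (clean_value below is just a stand-in for my real cleaning logic, and HDFSLOCATION is the output path from the snippet above):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# placeholder for the real per-column cleaning logic
def clean_value(v):
    return v.strip() if v is not None else v

clean_udf = F.udf(clean_value, StringType())

# apply the UDF to every one of the 210 columns (cast to string since the output is CSV)
for c in df.columns:
    df = df.withColumn(c, clean_udf(F.col(c).cast('string')))

df.write.csv(HDFSLOCATION, mode='overwrite')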

Upvotes: 3

Views: 2136

Answers (1)

Jason Heo

Reputation: 10246

As a rule of thumb, Spark's parallelism is based on the number of input files. But you specified only 1 file (MULTILINE_JSONFILE_.json), so Spark will use 1 CPU to process the following code

spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')

even if you have 16 cores.
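You can confirm this from your own snippet; a quick check like the one below (using the df from your code, before the repartition) should print 1 for a single multiline file:

df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')

# a single multiline JSON file is read as one non-splittable input,
# so this typically prints 1
print(df.rdd.getNumPartitions())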

I would recommend that you split the JSON file into many files.
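One way to do the split inside Spark is a one-off conversion to line-delimited JSON, which is splittable. This is only a sketch; the output path below is a placeholder:

# one-off conversion: this read is still done by a single task,
# but you only pay that cost once
df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')
df.repartition(50).write.mode('overwrite').json('hdfs:///tmp/jsonl_parts')  # placeholder output path

# later jobs read the line-delimited copy and get one task per file/block
df = spark.read.json('hdfs:///tmp/jsonl_parts')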


More precisely, parallelism is based on the number of HDFS blocks if the files are stored on HDFS. If MULTILINE_JSONFILE_.json is 40 GB, it would have about 320 blocks with a 128 MB block size, so Spark tasks should run in parallel when the file is located in HDFS. If you are stuck with no parallelism, I think this is because option("multiline", true) is specified.

In the Databricks documentation, you can see the following sentence:

Files will be loaded as a whole entity and cannot be split.
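Note that your .repartition(50) does not remove this bottleneck: it only redistributes the data after the single-task read has already parsed the whole file. That likely explains the Tasks (0/51) you see, i.e. 1 read task plus 50 shuffled partitions. A quick way to see it:

raw = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')
print(raw.rdd.getNumPartitions())             # 1 -> the whole file is parsed by one task

repartitioned = raw.repartition(50)
print(repartitioned.rdd.getNumPartitions())   # 50, but only after that single-task read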

Upvotes: 5
