Daan

Reputation: 31

Huge multiline JSON file is being processed by a single executor

I have a huge JSON file, 35-40 GB in size. It is a MULTILINE JSON on HDFS. I have made use of spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50) with PySpark.

I have bumped up to 60 executors with 16 cores and 16 GB executor memory each, and set the memory overhead parameters. On every run the executors were being lost.

It works perfectly for smaller files, but not for files larger than 15 GB, even though I have enough cluster resources.

From the Spark UI, what I have seen is that every time the data is being processed by a single executor while all the other executors are idle.

In the UI I have seen Stages (0/2) and Tasks (0/51).

I have re-partitioned the data as well.

Code:

df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
df.count()
df.rdd.glom().map(len).collect()
df.write....(HDFSLOCATION, format='csv')

Goal: My goal is to apply a UDF to each of the columns to clean the data and then write the result out in CSV format. The dataframe has 8 million rows and 210 columns.
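For reference, this is roughly what the cleaning step looks like (clean_value below is just a stand-in for my real cleaning logic, and HDFSLOCATION is the output path from the snippet above):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# placeholder for the real per-column cleaning logic
def clean_value(v):
    return v.strip() if v is not None else v

clean_udf = F.udf(clean_value, StringType())

# apply the UDF to every one of the 210 columns (cast to string since the output is CSV)
for c in df.columns:
    df = df.withColumn(c, clean_udf(F.col(c).cast('string')))

df.write.csv(HDFSLOCATION, mode='overwrite')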

Upvotes: 3

Views: 2136

Answers (1)

Jason Heo

Reputation: 10246

As a rule of thumb, Spark's parallelism is based on the number of input files. But you specified only 1 file (MULTILINE_JSONFILE_.json), so Spark will use 1 CPU to process the following code

spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')

even if you have 16 cores.
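You can confirm this from your own snippet; a quick check like the one below (using the df from your code, before the repartition) should print 1 for a single multiline file:

df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')

# a single multiline JSON file is read as one non-splittable input,
# so this typically prints 1
print(df.rdd.getNumPartitions())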

I would recommend that you split the JSON file into many files.
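One way to do the split inside Spark is a one-off conversion to line-delimited JSON, which is splittable. This is only a sketch; the output path below is a placeholder:

# one-off conversion: this read is still done by a single task,
# but you only pay that cost once
df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')
df.repartition(50).write.mode('overwrite').json('hdfs:///tmp/jsonl_parts')  # placeholder output path

# later jobs read the line-delimited copy and get one task per file/block
df = spark.read.json('hdfs:///tmp/jsonl_parts')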


More precisely, parallelism is based on the number of HDFS blocks if the files are stored on HDFS. If MULTILINE_JSONFILE_.json is 40 GB, it would have about 320 blocks with a 128 MB block size, so Spark tasks should run in parallel when the file is located in HDFS. If you are stuck with no parallelism, I think this is because option("multiline", true) is specified.

In the Databricks documentation, you can see the following sentence:

Files will be loaded as a whole entity and cannot be split.
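Note that your .repartition(50) does not remove this bottleneck: it only redistributes the data after the single-task read has already parsed the whole file. That likely explains the Tasks (0/51) you see, i.e. 1 read task plus 50 shuffled partitions. A quick way to see it:

raw = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')
print(raw.rdd.getNumPartitions())             # 1 -> the whole file is parsed by one task

repartitioned = raw.repartition(50)
print(repartitioned.rdd.getNumPartitions())   # 50, but only after that single-task read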

Upvotes: 5
