krc

Reputation: 21

Spark - loading a directory with 500G of data

I am fairly new to Spark and distributed computing. It's all very straightforward to load a CSV or text file that fits into driver memory, but here I have a real scenario and I am finding it difficult to figure out the approach.

I am trying to process around 500G of data in S3, made up of zip files. Since these are zip files, I am using ZipFileInputFormat as detailed here. It ensures a file is not split across partitions.

Here is my code:

val sc = new SparkContext(conf)

val inputLocation = args(0)
val emailData = sc.newAPIHadoopFile(inputLocation, classOf[ZipFileInputFormat], classOf[Text], classOf[BytesWritable])

// keep only the .txt entries and decode just the valid bytes (getBytes alone returns the padded backing array)
val filesRDD = emailData.filter(_._1.toString.endsWith(".txt"))
                        .map(x => new String(x._2.getBytes, 0, x._2.getLength))

This runs fine on an input of a few hundred MB, but as soon as the input exceeds the memory available on my cluster, I get an OutOfMemoryError.

What is the correct way to approach this issue?
- Should I create an RDD for each zip file, save the output to a file, and later load all the outputs into a separate RDD (roughly the sketch below)?
- Is there a way to load the base directory into the Spark context and have it partitioned?
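
For the first option, this is roughly what I have in mind (the bucket paths and outputBase below are just placeholders):

// process one archive at a time, write the extracted text out, then read it all back
val outputBase = "s3://my-bucket/extracted"   // placeholder
val zipPaths = Seq("s3://my-bucket/emails/a.zip", "s3://my-bucket/emails/b.zip")   // placeholder list of archives

zipPaths.zipWithIndex.foreach { case (path, i) =>
  val oneZip = sc.newAPIHadoopFile(path, classOf[ZipFileInputFormat], classOf[Text], classOf[BytesWritable])
  oneZip.filter(_._1.toString.endsWith(".txt"))
        .map(x => new String(x._2.getBytes, 0, x._2.getLength))
        .saveAsTextFile(s"$outputBase/zip-$i")
}

val allText = sc.textFile(s"$outputBase/*")   // load all the intermediate outputs later

But driving this loop from the driver, one archive at a time, feels wrong, which is why I am asking.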

I have an HDP cluster with 5 nodes and a master, each having 15G of memory.

Any answers/pointers/links are highly appreciated.

Upvotes: 2

Views: 224

Answers (1)

user6022341

Reputation:

Zip files are not splittable, so processing individual files won't do you any good. If you want this to scale out, you should avoid zip completely, or at least put a hard limit on the size of the archives.
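
If you are stuck with zip input, a rough sketch of what you can do instead (assuming java.util.zip and Spark's binaryFiles; the S3 path is a placeholder) is to stream each archive's entries inside a task. Each archive is still bound to a single task, so this only helps if you also cap the archive size:

import java.util.zip.ZipInputStream
import scala.collection.mutable.ArrayBuffer

// one record per archive: (path, PortableDataStream); nothing is read until a task opens the stream
val texts = sc.binaryFiles("s3://my-bucket/emails/*.zip").flatMap { case (_, stream) =>
  val zis = new ZipInputStream(stream.open())
  val out = ArrayBuffer[String]()
  var entry = zis.getNextEntry
  while (entry != null) {
    if (entry.getName.endsWith(".txt")) {
      // entries are decompressed one at a time, but all .txt contents of this
      // archive still end up in this single task
      out += scala.io.Source.fromInputStream(zis).mkString
    }
    entry = zis.getNextEntry
  }
  zis.close()
  out
}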

Upvotes: 0
