krc

Reputation: 21

Spark - loading a directory with 500G of data

I am fairly new to Spark and distributed computing. It's all very straightforward to load a CSV or text file that fits into driver memory, but here I have a real scenario and I am finding it difficult to figure out the approach.

I am trying to process around 500G of data in S3, made up of zip files. Since these are zip files, I am using ZipFileInputFormat as detailed here. It ensures a file is not split across partitions.

Here is my code:

val sc = new SparkContext(conf)

val inputLocation = args(0)
val emailData = sc.newAPIHadoopFile(inputLocation, classOf[ZipFileInputFormat], classOf[Text], classOf[BytesWritable])

// keep only the .txt entries and decode just the valid bytes (getBytes alone returns the padded backing array)
val filesRDD = emailData.filter(_._1.toString.endsWith(".txt"))
                        .map(x => new String(x._2.getBytes, 0, x._2.getLength))

This runs fine on an input of a few hundred MB, but as soon as the input exceeds the memory available on my cluster, I get an OutOfMemoryError.

What is the correct way to approach this issue?
- Should I create an RDD for each zip file, save the output to a file, and later load all the outputs into a separate RDD (roughly the sketch below)?
- Is there a way to load the base directory into the Spark context and have it partitioned?
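
For the first option, this is roughly what I have in mind (the bucket paths and outputBase below are just placeholders):

// process one archive at a time, write the extracted text out, then read it all back
val outputBase = "s3://my-bucket/extracted"   // placeholder
val zipPaths = Seq("s3://my-bucket/emails/a.zip", "s3://my-bucket/emails/b.zip")   // placeholder list of archives

zipPaths.zipWithIndex.foreach { case (path, i) =>
  val oneZip = sc.newAPIHadoopFile(path, classOf[ZipFileInputFormat], classOf[Text], classOf[BytesWritable])
  oneZip.filter(_._1.toString.endsWith(".txt"))
        .map(x => new String(x._2.getBytes, 0, x._2.getLength))
        .saveAsTextFile(s"$outputBase/zip-$i")
}

val allText = sc.textFile(s"$outputBase/*")   // load all the intermediate outputs later

But driving this loop from the driver, one archive at a time, feels wrong, which is why I am asking.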

I have an HDP cluster with 5 nodes and a master, each having 15G of memory.

Any answers/pointers/links are highly appreciated.

Upvotes: 2

Views: 224

Answers (1)

user6022341

Reputation:

Zip files are not splittable, so processing individual files won't do you any good. If you want this to scale out, you should avoid zip completely, or at least put a hard limit on the size of the archives.
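
If you are stuck with zip input, a rough sketch of what you can do instead (assuming java.util.zip and Spark's binaryFiles; the S3 path is a placeholder) is to stream each archive's entries inside a task. Each archive is still bound to a single task, so this only helps if you also cap the archive size:

import java.util.zip.ZipInputStream
import scala.collection.mutable.ArrayBuffer

// one record per archive: (path, PortableDataStream); nothing is read until a task opens the stream
val texts = sc.binaryFiles("s3://my-bucket/emails/*.zip").flatMap { case (_, stream) =>
  val zis = new ZipInputStream(stream.open())
  val out = ArrayBuffer[String]()
  var entry = zis.getNextEntry
  while (entry != null) {
    if (entry.getName.endsWith(".txt")) {
      // entries are decompressed one at a time, but all .txt contents of this
      // archive still end up in this single task
      out += scala.io.Source.fromInputStream(zis).mkString
    }
    entry = zis.getNextEntry
  }
  zis.close()
  out
}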

Upvotes: 0
