ѕтƒ
ѕтƒ

Reputation: 3647

How to split the input data and load it to RDD

I have terabytes of data to process using Apache Spark. I use the code sparkContext.binaryFiles(folderpath) to load all data from the folder. I think it loads the full data to RDD and cause OutOfMemory error.

How can I split the 1TB data to 250GBs and Let the RDD loads it?

Upvotes: 1

Views: 164

Answers (1)

Artur Nowak
Artur Nowak

Reputation: 5354

Unfortunately, binaryFiles loads every file as one entry in RDD. I assume that you have all the data in one file or just a couple of them.

Basically, you have two options:

  • Split the files into smaller ones, if it is possible (the actual method depends on the data format)
  • Implement InputFormat that understands your data format (or search for one that does it already) and correctly sets number of splits. Then, you can use sparkContext.hadoopFile() method to pass your input format.

Upvotes: 2

Related Questions