Reputation: 3647
I have terabytes of data to process using Apache Spark. I use sparkContext.binaryFiles(folderpath)
to load all of the data from the folder.
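For reference, the load looks roughly like this (a minimal Scala sketch; the folder path and the processing step are just placeholders):

```scala
import org.apache.spark.sql.SparkSession

object BinaryLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("binary-load").getOrCreate()
    val sc = spark.sparkContext

    // Each file under the folder becomes one (path, PortableDataStream) record.
    val files = sc.binaryFiles("/data/folderpath")

    // Materializing a whole file with toArray() pulls it entirely into memory,
    // which is where very large files can trigger OutOfMemory errors.
    files.mapValues(_.toArray().length).collect().foreach(println)

    spark.stop()
  }
}
```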
I think it loads the full data into an RDD and causes an OutOfMemory error.
How can I split the 1 TB of data into 250 GB chunks and have the RDD load them?
Upvotes: 1
Views: 164
Reputation: 5354
Unfortunately, binaryFiles loads every file as a single entry in the RDD. I assume that you have all of the data in one file, or just a couple of them.
Basically, you have two options:

1. Split your data into smaller files up front, if your format allows it.
2. Implement your own InputFormat that understands your data format (or search for one that already exists) and correctly sets the number of splits. Then you can use the sparkContext.hadoopFile() method to pass in your input format; see the sketch after this list.
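As a rough illustration of the second option, here is a sketch that uses the built-in TextInputFormat purely as a stand-in for whatever InputFormat actually matches your data; the path and the 4000 minimum partitions are placeholder values:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.sql.SparkSession

object SplittableLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("splittable-load").getOrCreate()
    val sc = spark.sparkContext

    // A splittable InputFormat lets Spark break one huge file into many splits,
    // each read by its own task, instead of one record per file.
    val rdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](
      "/data/folderpath",
      4000) // minimum number of partitions (placeholder)

    // Copy out of the reused Hadoop Writable before caching or collecting.
    val lines = rdd.map { case (_, text) => text.toString }
    println(lines.count())

    spark.stop()
  }
}
```

Because the input format reports splits, each task reads only its own slice of the file, so no single executor has to hold the whole 1 TB in memory.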
Upvotes: 2