Reputation: 905
I am working with 2 large input files of the order of 5 GB each. They are the output of a Hadoop MapReduce job, but as I am not able to do the dependency calculations in MapReduce, I am switching to an optimized for loop for the final calculations (see my previous question on MapReduce design: Recursive calculations using Mapreduce).
I would like suggestions on reading such huge files in Java and doing some basic operations; finally I will be writing out data of the order of around 5 GB.
I appreciate your help.
Upvotes: 2
Views: 2199
Reputation: 905
My approach:
I configured the MapReduce program to use 16 reducers, so the final output consisted of 16 files (part-00000 to part-00015) of 300+ MB each, and the keys were sorted in the same order in both input files.
Then at every stage I read the 2 corresponding input files (around 600 MB in total) and did the processing. So at every stage I had to hold about 600 MB in memory, which the system could manage pretty well.
The program was pretty quick: it took around 20 minutes for the complete processing.
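A rough sketch of one such stage, assuming Hadoop's default tab-separated key/value lines; the directory names (inputA, inputB, output) and the combining step are placeholders, not the actual calculation:

```java
import java.io.*;
import java.util.*;

public class StageProcessor {
    public static void main(String[] args) throws IOException {
        new File("output").mkdirs();
        for (int i = 0; i < 16; i++) {
            String part = String.format("part-%05d", i);
            // Load the matching part file from each MapReduce output (~600 MB per stage).
            Map<String, String> left = load(new File("inputA", part));
            Map<String, String> right = load(new File("inputB", part));
            try (PrintWriter out = new PrintWriter(new BufferedWriter(
                    new FileWriter(new File("output", part))))) {
                for (Map.Entry<String, String> e : left.entrySet()) {
                    String other = right.get(e.getKey());
                    if (other != null) {
                        // Stand-in for the real dependency calculation.
                        out.println(e.getKey() + "\t" + e.getValue() + "\t" + other);
                    }
                }
            }
        }
    }

    private static Map<String, String> load(File file) throws IOException {
        Map<String, String> map = new LinkedHashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t'); // Hadoop's default key/value separator
                map.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return map;
    }
}
```

Because the keys are sorted the same way in both outputs, each stage only ever needs one pair of part files in memory.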
Thanks for all the suggestions! I appreciate your help.
Upvotes: 0
Reputation: 128919
It sounds like there wouldn't be much to a simple implementation. Just open an InputStream/Reader for the file, then, in a loop: read one piece of your data, process it, and write out (or otherwise store) the result.
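A minimal sketch of that loop, assuming line-oriented text input (the file names and the per-record calculation are placeholders):

```java
import java.io.*;

public class StreamingPass {
    public static void main(String[] args) throws IOException {
        // Buffered streams keep memory use constant no matter how big the file is.
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"));
             PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("output.txt")))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Only the current record is held in memory.
                out.println(process(line));
            }
        }
    }

    private static String process(String line) {
        return line.toUpperCase(); // stand-in for the real per-record operation
    }
}
```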
If your result set will be too large to keep in memory, a simple way to fix that would be to use an H2 database with local file storage.
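As an illustration, a minimal sketch of writing results into an embedded, file-backed H2 database (assumes the h2 jar is on the classpath; the table, column, and key/value names are made up for the example):

```java
import java.sql.*;

public class H2ResultStore {
    public static void main(String[] args) throws SQLException {
        // Embedded H2 database stored in a local file (./results.mv.db).
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./results")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS result(k VARCHAR PRIMARY KEY, v VARCHAR)");
            }
            // Upsert one computed result; in practice this would run inside the processing loop.
            try (PreparedStatement ps =
                     conn.prepareStatement("MERGE INTO result KEY(k) VALUES(?, ?)")) {
                ps.setString(1, "someKey");
                ps.setString(2, "computedValue");
                ps.executeUpdate();
            }
        }
    }
}
```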
Upvotes: 1
Reputation: 6289
If the files have the properties you described, i.e. 100 integer values per key and 10 GB each, you are talking about a very large number of keys, far more than you can feasibly fit into memory. If you can sort the files before processing, for example with the OS sort utility or a MapReduce job with a single reducer, you can read the two files simultaneously, do your processing, and output the result without keeping too much data in memory.
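A sketch of that simultaneous read, assuming both files are already sorted by key, hold one tab-separated key/value pair per line, and have unique keys (the file names are placeholders):

```java
import java.io.*;

public class SortedMerge {
    // Merge-join two files that are both sorted by key, keeping only the
    // current line of each file in memory.
    public static void main(String[] args) throws IOException {
        try (BufferedReader a = new BufferedReader(new FileReader("sortedA.txt"));
             BufferedReader b = new BufferedReader(new FileReader("sortedB.txt"));
             PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("joined.txt")))) {
            String la = a.readLine();
            String lb = b.readLine();
            while (la != null && lb != null) {
                int cmp = key(la).compareTo(key(lb));
                if (cmp == 0) {           // matching keys: combine and advance both files
                    out.println(key(la) + "\t" + value(la) + "\t" + value(lb));
                    la = a.readLine();
                    lb = b.readLine();
                } else if (cmp < 0) {     // key only in A: advance A
                    la = a.readLine();
                } else {                  // key only in B: advance B
                    lb = b.readLine();
                }
            }
        }
    }

    private static String key(String line)   { return line.substring(0, line.indexOf('\t')); }
    private static String value(String line) { return line.substring(line.indexOf('\t') + 1); }
}
```

Only one line from each file is held in memory at a time, so this scales to arbitrarily large sorted inputs.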
Upvotes: 1