learner

Reputation: 905

Reading large input files (10GB) through a Java program

I am working with 2 large input files on the order of 5GB each. They are the output of a Hadoop MapReduce job, but as I am not able to do the dependency calculations in MapReduce, I am switching to an optimized for loop for the final calculations (see my previous question on MapReduce design: Recursive calculations using Mapreduce).

I would like suggestions on reading such huge files in Java and doing some basic operations on them. Finally, I will be writing out data on the order of around 5GB.

I appreciate your help

Upvotes: 2

Views: 2199

Answers (3)

learner

Reputation: 905

My approach:

I configured the MapReduce program to use 16 reducers, so the final output consisted of 16 files (part-00000 to part-00015) of 300+ MB each, and the keys were sorted in the same order in both input files.

Now at every stage I read one pair of input files (around 600 MB) and did the processing. So at every stage I only had to hold about 600 MB in memory, which the system could manage pretty well.
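
For illustration, a rough sketch of that stage loop (the directory layout and the process() step are placeholders, not my actual code):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class StageRunner {
        public static void main(String[] args) throws Exception {
            for (int part = 0; part < 16; part++) {
                String name = String.format("part-%05d", part);
                // Load the matching ~300 MB part from each input;
                // only one pair (~600 MB) is in memory at a time.
                List<String> a = Files.readAllLines(Paths.get("input1/" + name));
                List<String> b = Files.readAllLines(Paths.get("input2/" + name));
                process(a, b); // the dependency calculations for this stage
            }
        }

        static void process(List<String> a, List<String> b) {
            // placeholder for the actual per-stage calculation
        }
    }

Because the reducers sorted both outputs the same way, the keys in each pair of part files line up without any extra sorting step.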

The program was pretty quick, taking around 20 minutes for the complete processing.

Thanks for all the suggestions! I appreciate your help.

Upvotes: 0

Ryan Stewart

Reputation: 128919

It sounds like there wouldn't be much to a simple implementation. Just open an InputStream/Reader for the file, then, in a loop:

  1. Read in one piece of your data
  2. Process the piece of data
  3. Store the result: in memory if you'll have room for the complete dataset, in a database of some sort if not

If your result set will be too large to keep in memory, a simple way to fix that would be to use an H2 database with local file storage.
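A minimal sketch of that loop, assuming tab-separated key/value lines and H2's embedded JDBC driver on the classpath (the table layout and the doubling "computation" are placeholders):

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class StreamProcess {
        public static void main(String[] args) throws Exception {
            // Embedded H2 database backed by a local file, so the
            // result set doesn't have to fit in memory.
            try (Connection db = DriverManager.getConnection("jdbc:h2:./results", "sa", "");
                 BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
                db.createStatement().execute(
                    "CREATE TABLE IF NOT EXISTS result (k VARCHAR PRIMARY KEY, v DOUBLE)");
                PreparedStatement upsert =
                    db.prepareStatement("MERGE INTO result (k, v) KEY (k) VALUES (?, ?)");
                String line;
                while ((line = in.readLine()) != null) {
                    // 1. Read in one piece of your data (here: one line)
                    String[] fields = line.split("\t");
                    // 2. Process the piece of data (placeholder computation)
                    double value = Double.parseDouble(fields[1]) * 2;
                    // 3. Store the result in the database
                    upsert.setString(1, fields[0]);
                    upsert.setDouble(2, value);
                    upsert.executeUpdate();
                }
            }
        }
    }

At this scale you'd probably want to batch the inserts (addBatch/executeBatch), but the structure stays the same.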

Upvotes: 1

Olaf

Reputation: 6289

If the files have the properties you described, i.e. 100 integer values per key and 10GB each, you are talking about a very large number of keys, far more than you can feasibly fit into memory. If you can sort the files before processing, for example using the OS sort utility or a MapReduce job with a single reducer, you can read the two files simultaneously, do your processing, and output the result without keeping much data in memory.
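A minimal sketch of that simultaneous read over two key-sorted files (essentially a merge join; the tab-separated format and the combine step are assumptions):

    import java.io.BufferedReader;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SortedMerge {
        public static void main(String[] args) throws Exception {
            try (BufferedReader a = Files.newBufferedReader(Paths.get(args[0]));
                 BufferedReader b = Files.newBufferedReader(Paths.get(args[1]));
                 PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get(args[2])))) {
                String la = a.readLine(), lb = b.readLine();
                while (la != null && lb != null) {
                    int cmp = key(la).compareTo(key(lb));
                    if (cmp == 0) {
                        // Matching keys: combine the two records and emit.
                        out.println(key(la) + "\t" + combine(la, lb));
                        la = a.readLine();
                        lb = b.readLine();
                    } else if (cmp < 0) {
                        la = a.readLine(); // key only in file A; skip or handle
                    } else {
                        lb = b.readLine(); // key only in file B; skip or handle
                    }
                }
            }
        }

        // Assumes the key is the first tab-separated field of each line.
        static String key(String line) {
            return line.substring(0, line.indexOf('\t'));
        }

        static String combine(String la, String lb) {
            return la.substring(la.indexOf('\t') + 1) + "\t"
                 + lb.substring(lb.indexOf('\t') + 1);
        }
    }

Since both readers advance in key order, only one record from each file is held in memory at any time, regardless of file size.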

Upvotes: 1
