Pseudo

Reputation: 57

Proper handling of Gigantic input & processing strings

I've spent days trying to solve some key problems I'm running into, and I have not found a good answer yet.

I'm embarking on an academic (/learning) project that involves reading 3-50MB plain-text files on a regular basis, and eventually across millions of records (my current set is ~800,000 records).

Assuming the file can't be split() into chunks, what's the best way to pass this chunk between functions? Pass-by-value leads me to think (and, I believe, see) that passing a 50MB file to a function, and returning a 20-30MB result set, means I have wasted over 100MB of memory just passing the file, memory that's waiting to be reclaimed at GC. (Technically, the file can be split(), but those split()s are each 10MB large at times, and each must be held while processing.)

I've made significant changes to my overall project recently, and I want to design the processing portion better this time. My previous method primarily read and processed the data in the driver itself, without a data container. When I attempted to use a data container, I ended up with similar results. Here's the first method I used:

  1. Read entire 3-50 MB+ file to String
  2. Regex/split into 4-15 chunks (determined by XML-like tags in file)
  3. Pass 1-3 chunks to function A (Looking for certain data)
  4. Pass 4-5 more chunks to function B (Looking for different data, which won't exist in Function A chunks)
  5. Collect results back in driver function
  6. Stitch together the result set and write it to disk (I know now that I should create-and-append instead)

I can probably split as I read; however, even those splits can be 5MB each (or more), and I need to keep most of them in memory until the file is done processing (in case step 3 changes how step 4 works). Even worse, some input readLine() calls might return lines that are 1-2MB long themselves (before the \n).

So, what kind of design strategy would be best for handling these huge input files, and huge strings?

Upvotes: 0

Views: 54

Answers (1)

The Guy with The Hat

Reputation: 11132

Pass-by-value leads me to think (and, I believe, see) that passing a 50MB file to a function, and returning a 20-30MB result set, means I have wasted over 100MB of memory just passing the file, memory that's waiting to be reclaimed at GC.

Incorrect. Java passes object references by value; it does not copy the String itself, so passing a 50MB String to a method copies only the reference. What I would do is pass the (reference to the) string along with the start and end indices of the section of the string you want to process.

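void read()
{
    String input = /*your code here*/;
    // Only the reference to input is copied here; the indices select the section to process.
    process(input, 37, 17576);
}

void process(String input, int startIndex, int endIndex)
{
    /*your code here, e.g.
    for(int i = startIndex; i < endIndex; i++)
    {
        //do stuff with input.charAt(i)
    }*/
}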
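To make the idea concrete, here is a minimal runnable sketch of the index-range approach. The class name RangeDemo and the helpers countChar and echo are hypothetical, made up for illustration; the small sample string stands in for a multi-megabyte file.

```java
public class RangeDemo {
    // Hypothetical helper: counts occurrences of target within [startIndex, endIndex).
    // It reads the caller's String through the reference; nothing is copied.
    static int countChar(String input, char target, int startIndex, int endIndex) {
        int count = 0;
        for (int i = startIndex; i < endIndex; i++) {
            if (input.charAt(i) == target) {
                count++;
            }
        }
        return count;
    }

    // Hypothetical helper: hands back the reference it received, unchanged.
    static String echo(String received) {
        return received;
    }

    public static void main(String[] args) {
        String input = "alpha\nbeta\ngamma\n";

        // Newlines in the whole string, then only from index 6 onward.
        System.out.println(countChar(input, '\n', 0, input.length()));
        System.out.println(countChar(input, '\n', 6, input.length()));

        // Reference comparison: true means the method saw the same object, not a copy.
        System.out.println(echo(input) == input);
    }
}
```

The reference comparison in the last line is the point: the method receives the same String object the caller holds, so the size of the string does not affect the cost of the call.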

Also, if read and process are in the same class, you can just make the string a class field:

String input;

void read()
{
    input = /*your code here*/;
    // process() reads the field directly, so only the indices are passed.
    process(37, 17576);
}

void process(int startIndex, int endIndex)
{
    /*your code here, e.g.
    for(int i = startIndex; i < endIndex; i++)
    {
        //do stuff with input.charAt(i)
    }*/
}
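Since the question mentions regex over XML-like tags, the same start/end idea can probably be applied to regex matching as well. The sketch below uses java.util.regex.Matcher.region to restrict matching to an index range of the original string, so no substring has to be created. The class name RegionDemo, the helper findOffsets, and the sample tags are assumptions for illustration, not the asker's actual format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegionDemo {
    // Hypothetical helper: returns the start offsets of every match of pattern
    // that lies within [startIndex, endIndex) of input. Offsets are relative to
    // the whole string, not to the region.
    static List<Integer> findOffsets(String input, Pattern pattern, int startIndex, int endIndex) {
        List<Integer> offsets = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        matcher.region(startIndex, endIndex); // limit matching to the range
        while (matcher.find()) {
            offsets.add(matcher.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Sample data only; a real input would be the 3-50MB file contents.
        String input = "<a>one</a><b>two</b><a>three</a>";
        Pattern openA = Pattern.compile("<a>");

        // Search the whole string, then only from index 10 onward.
        System.out.println(findOffsets(input, openA, 0, input.length()));
        System.out.println(findOffsets(input, openA, 10, input.length()));
    }
}
```

The offsets returned this way can then be passed to methods like process(startIndex, endIndex) above, instead of holding separate multi-megabyte chunk strings.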

Upvotes: 2
