How to sort N files

Question

Following this answer -->

I need only the merge step for N already sorted files on disk: I want to merge them into one big file. My limitation is memory: no more than K lines may be held in memory (K < N), so I cannot load them all and then sort. A Java solution is preferred.

So far I have tried the code below, but I need a good way to iterate over all N files line by line (with no more than K lines in memory) and store the final sorted file on disk.

       public void run() {
            try {
                System.out.println(file1 + " Started Merging " + file2 );
                FileReader fileReader1 = new FileReader(file1);
                FileReader fileReader2 = new FileReader(file2);

                //......TODO with N ?? ......

                FileWriter writer = new FileWriter(file3);
                BufferedReader bufferedReader1 = new BufferedReader(fileReader1);
                BufferedReader bufferedReader2 = new BufferedReader(fileReader2);
                String line1 = bufferedReader1.readLine();
                String line2 = bufferedReader2.readLine();
                //Merge 2 files based on which string is greater.
                while (line1 != null || line2 != null) {
                    if (line1 == null || (line2 != null && line1.compareTo(line2) > 0)) {
                        writer.write(line2 + "\n");
                        line2 = bufferedReader2.readLine();
                    } else {
                        writer.write(line1 + "\n");
                        line1 = bufferedReader1.readLine();
                    }
                }
                System.out.println(file1 + " Done Merging " + file2 );
                new File(file1).delete();
                new File(file2).delete();
                writer.close();
            } catch (Exception e) {
                System.out.println(e);
            }
        }

regards,

Holger · Accepted Answer

You can use something like this:

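import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.AbstractMap;
import java.util.Map;
import java.util.PriorityQueue;

public static void mergeFiles(String target, String... input) throws IOException {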
    String lineBreak = System.getProperty("line.separator");
    PriorityQueue<Map.Entry<String, BufferedReader>> lines
        = new PriorityQueue<>(Map.Entry.comparingByKey());
    try(FileWriter fw = new FileWriter(target)) {
        String header = null;
        for(String file: input) {
            BufferedReader br = new BufferedReader(new FileReader(file));
            String line = br.readLine();
            if(line == null) br.close();
            else {
                if(header == null) fw.append(header = line).write(lineBreak);
                line = br.readLine();
                if(line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
                else br.close();
            }
        }
        for(;;) {
            Map.Entry<String, BufferedReader> next = lines.poll();
            if(next == null) break;
            fw.append(next.getKey()).write(lineBreak);
            final BufferedReader br = next.getValue();
            String line = br.readLine();
            if(line != null) lines.add(new AbstractMap.SimpleImmutableEntry<>(line, br));
            else br.close();
        }
    }
    catch(Throwable t) {
        for(Map.Entry<String, BufferedReader> br: lines) try {
            br.getValue().close();
        } catch(Throwable next) {
            if(t != next) t.addSuppressed(next);
        }
        throw t; // rethrow after cleanup so the caller still sees the failure
    }
}

Note that this code, unlike the code in your question, handles the header line. Like the original code, it will delete the input files. If that’s not intended, you can remove the DELETE_ON_CLOSE option and simplify the entire reader construction to
BufferedReader br = new BufferedReader(new FileReader(file));

It holds exactly as many lines in memory as you have files.

While it is possible in principle to hold fewer line strings in memory and re-read them when needed, it would be a performance disaster for a questionably small saving. E.g. you already have N strings in memory when calling this method, because you have N file names.

However, if you want to reduce the number of lines held at the same time at all costs, you can simply use the method shown in your question. Merge the first two files into a temporary file, merge that temporary file with the third into another temporary file, and so on, until merging the temporary file with the last input file produces the final result. Then you have at most two line strings in memory (K == 2), saving less memory than the operating system will use for buffering while trying to mitigate the horrible performance of this approach.
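The cascading two-file merge described above could be sketched roughly as follows. This is only an illustration, not the code from the question: the names `CascadeMerge`, `mergeTwo` and `mergeAll` are made up, lines are compared with natural `String` ordering, header lines are not handled, and temporary files go to the default temp directory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

class CascadeMerge {

    // Two-way merge of two sorted files; holds at most two lines at a time.
    static void mergeTwo(Path a, Path b, Path out) throws IOException {
        try (BufferedReader ra = Files.newBufferedReader(a);
             BufferedReader rb = Files.newBufferedReader(b);
             BufferedWriter w = Files.newBufferedWriter(out)) {
            String la = ra.readLine();
            String lb = rb.readLine();
            while (la != null || lb != null) {
                // Take from a when b is exhausted or a's line sorts first (ties go to a).
                if (lb == null || (la != null && la.compareTo(lb) <= 0)) {
                    w.write(la);
                    w.newLine();
                    la = ra.readLine();
                } else {
                    w.write(lb);
                    w.newLine();
                    lb = rb.readLine();
                }
            }
        }
    }

    // Folds the inputs left to right: ((f1 + f2) + f3) + ... into target.
    // Assumes target is not one of the inputs (hypothetical restriction of this sketch).
    static void mergeAll(List<Path> inputs, Path target) throws IOException {
        if (inputs.isEmpty()) {
            Files.write(target, new byte[0]); // nothing to merge: empty result
            return;
        }
        if (inputs.size() == 1) {
            Files.copy(inputs.get(0), target, StandardCopyOption.REPLACE_EXISTING);
            return;
        }
        Path acc = inputs.get(0);
        for (int i = 1; i < inputs.size(); i++) {
            boolean last = (i == inputs.size() - 1);
            Path out = last ? target : Files.createTempFile("merge", ".tmp");
            mergeTwo(acc, inputs.get(i), out);
            if (i > 1) Files.delete(acc); // acc was a temp file from the previous round
            acc = out;
        }
    }
}
```

Note that every round rewrites everything merged so far, so the total I/O grows roughly quadratically with the number of files, which is the performance cost mentioned above.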

Likewise, you can use the method shown above to merge K files into a temporary file, then merge the temporary file with the next K-1 files, and so on, until merging the temporary file with the remaining K-1 or fewer files produces the final result, giving a memory consumption that scales with K < N. This approach lets you tune K to a reasonable ratio to N, trading memory for speed. I think that in most practical cases, K == N will work just fine.
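A minimal sketch of this K-at-a-time variant might look like the following. To keep the block self-contained it includes a simplified priority-queue merge that skips the header handling, so it is not a drop-in replacement for the method above. The names `BatchedMerge`, `mergeInBatches` and `advance` are hypothetical, and temporary files are assumed to go to the default temp directory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class BatchedMerge {

    // k-way merge keeping one pending line per open file (no header handling).
    static void mergeFiles(Path target, List<Path> inputs) throws IOException {
        PriorityQueue<Map.Entry<String, BufferedReader>> lines =
                new PriorityQueue<>(Map.Entry.comparingByKey());
        try (BufferedWriter w = Files.newBufferedWriter(target)) {
            for (Path p : inputs) advance(Files.newBufferedReader(p), lines);
            Map.Entry<String, BufferedReader> next;
            while ((next = lines.poll()) != null) {
                w.write(next.getKey());
                w.newLine();
                advance(next.getValue(), lines);
            }
        } finally {
            // Only non-empty after a failure: close readers still waiting in the queue.
            for (Map.Entry<String, BufferedReader> e : lines) e.getValue().close();
        }
    }

    // Queues the next line of br, or closes br at end of file or on a read error.
    private static void advance(BufferedReader br,
            PriorityQueue<Map.Entry<String, BufferedReader>> lines) throws IOException {
        String line;
        try {
            line = br.readLine();
        } catch (IOException e) {
            br.close();
            throw e;
        }
        if (line != null) lines.add(new SimpleImmutableEntry<>(line, br));
        else br.close();
    }

    // Merges at most k files per round; a temp file carries the result forward.
    static void mergeInBatches(Path target, List<Path> inputs, int k) throws IOException {
        if (k < 2) throw new IllegalArgumentException("k must be at least 2");
        Path carry = null; // intermediate result of the previous round, if any
        int i = 0;
        do {
            List<Path> batch = new ArrayList<>();
            if (carry != null) batch.add(carry);
            while (i < inputs.size() && batch.size() < k) batch.add(inputs.get(i++));
            boolean last = (i == inputs.size());
            Path out = last ? target : Files.createTempFile("merge", ".tmp");
            mergeFiles(out, batch);
            if (carry != null) Files.delete(carry);
            carry = last ? null : out;
        } while (i < inputs.size());
    }
}
```

With `k` equal to or larger than the number of inputs, this degenerates to a single pass without temporary files, matching the K == N case.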
