Kayser

Reputation: 6694

How do I sort very large files

I have some files that should be sorted according to the id at the beginning of each line. The files are about 2-3 GB.

I tried to read all the data into an ArrayList and sort it, but there is not enough memory to hold it all, so that does not work.

Lines look like

0052304 0000004000000000000000000000000000000041 John Teddy 000023
0022024 0000004000000000000000000000000000000041 George Clan 00013

How can I sort the files?

Upvotes: 35

Views: 57385

Answers (10)

R Sun

Reputation: 1671

I used my own logic to sort a big JSON file with records in this format:

{"name":"hoge.finance","address":"0xfAd45E47083e4607302aa43c65fB3106F1cd7607"}

Full source code is available at https://github.com/sitetester/token-sorter along with a test case. The code is well documented, so it's easy to understand.

It splits the input file into multiple smaller sorted files (the number is configurable) and then compares data across them.

Pasting some comments here...

// at this point, we have sorted data sets in their respective files
// next, we take the first token from the first file and compare it with the tokens of all other files
// during comparison, if some token from another file sorts earlier, we make it the default/initial sorted token
// & jump to the next file, since all remaining tokens in THAT file are already in sorted form
// after comparing against all files, we remove the winning token from its file (so it isn't compared next time) and append it to the final sorted file
// this process continues until all entries are consumed
// if some file has no entries left, we simply delete it (so it isn't compared next time)
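
What these comments describe is essentially a k-way merge of already-sorted chunk files. Below is a minimal, hypothetical sketch of the same idea in Java (it is not the repository's code: record parsing is omitted, the chunk files are plain sorted text, and a PriorityQueue replaces the linear scan over the files' head entries):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

    // one sorted chunk file together with its current head line
    static class Head {
        final BufferedReader reader;
        String line;
        Head(BufferedReader reader, String line) { this.reader = reader; this.line = line; }
    }

    public static void merge(List<Path> sortedChunks, Path output) throws IOException {
        // the smallest current line across all chunk files sits on top of the queue
        PriorityQueue<Head> queue = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
        for (Path chunk : sortedChunks) {
            BufferedReader reader = Files.newBufferedReader(chunk);
            String first = reader.readLine();
            if (first != null) queue.add(new Head(reader, first));
            else reader.close(); // empty chunk: nothing to merge
        }
        try (BufferedWriter writer = Files.newBufferedWriter(output)) {
            while (!queue.isEmpty()) {
                Head smallest = queue.poll();
                writer.write(smallest.line);
                writer.newLine();
                String next = smallest.reader.readLine();
                if (next != null) {
                    smallest.line = next;
                    queue.add(smallest); // reinsert with its new head line
                } else {
                    smallest.reader.close(); // chunk exhausted, drop it
                }
            }
        }
    }
}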

Upvotes: 0

Dave Moten

Reputation: 12087

Use the Java library big-sorter, which can sort very large text or binary files.

Here's how your exact problem would be implemented:

// write the input to a file
// (imports assumed: java.io.File, java.nio.charset.StandardCharsets,
//  java.nio.file.Files, and com.github.davidmoten.bigsorter.Sorter)
String s = "0052304 0000004000000000000000000000000000000041   John Teddy   000023\n"
        + "0022024 0000004000000000000000000000000000000041   George Clan 00013";
File input = new File("target/input");
Files.write(input.toPath(), s.getBytes(StandardCharsets.UTF_8)); // default options create the file if needed

File output = new File("target/output");


//sort the input
Sorter
    .serializerLinesUtf8()
    .comparator((a,b) -> {
        String ida = a.substring(0, a.indexOf(' '));
        String idb = b.substring(0, b.indexOf(' '));
        return ida.compareTo(idb);
    }) 
    .input(input) 
    .output(output) 
    .sort();

// display the output
Files.readAllLines(output.toPath()).forEach(System.out::println);

output:

0022024 0000004000000000000000000000000000000041   George Clan 00013
0052304 0000004000000000000000000000000000000041   John Teddy   000023

Upvotes: 3

sapy

Reputation: 9592

You need to perform an external sort. It's essentially the driving idea behind Hadoop/MapReduce, except that it doesn't involve a distributed cluster and works on a single node.

For better performance, you should use Hadoop/Spark.

Change these lines according to your system. fPath is your one big input file (tested with 20 GB). shared is where the execution log is stored. fdir is where the intermediate files will be stored and merged.

public static final String fdir = "/tmp/";
public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
public static final String fPath = "/input/data-20GB.in";
public static final String opLog = shared + "Mysort20GB.log";

Then run the following program. Your final sorted file will be created with the name op401 in the fdir path. The last line, the Runtime.getRuntime().exec(...) call, runs valsort to check whether the output is sorted. Remove this line if you don't have valsort installed or if the input file was not generated using gensort (http://www.ordinal.com/gensort.html).

Also, don't forget to change int totalLines = 200000000; to the total number of lines in your file. The thread count (int threadCount = 16) should always be a power of two, and large enough that (total size * 2 / thread count) of data can reside in memory. Changing the thread count changes the name of the final output file: for 16 it will be op401, for 32 it will be op501, for 8 it will be op301, and so on.

Enjoy.

    import java.io.*;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.stream.Stream;


    class SplitFile extends Thread {
        String fileName;
        int startLine, endLine;

        SplitFile(String fileName, int startLine, int endLine) {
            this.fileName = fileName;
            this.startLine = startLine;
            this.endLine = endLine;
        }

        public static void writeToFile(BufferedWriter writer, String line) {
            try {
                writer.write(line + "\r\n");
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public void run() {
            try {
                BufferedWriter writer = Files.newBufferedWriter(Paths.get(fileName));
                int totalLines = endLine + 1 - startLine;
                Stream<String> chunks =
                        Files.lines(Paths.get(Mysort20GB.fPath))
                                .skip(startLine - 1)
                                .limit(totalLines)
                                .sorted(Comparator.naturalOrder());

                chunks.forEach(line -> {
                    writeToFile(writer, line);
                });
                System.out.println(" Done Writing " + Thread.currentThread().getName());
                writer.close();
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }

    class MergeFiles extends Thread {
        String file1, file2, file3;
        MergeFiles(String file1, String file2, String file3) {
            this.file1 = file1;
            this.file2 = file2;
            this.file3 = file3;
        }

        public void run() {
            try {
                System.out.println(file1 + " Started Merging " + file2 );
                FileReader fileReader1 = new FileReader(file1);
                FileReader fileReader2 = new FileReader(file2);
                FileWriter writer = new FileWriter(file3);
                BufferedReader bufferedReader1 = new BufferedReader(fileReader1);
                BufferedReader bufferedReader2 = new BufferedReader(fileReader2);
                String line1 = bufferedReader1.readLine();
                String line2 = bufferedReader2.readLine();
                //Merge 2 files based on which string is greater.
                while (line1 != null || line2 != null) {
                    if (line1 == null || (line2 != null && line1.compareTo(line2) > 0)) {
                        writer.write(line2 + "\r\n");
                        line2 = bufferedReader2.readLine();
                    } else {
                        writer.write(line1 + "\r\n");
                        line1 = bufferedReader1.readLine();
                    }
                }
                System.out.println(file1 + " Done Merging " + file2 );
                new File(file1).delete();
                new File(file2).delete();
                writer.close();
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }

    public class Mysort20GB {
        //public static final String fdir = "/Users/diesel/Desktop/";
        public static final String fdir = "/tmp/";
        public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
        public static final String fPath = "/input/data-20GB.in";
        public static final String opLog = shared+"Mysort20GB.log";

        public static void main(String[] args) throws Exception{
            long startTime = System.nanoTime();
            int threadCount = 16; // Number of threads
            int totalLines = 200000000;
            int linesPerFile = totalLines / threadCount;
            ArrayList<Thread> activeThreads = new ArrayList<Thread>();

            for (int i = 1; i <= threadCount; i++) {
                int startLine = i == 1 ? i : (i - 1) * linesPerFile + 1;
                int endLine = i * linesPerFile;
                SplitFile mapThreads = new SplitFile(fdir + "op" + i, startLine, endLine);
                activeThreads.add(mapThreads);
                mapThreads.start();
            }
            activeThreads.stream().forEach(t -> {
                try {
                    t.join();
                } catch (Exception e) {
                }
            });

            int treeHeight = (int) (Math.log(threadCount) / Math.log(2));

            for (int i = 0; i < treeHeight; i++) {
                ArrayList<Thread> actvThreads = new ArrayList<Thread>();

                for (int j = 1, itr = 1; j <= (threadCount >> i); j += 2, itr++) { // (threadCount >> i) files remain at level i
                    int offset = i * 100;
                    String tempFile1 = fdir + "op" + (j + offset);
                    String tempFile2 = fdir + "op" + ((j + 1) + offset);
                    String opFile = fdir + "op" + (itr + ((i + 1) * 100));

                    MergeFiles reduceThreads =
                            new MergeFiles(tempFile1,tempFile2,opFile);
                    actvThreads.add(reduceThreads);
                    reduceThreads.start();
                }
                actvThreads.stream().forEach(t -> {
                    try {
                        t.join();
                    } catch (Exception e) {
                    }
                });
            }
            long endTime = System.nanoTime();
            double timeTaken = (endTime - startTime)/1e9;
            System.out.println(timeTaken);
            BufferedWriter logFile = new BufferedWriter(new FileWriter(opLog, true));
            logFile.write("Time Taken in seconds:" + timeTaken);
            Runtime.getRuntime().exec("valsort  " + fdir + "op" + (treeHeight*100)+1 + " > " + opLog);
            logFile.close();
        }
    }

Upvotes: 3

rjh

Reputation: 50294

Since your records are already in flat-file text format, you can pipe them into UNIX sort(1), e.g. sort -n -t' ' -k1,1 < input > output. It will automatically chunk the data and perform a merge sort using available memory and /tmp. If you need more space than you have memory available, add -T /tmpdir to the command.

It's quite funny that everyone is telling you to download huge C# or Java libraries or implement merge-sort yourself when you can use a tool that is available on every platform and has been around for decades.
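
If you would rather drive the same tool from Java, here is a minimal sketch (assuming sort is on the PATH; the input and output file names are placeholders):

import java.io.File;
import java.io.IOException;

public class UnixSortRunner {
    public static void main(String[] args) throws IOException, InterruptedException {
        // equivalent of: sort -n -t' ' -k1,1 -T /tmp input > output
        ProcessBuilder pb = new ProcessBuilder(
                "sort", "-n", "-t", " ", "-k1,1", "-T", "/tmp", "input");
        pb.redirectOutput(new File("output"));            // the shell's "> output"
        pb.redirectError(ProcessBuilder.Redirect.INHERIT); // show sort's errors on stderr
        int exitCode = pb.start().waitFor();
        if (exitCode != 0) {
            throw new IOException("sort failed with exit code " + exitCode);
        }
    }
}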

Upvotes: 24

Ingo Kegel

Reputation: 47995

You need an external merge sort to do that. Here is a Java implementation of it that sorts very large files.

Upvotes: 19

Rishi Dua

Reputation: 2334

Operating systems come with a powerful file-sorting utility. A simple function that calls a bash script should help.

public static void runScript(final Logger log, final String scriptFile) throws IOException, InterruptedException {
    final String command = scriptFile;
    final File script = new File(command);
    if (!script.exists() || !script.canRead() || !script.canExecute()) {
        log.log(Level.SEVERE, "Cannot find or read " + command);
        log.log(Level.WARNING, "Make sure the file is executable and you have permissions to execute it. Hint: use \"chmod +x filename\" to make it executable");
        throw new IOException("Cannot find or read " + command);
    }
    final int returnCode = Runtime.getRuntime().exec(new String[] {"bash", "-c", command}).waitFor();
    if (returnCode != 0) {
        log.log(Level.SEVERE, "The script returned an error with exit code: " + returnCode);
        throw new IOException("Script failed with exit code " + returnCode);
    }
}
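
Hypothetical usage (the script path is made up, and the method above is assumed to be in scope; the script itself could contain the sort(1) command from rjh's answer above):

// e.g. /home/user/sort.sh could contain: sort -n -t' ' -k1,1 input > output
runScript(Logger.getLogger("sorter"), "/home/user/sort.sh");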

Upvotes: 0

user2071703

Reputation: 324

You can use an SQLite file database: load the data into the db and then let it sort and return the results for you.

Advantage: no need to worry about writing the best sorting algorithm.

Disadvantages: you will need disk space, and processing is slower.

https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files
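
As a rough sketch of this approach, assuming the sqlite-jdbc driver is on the classpath (the database, table, and file names are made up for illustration):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqliteSort {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:/tmp/sort.db");
             BufferedReader in = Files.newBufferedReader(Paths.get("input.txt"));
             BufferedWriter out = Files.newBufferedWriter(Paths.get("output.txt"))) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE lines (id TEXT, line TEXT)");
            }
            conn.setAutoCommit(false); // batch all inserts in one transaction for speed
            try (PreparedStatement ins = conn.prepareStatement("INSERT INTO lines VALUES (?, ?)")) {
                String line;
                while ((line = in.readLine()) != null) {
                    ins.setString(1, line.substring(0, line.indexOf(' '))); // leading id
                    ins.setString(2, line);
                    ins.executeUpdate();
                }
            }
            conn.commit();
            // let SQLite do the sorting (it spills to disk as needed)
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT line FROM lines ORDER BY id")) {
                while (rs.next()) {
                    out.write(rs.getString(1));
                    out.newLine();
                }
            }
        }
    }
}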

Upvotes: 2

Peter Lawrey

Reputation: 533520

Instead of loading all the data into memory at once, you could read just the keys and an index to where the line starts (and possibly the length as well) e.g.

class Line {
   int key, length;
   long start;
}

This would use about 40 bytes per line.

Once you have sorted this array, you can use RandomAccessFile to read the lines in the order they appear.

Note: since you will be hitting the disk randomly rather than using memory, this could be very slow. A typical disk takes 8 ms to randomly access data, so if you have 10 million lines this will take about one day (that is the absolute worst case). In memory it would take about 10 seconds.
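
Here is a minimal sketch of that two-pass idea (illustrative only: it assumes ASCII input with Unix line endings, so byte offsets can be derived from string lengths, and ids that fit in an int):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class KeyIndexSort {

    // 16 bytes of fields per line, plus object and list overhead
    static class Line {
        final int key, length;
        final long start;
        Line(int key, long start, int length) { this.key = key; this.start = start; this.length = length; }
    }

    public static void main(String[] args) throws IOException {
        List<Line> index = new ArrayList<>();
        long offset = 0;
        // pass 1: read only the keys, remembering where each line starts
        try (BufferedReader in = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                int key = Integer.parseInt(line.substring(0, line.indexOf(' ')));
                index.add(new Line(key, offset, line.length()));
                offset += line.length() + 1; // +1 for '\n': assumes ASCII and Unix line endings
            }
        }
        index.sort((a, b) -> Integer.compare(a.key, b.key));
        // pass 2: random-access the lines in sorted key order
        try (RandomAccessFile raf = new RandomAccessFile("input.txt", "r");
             BufferedWriter out = new BufferedWriter(new FileWriter("output.txt"))) {
            for (Line l : index) {
                byte[] buf = new byte[l.length];
                raf.seek(l.start);
                raf.readFully(buf);
                out.write(new String(buf, StandardCharsets.US_ASCII));
                out.newLine();
            }
        }
    }
}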

Upvotes: 7

pcalcao

Reputation: 15990

That isn't exactly a Java problem. You need to look into an efficient algorithm for sorting data that isn't completely read into memory. A few adaptations to merge sort can achieve this.

Take a look at this: http://en.wikipedia.org/wiki/Merge_sort

and: http://en.wikipedia.org/wiki/External_sorting

Basically, the idea is to break the file into smaller pieces, sort the pieces (either with merge sort or another method), and then use the merge step from merge sort to create the new, sorted file.
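
A minimal sketch of the splitting phase (chunk size and file names are arbitrary choices; it relies on the ids being fixed-width and zero-padded, as in the question, so plain string order sorts by id). The resulting chunks can then be merged pairwise, or in one pass with a priority queue as in the k-way merge sketch earlier on this page:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ChunkSplitter {

    // reads the input in chunks that fit in memory, sorts each chunk,
    // and writes it out as its own sorted temp file
    public static List<Path> splitIntoSortedChunks(Path input, int linesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>(linesPerChunk);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == linesPerChunk) {
                    chunks.add(writeSortedChunk(buffer, chunks.size()));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer, chunks.size()));
        }
        return chunks;
    }

    private static Path writeSortedChunk(List<String> lines, int n) throws IOException {
        Collections.sort(lines); // lines start with a fixed-width id, so natural order sorts by id
        Path chunk = Paths.get("chunk-" + n + ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(chunk)) {
            for (String l : lines) {
                out.write(l);
                out.newLine();
            }
        }
        return chunk;
    }
}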

Upvotes: 40

Woot4Moo

Reputation: 24316

What you need to do is chunk the files in via a stream and process them separately. Then you can merge the files together, as they will already be sorted; this is similar to how merge sort works.

The answer from this SO question will be of value: Stream large files

Upvotes: 1
