Phoenix
Phoenix

Reputation: 8923

How do I merge individual files into a single sorted file?

I've a bunch of id's in 4 files a,b,c,d saved in my workspace. I want to merge all these id's in a sorted order in a single file merged.txt .. They will be saved one per line as a string. I could individually sort files by bringing them into memory. But how do I merge them, there can be duplicate entries. I can't think of how to compare each entry in the four files (they can grow to 8 so can't hardcode this). Particularly how do I compare the entries and how do I advance only those file pointers that are the smallest of the entries ?

public void sortFile() throws IOException
{
    File a = new File("/Users/phoenix/workspace/data/a.txt");
    File b = new File("/Users/phoenix/workspace/data/b.txt");
    File c = new File("/Users/phoenix/workspace/data/c.txt");
    File d = new File("/Users/phoenix/workspace/data/d.txt");

    doSort(a);
    doSort(b);
    doSort(c);
    doSort(d);

    merge();
}

How do I modify the merge method according to pseudocode below ?

public void merge()
{
    File dir = new File("/Users/phoenix/workspace/data");

    for(File f: dir.listFiles())
    {
        // toDo: merge into a single file merged.txt
    }
}

public void doSort(File f) throws IOException
{
    BufferedReader reader = new BufferedReader(new FileReader(f));
    String line;
    ArrayList<String> list = new ArrayList<String>();
    while((line = reader.readLine())!=null)
    {
        list.add(line);
    }

    Collections.sort(list);

    PrintWriter out = new PrintWriter(f);

    for(String s:list)
    out.println(s);

    reader.close();
    out.close();
}


    public void merge() throws IOException
    {
        File dir = new File("/Users/phoenix/workspace/data");
        File merged = new File("/Users/phoenix/workspace/data/merged.txt");

        ArrayList<BufferedReader> readers = new ArrayList<BufferedReader>(dir.listFiles().length);
        ArrayList<String> list = new ArrayList<String>();
        PrintWriter out = new PrintWriter(merged);

        for(File f: dir.listFiles())
        {
            readers.add(new BufferedReader(new FileReader(f)));
        }

        while(true)
        {
        for (BufferedReader reader: readers)
        {
            if(reader.readLine()!=null)
                list.add(reader.readLine());

            else
            {
                reader.close();
            }

        }

        String min = Collections.min(list);
        int index = list.indexOf(min);
        out.write(min);
    }



 }

Upvotes: 2

Views: 1864

Answers (3)

Sergey Kalinichenko
Sergey Kalinichenko

Reputation: 726589

Here is a general description of the algorithm:

  1. Open each file, and read its first item.
  2. Go through all files, and select one file with the smallest current item; if multiple files have the same item, pick the first file that has the item
  3. Remove the smallest item from the file that you identified in step 2, and write it to the output file
  4. If the file from which you removed the item has ended, close the file, and remove it from the list of files
  5. If the list of files is not empty, go back to step 2.

Before going through with the algorithm your code needs to check that at least one input file is present; otherwise, your code should exit.

EDIT : Your merge code does not look much like the algorithm above; here is some code to help you get started:

// Prepare your readers and their top items
for(File f: dir.listFiles()) {
    BufferedReader br = new BufferedReader(new FileReader(f));
    String firstLine = reader.readLine();
    // Your code inserts buffered readers unconditionally;
    // You should not insert readers for empty files.
    if (firstLine != null) {
        readers.add(br);
        list.add(firstLine);
    } else {
        br.close();
    }
}
// Stop when the last reader is removed
while (!readers.isEmpty()) {
    int minIndex = ... // Find the index of the smallest item in the "list"
    out.write(list.get(minIndex));
    BufferedReader br = readers.get(minIndex);
    String next = br.readLine();
    if (next != null) {
        list.set(minIndex, next);
    } else {
        br.close();
        list.remove(minIndex);
        readers.remove(minIndex);
    }
}

Upvotes: 1

chamakits
chamakits

Reputation: 1935

Are you looking to solve your problem, or solve it in Java.

If you are simply looking for ways to do this, and have access to the terminal, and by "sorting" you mean sorting alphabetically, you can do it simpler.

cat "/Users/phoenix/workspace/data/a.txt" "/Users/phoenix/workspace/data/b.txt" "/Users/phoenix/workspace/data/c.txt" "/Users/phoenix/workspace/data/d.txt"|sort > merged.txt

For sorting and only picking up the uniq ones

cat "/Users/phoenix/workspace/data/a.txt" "/Users/phoenix/workspace/data/b.txt" "/Users/phoenix/workspace/data/c.txt" "/Users/phoenix/workspace/data/d.txt"|sort |uniq > merged.txt

Update: BTW, to sort numerically, use

sort -n

Upvotes: 2

Evgeniy Dorofeev
Evgeniy Dorofeev

Reputation: 136022

Read each file into a List

List<String> list1 = Files.readAllLines(Path.get(path), StandardCharsets.UTF_8);
...

merge lists1 in one list

List<String> list = new ArrayList<>();
list.addAll(list1);
...

now sort the lines

Collections.sort(list);

and write them into a single file.

Note: If you dont want duplicate lines use TreeSet instead of ArrayList

Upvotes: 0

Related Questions