user262102

Reputation: 193

C# N way merge for external sort

What's the best way to implement an N way merge for N sorted files?

Let's say I have 9 sorted files with 10 records each. How do I merge these files to create a big file with 90 sorted records?

Upvotes: 7

Views: 5730

Answers (4)

Stefan Savev

Reputation: 1359

I would say don't use a priority queue and don't use IEnumerable; both are very slow.

Here is a fast way to sort or merge sorted files in external memory:

http://www.codeproject.com/KB/recipes/fast_external_sort.aspx

Upvotes: 0

Eric Lippert

Reputation: 660397

Addressing the comments in the other answer:

If you have a variable number of files, here's what I'd do. This is just a sketch to get the idea across; this code doesn't compile, I've gotten the method names wrong, and so on.

// initialize the data structures
var priorityQueue = new SortedDictionary<Record, Stream>();
var streams = new List<Stream>();
Stream outStream = null;
try
{
  // open the streams.
  outStream = OpenOutputStream();
  foreach(var filename in filenames)
    streams.Add(GetFileStream(filename));
  // initialize the priority queue
  foreach(var stream in streams)
  {
    var record = ReadRecord(stream);
    if (record != null)
      priorityQueue.Add(record, stream);
  }
  // the main loop
  while(!priorityQueue.IsEmpty)
  {
     var record = priorityQueue.Smallest;
     var smallestStream = priorityQueue[record];
     WriteRecord(record, outStream);
     priorityQueue.Remove(record);
     var newRecord = ReadRecord(smallestStream);
     if (newRecord != null)
       priorityQueue.Add(newRecord, smallestStream);
  }
}
finally
{
  // clean up the streams
  foreach(var stream in streams)
    stream.Dispose();
  if (outStream != null)
    outStream.Dispose();
}

Does that make sense? You just keep on grabbing the smallest thing out of the priority queue and replacing it with the next record in that stream, if there is one. Eventually the queue will be empty and you'll be done.
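
For what it's worth, the only members the sketch leans on that SortedDictionary doesn't actually have are IsEmpty and Smallest. Assuming Record implements IComparable<Record>, and keeping the hypothetical ReadRecord/WriteRecord helpers from above (plus using System.Linq for First()), a compiling version of the main loop might look like this:

// the same loop with real SortedDictionary members;
// Record, ReadRecord and WriteRecord are still the hypothetical pieces from the sketch
while (priorityQueue.Count > 0)               // stands in for IsEmpty
{
    var smallest = priorityQueue.First();     // smallest key; stands in for Smallest
    WriteRecord(smallest.Key, outStream);
    priorityQueue.Remove(smallest.Key);
    var newRecord = ReadRecord(smallest.Value);
    if (newRecord != null)
        priorityQueue.Add(newRecord, smallest.Value);
}

One caveat either way: SortedDictionary rejects duplicate keys, so two streams whose current records compare equal would need a tie-breaking comparer or a small collection of streams per key.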

Upvotes: 6

Mikael Svenson

Reputation: 39695

The strategy might depend on the amount of data.

  1. If the data will fit in memory, you can read it all into a list, sort it, and write it out (see the short sketch right after this list)
  2. If you want to remove duplicates, use a HashSet instead of a list
  3. If it will not fit in memory, open all files for reading, compare the first record of each file, and write out the lowest. Then advance the file you read that record from. Loop over all files until they are all exhausted and written to the new file.
  4. If you want to remove duplicates, do as above, but skip any record equal to the last one written.
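
First, a minimal sketch of options 1 and 2 (the paths are just examples; the input folder matches the merge code further down):

// Option 1: everything fits in memory - read all lines, sort, write
var lines = new List<string>();
foreach (var file in Directory.GetFiles(@"C:\temp\files", "*.txt"))
    lines.AddRange(File.ReadAllLines(file));
lines.Sort();
File.WriteAllLines(@"C:\temp\merged.txt", lines);

// Option 2: drop duplicates with a HashSet before sorting
var unique = new List<string>(new HashSet<string>(lines));
unique.Sort();
File.WriteAllLines(@"C:\temp\merged-unique.txt", unique);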

Here's a code example which reads in N sorted text files and merges them. I have not included duplicate checking, but it should be easy to implement.

First a helper class.

using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

class MergeFile : IEnumerator<string>
{
    private readonly StreamReader _reader;

    public MergeFile(string file)
    {
        _reader = File.OpenText(file);
        Current = _reader.ReadLine();
    }

    public string Current { get; set; }

    public void Dispose()
    {
        _reader.Close();
    }

    public bool MoveNext()
    {
        Current = _reader.ReadLine();
        return Current != null;
    }

    public void Reset()
    {
        throw new NotImplementedException();
    }

    object IEnumerator.Current
    {
        get { return Current; }
    }
}

And then code to read and merge (it should be refactored for clarity in production):

// Get the file names and instantiate our helper class
List<IEnumerator<string>> files = Directory.GetFiles(@"C:\temp\files", "*.txt")
    .Select(file => new MergeFile(file))
    .Cast<IEnumerator<string>>()
    .ToList();
List<string> result = new List<string>();
IEnumerator<string> next = null;
while (true)
{
    bool done = true;
    // loop over the helpers
    foreach (var mergeFile in files)
    {
        done = false;
        if (next == null || string.Compare(mergeFile.Current, next.Current) <= 0)
        {
            next = mergeFile;
        }
    }
    if (done) break;
    result.Add(next.Current);
    if (!next.MoveNext())
    {
        // file is exhausted, dispose and remove from list
        next.Dispose();
        files.Remove(next);
        next = null;
    }
}
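
To finish the merge as described in step 3 above, the result list can then be written out; the output path here is just an example:

File.WriteAllLines(@"C:\temp\merged.txt", result);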

Upvotes: 0

Mark Byers

Reputation: 838826

I'm assuming that there could be a lot more data than you gave in your example. If you can open all the files simultaneously, you can use this algorithm:

  • Read the first line from each file, so you have one line in memory per file (9 lines in your example).
  • Put the lines into a priority queue by sort order.
  • Take the least element (sorted first) out of the priority queue and write to the output file.
  • Read one more line from the corresponding file the line came from and put that into the priority queue.
  • Repeat until all files are read to the end.

Note that you don't have to read all the files into memory at once, so this will work well if you have a reasonable number of large files, but not if you have a lot of small files.

If you have a lot of small files, you should merge them in groups to make a single output file for each group, then repeat the process to merge these new groups.

In C# you can use, for example, a SortedDictionary to implement the priority queue.
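
As a rough sketch of that approach (the class name, paths, and the queue-of-readers-per-line workaround are mine; SortedDictionary cannot hold duplicate keys, so equal lines coming from different files need that extra level):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class KWayMerge
{
    // Merges pre-sorted text files into one sorted output file,
    // holding only one line per open file in memory at a time.
    public static void Merge(IEnumerable<string> inputFiles, string outputFile)
    {
        // SortedDictionary keeps its keys in sort order; a queue of readers
        // per key handles files that are currently positioned on equal lines.
        var queue = new SortedDictionary<string, Queue<StreamReader>>(StringComparer.Ordinal);
        var readers = new List<StreamReader>();
        try
        {
            using (var writer = File.CreateText(outputFile))
            {
                // open every input file and prime the queue with its first line
                foreach (var file in inputFiles)
                {
                    var reader = File.OpenText(file);
                    readers.Add(reader);
                    Enqueue(queue, reader);
                }

                while (queue.Count > 0)
                {
                    // take the smallest line and the reader it came from
                    var smallest = queue.First();
                    var source = smallest.Value.Dequeue();
                    if (smallest.Value.Count == 0)
                        queue.Remove(smallest.Key);
                    writer.WriteLine(smallest.Key);
                    // refill from the same file, if it has more lines
                    Enqueue(queue, source);
                }
            }
        }
        finally
        {
            foreach (var reader in readers)
                reader.Dispose();
        }
    }

    private static void Enqueue(SortedDictionary<string, Queue<StreamReader>> queue, StreamReader reader)
    {
        var line = reader.ReadLine();
        if (line == null)
            return; // this file is exhausted
        Queue<StreamReader> readersAtLine;
        if (!queue.TryGetValue(line, out readersAtLine))
        {
            readersAtLine = new Queue<StreamReader>();
            queue.Add(line, readersAtLine);
        }
        readersAtLine.Enqueue(reader);
    }
}

Calling KWayMerge.Merge(Directory.GetFiles(@"C:\temp\input", "*.txt"), @"C:\temp\merged.txt") would merge the 9 files from the question while holding only 9 lines in memory at any time.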

Upvotes: 6
