LUIS PEREIRA

Reputation: 488

What is the best way to merge large files?

I have to merge thousands of large files (~200 MB each). I would like to know the best way to merge these files. Lines will be conditionally copied to the merged file. Should I use File.AppendAllLines or Stream.CopyTo?

Using File.AppendAllLines

for (int i = 0; i < countryFiles.Length; i++){
   string srcFileName = countryFiles[i];
   string[] countryExtractLines = File.ReadAllLines(srcFileName);  
   File.AppendAllLines(actualMergedFileName, countryExtractLines);
}

Using Stream.CopyTo

using (Stream destStream = File.OpenWrite(actualMergedFileName)){
  foreach (string srcFileName in countryFiles){
    using (Stream srcStream = File.OpenRead(srcFileName)){
        srcStream.CopyTo(destStream);
    }
  }
}

Upvotes: 5

Views: 2624

Answers (3)

Matthew Watson

Reputation: 109567

Suppose you have a condition which must be true (i.e. a predicate) for each line in one file that you want to append to another file.

You can efficiently process that as follows:

var filteredLines = 
    File.ReadLines("MySourceFileName")
    .Where(line => line.Contains("Target")); // Put your own condition here.

File.AppendAllLines("MyDestinationFileName", filteredLines);

This approach scales to multiple files and avoids loading the entire file into memory.
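For instance, extending this to several source files might look like the following sketch (the file names and the Contains check are hypothetical placeholders):

```csharp
using System;
using System.IO;
using System.Linq;

class MergeFiltered
{
    static void Main()
    {
        // Hypothetical sample inputs; in practice these would be the
        // existing ~200 MB country files.
        File.WriteAllLines("Country1.txt", new[] { "Target A", "skip me" });
        File.WriteAllLines("Country2.txt", new[] { "skip me too", "Target B" });
        string[] countryFiles = { "Country1.txt", "Country2.txt" };

        // SelectMany chains the lazy File.ReadLines sequences of all files,
        // so no file is ever fully loaded into memory.
        var filteredLines = countryFiles
            .SelectMany(name => File.ReadLines(name))
            .Where(line => line.Contains("Target")); // Put your own condition here.

        File.AppendAllLines("Merged.txt", filteredLines);
    }
}
```

Because both ReadLines and the Where clause are lazy, AppendAllLines pulls one line at a time through the whole pipeline.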

If instead of appending all the lines to a file, you wanted to replace the contents, you'd do:

File.WriteAllLines("MyDestinationFileName", filteredLines);

instead of

File.AppendAllLines("MyDestinationFileName", filteredLines);

Also note that there are overloads of these methods that allow you to specify the encoding, if you are not using UTF8.
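For example, with a UTF-16 source file (the file names are placeholders; Encoding.Unicode is just an illustration):

```csharp
using System.IO;
using System.Linq;
using System.Text;

class EncodingExample
{
    static void Main()
    {
        // Create a hypothetical UTF-16 source file for the example.
        File.WriteAllLines("MySourceFileName", new[] { "Target line", "other" }, Encoding.Unicode);

        // ReadLines has an overload that takes the source encoding.
        var filteredLines = File.ReadLines("MySourceFileName", Encoding.Unicode)
            .Where(line => line.Contains("Target"));

        // The matching AppendAllLines overload controls how the output bytes are encoded.
        File.AppendAllLines("MyDestinationFileName", filteredLines, Encoding.Unicode);
    }
}
```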

Finally, don't be thrown by the inconsistent method naming. File.ReadLines() does not read all lines into memory, but File.ReadAllLines() does. Conversely, File.WriteAllLines() does NOT require all lines to be buffered in memory: it accepts an IEnumerable<string>, so it can consume a lazy sequence.

Upvotes: 2

StepUp

Reputation: 38094

You can write the files one after the other. For example:

static void MergingFiles(string outputFile, params string[] inputTxtDocs)
{
    using (Stream outputStream = File.OpenWrite(outputFile))
    {
      foreach (string inputFile in inputTxtDocs)
      {
        using (Stream inputStream = File.OpenRead(inputFile))
        {
          inputStream.CopyTo(outputStream);
        }
      }
    }
}
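Because of the params parameter, a caller can list the file names directly or expand an existing array. A minimal usage sketch (the file names are hypothetical):

```csharp
using System.IO;

class Program
{
    // The merge method from the answer above.
    static void MergingFiles(string outputFile, params string[] inputTxtDocs)
    {
        using (Stream outputStream = File.OpenWrite(outputFile))
        {
            foreach (string inputFile in inputTxtDocs)
            {
                using (Stream inputStream = File.OpenRead(inputFile))
                {
                    inputStream.CopyTo(outputStream);
                }
            }
        }
    }

    static void Main()
    {
        // Hypothetical inputs; params lets us pass the names directly,
        // or expand an array such as Directory.GetFiles(".", "Country*.txt").
        File.WriteAllText("Country1.txt", "abc\n");
        File.WriteAllText("Country2.txt", "def\n");
        MergingFiles("Merged.dat", "Country1.txt", "Country2.txt");
    }
}
```

Note that because CopyTo moves raw bytes, this variant concatenates whole files; it cannot filter individual lines.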

In my view the above code performs very well, because Stream.CopyTo() uses a very simple algorithm, so it is efficient. A decompiler renders its core as follows:

private void InternalCopyTo(Stream destination, int bufferSize)
{
  int num;
  byte[] buffer = new byte[bufferSize];
  while ((num = this.Read(buffer, 0, buffer.Length)) != 0)
  {
     destination.Write(buffer, 0, num);
  }
}

Upvotes: 5

Mike Ledwards

Reputation: 41

sab669's answer is correct: you want to use a StreamReader and loop over each line of the file. I would suggest writing each file individually, however, as otherwise you will run out of memory pretty quickly with many 200 MB files.

For example:

foreach (string file in files)
{
    List<string> lines = new List<string>();
    string line;
    using (StreamReader reader = new StreamReader(file))
    {
        while ((line = reader.ReadLine()) != null)
        {
            // TODO : Put your conditions in here
            lines.Add(line);
        }
    }
    // TODO : Append your lines here using StreamWriter,
    // then let the list go out of scope before the next file
}
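If even buffering one file's filtered lines is too much, the reader and writer can be combined so that at most one line is in memory at a time. A minimal sketch, where the Contains check and file names stand in for the real condition and inputs:

```csharp
using System.Collections.Generic;
using System.IO;

class StreamingMerge
{
    // One open writer, one open reader, and at most one line in memory at a time.
    static void Merge(string outputFile, IEnumerable<string> files)
    {
        using (StreamWriter writer = new StreamWriter(outputFile, append: true))
        {
            foreach (string file in files)
            {
                using (StreamReader reader = new StreamReader(file))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        if (line.Contains("Target")) // TODO : your real condition
                            writer.WriteLine(line);
                    }
                }
            }
        }
    }

    static void Main()
    {
        // Hypothetical sample inputs.
        File.WriteAllLines("a.txt", new[] { "Target 1", "no" });
        File.WriteAllLines("b.txt", new[] { "no", "Target 2" });
        Merge("merged.txt", new[] { "a.txt", "b.txt" });
    }
}
```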

Upvotes: 3

Related Questions