Aruna
Aruna

Reputation: 165

How to compare two large CSV files and get the difference file

I need to compare 2 csv (huge files) row by row and write the difference rows in a separate file. The row in one file can be present anywhere in second file. I need to compare the entire row. Any pointers?

Upvotes: 0

Views: 6978

Answers (2)

Zoran Horvat
Zoran Horvat

Reputation: 11301

One common approach is to calculate hash code for each of the rows in one file (preferably the smaller one). Then put entire file into a hashtable. This will be the index of the smaller file.

After that, walk through the larger file. For each row calculate its hash. Then look into the index. If there is no such hash code there, then this row is the difference. Otherwise, if there is such hash code (possibly more than one row will have the same hash there), then perform entire comparison of the source row with all colliding rows in the hash table and see if there is the duplicate.

Now, if there is no duplicate, then the row in the source file is again unique and push it to the output.

Otherwise, if there is a duplicate, you may wish to remove that duplicate from the hash table and skip the input row. That means that the two rows from the two files have been detected as equal and will cancel each other.

When you finish walking through the larger file, you need to decide what to do with the remaining rows in the hashtable. You probably want to push all of them to the output as well, because those were the rows that didn't exist in the other file.

Now I'll try to outline the pseudocode:

dict = new dictionary<code, list<row>>

-- Indexing phase
foreach row in file1
    code = hash(row)
    if dict.contains(code) then
        dict[hash].add(row)
    else
        dict[hash] = new list(row)

-- Comparison phase
foreach row in file2
    code = hash(row)
    bool unique = true
    if dict.contains(code) then
        foreach indexedRow in dict[code]
            if indexedRow is the same as row then
                begin
                    unique = false
                    remove indexedRow from dict[code]
                end
    if unique then
        push row to output

-- Finalization phase
foreach row in dict
    push row to output

Greatest quality of this solution is that its run time complexity is O(M + N), where M and N are the number of rows in each of the files. Its drawback is that it takes O(min(M, N)) memory for the index.

Upvotes: 2

user6522773
user6522773

Reputation:

Quick and dirty:

private void DoSomething()
{
  var lines1 = File.ReadAllLines(@"file1.csv");
  var lines2 = File.ReadAllLines(@"file2.csv");

  var diff1From2 = FindDifferences(lines1, lines2);
  var diff2From1 = FindDifferences(lines2, lines1);

  var diffs = new List<string>(diff1From2);
  diffs.AddRange(diff2From1);
  File.WriteAllLines(@"file3.csv", diffs);
}

private static string[] FindDifferences(string[] linesFirst, string[] linesSecond)
{
  return (from line1 in linesFirst
    let isLineEqual = linesSecond.Any(line2 => line1 == line2)
    where isLineEqual == false
    select line1).ToArray();
}

Upvotes: 0

Related Questions