Reputation: 165
I need to compare 2 csv (huge files) row by row and write the difference rows in a separate file. The row in one file can be present anywhere in second file. I need to compare the entire row. Any pointers?
Upvotes: 0
Views: 6978
Reputation: 11301
One common approach is to calculate hash code for each of the rows in one file (preferably the smaller one). Then put entire file into a hashtable. This will be the index of the smaller file.
After that, walk through the larger file. For each row calculate its hash. Then look into the index. If there is no such hash code there, then this row is the difference. Otherwise, if there is such hash code (possibly more than one row will have the same hash there), then perform entire comparison of the source row with all colliding rows in the hash table and see if there is the duplicate.
Now, if there is no duplicate, then the row in the source file is again unique and push it to the output.
Otherwise, if there is a duplicate, you may wish to remove that duplicate from the hash table and skip the input row. That means that the two rows from the two files have been detected as equal and will cancel each other.
When you finish walking through the larger file, you need to decide what to do with the remaining rows in the hashtable. You probably want to push all of them to the output as well, because those were the rows that didn't exist in the other file.
Now I'll try to outline the pseudocode:
dict = new dictionary<code, list<row>>
-- Indexing phase
foreach row in file1
code = hash(row)
if dict.contains(code) then
dict[hash].add(row)
else
dict[hash] = new list(row)
-- Comparison phase
foreach row in file2
code = hash(row)
bool unique = true
if dict.contains(code) then
foreach indexedRow in dict[code]
if indexedRow is the same as row then
begin
unique = false
remove indexedRow from dict[code]
end
if unique then
push row to output
-- Finalization phase
foreach row in dict
push row to output
Greatest quality of this solution is that its run time complexity is O(M + N), where M and N are the number of rows in each of the files. Its drawback is that it takes O(min(M, N)) memory for the index.
Upvotes: 2
Reputation:
Quick and dirty:
private void DoSomething()
{
var lines1 = File.ReadAllLines(@"file1.csv");
var lines2 = File.ReadAllLines(@"file2.csv");
var diff1From2 = FindDifferences(lines1, lines2);
var diff2From1 = FindDifferences(lines2, lines1);
var diffs = new List<string>(diff1From2);
diffs.AddRange(diff2From1);
File.WriteAllLines(@"file3.csv", diffs);
}
private static string[] FindDifferences(string[] linesFirst, string[] linesSecond)
{
return (from line1 in linesFirst
let isLineEqual = linesSecond.Any(line2 => line1 == line2)
where isLineEqual == false
select line1).ToArray();
}
Upvotes: 0