Reputation: 166
I have 2 large ( say 300,000 * 100) rectangular csv files which have same number of corresponding rows and columns. I need to find the differences if any between each corresponding cell in the 2 files. Program should o/p the row and cell number that are different and the contents that are different.
Since the number of rows/columns is very large , I'm looking for the most efficient way to do this.
Initially I started exploring with awk , and this looked promising , but I haven't worked with awk before and was unsuccessful in extending this to 100 columns instead of 2 as in the example
Next , I tried a sort of brute force approach using Java - Load the files into 2 2-D arrays. Initialize 100 threads, each working on a given column , and when a difference is found, each thread puts in the row,cell and diff values into a HashMap (with column number as Key). I did try optimizing it by comparing while reading the second file into the array , but really since I'm visiting each cell, there is no way it can be fast (Took almost 8 hours to complete the comparison)
I'm ok with either awk or Java. And open to any other entirely different approach.
Upvotes: 1
Views: 2871
Reputation: 6289
univocity-parsers' CSV parser won't take much longer than 5 seconds to process this:
public void diff(File leftInput, File rightInput) {
CsvParserSettings settings = new CsvParserSettings(); //many config options here, check the tutorial
CsvParser leftParser = new CsvParser(settings);
CsvParser rightParser = new CsvParser(settings);
leftParser.beginParsing(leftInput);
rightParser.beginParsing(rightInput);
String[] left;
String[] right;
int row = 0;
while ((left = leftParser.parseNext()) != null && (right = rightParser.parseNext()) != null) {
row++;
if (!Arrays.equals(left, right)) {
System.out.println(row + ":\t" + Arrays.toString(left) + " != " + Arrays.toString(right));
}
}
leftParser.stopParsing();
rightParser.stopParsing();
}
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
Upvotes: 1
Reputation: 67487
awk
to the rescue!
It's a no-brainer to do this in awk
$ paste -d, file.1 file.2 |
awk -F, '{m=NF/2; for(i=1;i<=m;i++) if($i!=$(i+m)) print NR,i,$i,$(i+m)}'
prints "row# column# left right" values of the unequal cells.
If you want to print a key column in addition to the row number, you can add it easily
$ paste -d, file.1 file.2 |
awk -F, -v key=8 '{m=NF/2;
for(i=1;i<=m;i++)
if($i!=$(i+m)) print $key,NR,i,$i,$(i+m)}'
Upvotes: 4
Reputation: 203483
Something to consider:
$ cat file1
1,2,aa
1,2,3
1,bb,3
1,2,3
$ cat file2
1,2,cc
1,2,3
1,dd,3
1,2,3
$ diff file1 file2 |
awk -F, '
/^[0-9]/ { row=$0+0; next }
sub(/^< /,"") { split($0,a); next }
sub(/^> /,"") { for (col=1;col<=NF;col++) if ($col != a[col]) print row, col, a[col], $col }
'
1 3 aa cc
3 2 bb dd
That should be extremely fast since it only executes awk and does the loop on the lines that have differences, not all lines.
Upvotes: 1