Reputation: 166

Efficient way to find the differences in 2 csv files

I have 2 large ( say 300,000 * 100) rectangular csv files which have same number of corresponding rows and columns. I need to find the differences if any between each corresponding cell in the 2 files. Program should o/p the row and cell number that are different and the contents that are different.

Since the number of rows/columns is very large , I'm looking for the most efficient way to do this.

Initially I started exploring with awk , and this looked promising , but I haven't worked with awk before and was unsuccessful in extending this to 100 columns instead of 2 as in the example

Next , I tried a sort of brute force approach using Java - Load the files into 2 2-D arrays. Initialize 100 threads, each working on a given column , and when a difference is found, each thread puts in the row,cell and diff values into a HashMap (with column number as Key). I did try optimizing it by comparing while reading the second file into the array , but really since I'm visiting each cell, there is no way it can be fast (Took almost 8 hours to complete the comparison)

I'm ok with either awk or Java. And open to any other entirely different approach.

Upvotes: 1

Answers (3)

Jeronimo Backes

Reputation: 6289

univocity-parsers' CSV parser won't take much longer than 5 seconds to process this:

public void diff(File leftInput, File rightInput) {
    CsvParserSettings settings = new CsvParserSettings(); //many config options here, check the tutorial

    CsvParser leftParser = new CsvParser(settings);
    CsvParser rightParser = new CsvParser(settings);

    leftParser.beginParsing(leftInput);
    rightParser.beginParsing(rightInput);

    String[] left;
    String[] right;

    int row = 0;
    while ((left = leftParser.parseNext()) != null && (right = rightParser.parseNext()) != null) {
        row++;
        if (!Arrays.equals(left, right)) {
            System.out.println(row + ":\t" + Arrays.toString(left) + " != " + Arrays.toString(right));
        }
    }

    leftParser.stopParsing();
    rightParser.stopParsing();
}

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Upvotes: 1

karakfa

Reputation: 67487

awk to the rescue!

It's a no-brainer to do this in awk

$ paste -d, file.1 file.2 | 
  awk -F, '{m=NF/2; for(i=1;i<=m;i++) if($i!=$(i+m)) print NR,i,$i,$(i+m)}'

prints "row# column# left right" values of the unequal cells.

If you want to print a key column in addition to the row number, you can add it easily

$ paste -d, file.1 file.2 | 
  awk -F, -v key=8 '{m=NF/2; 
                     for(i=1;i<=m;i++) 
                       if($i!=$(i+m)) print $key,NR,i,$i,$(i+m)}'

Upvotes: 4

Ed Morton

Reputation: 203483

Something to consider:

$ cat file1
1,2,aa
1,2,3
1,bb,3
1,2,3

$ cat file2
1,2,cc
1,2,3
1,dd,3
1,2,3

$ diff file1 file2 |
awk -F, '
    /^[0-9]/ { row=$0+0; next }
    sub(/^< /,"") { split($0,a); next }
    sub(/^> /,"") { for (col=1;col<=NF;col++) if ($col != a[col]) print row, col, a[col], $col }
'
1 3 aa cc
3 2 bb dd

That should be extremely fast since it only executes awk and does the loop on the lines that have differences, not all lines.

Upvotes: 1

Efficient way to find the differences in 2 csv files

Answers (3)

Related Questions