UDITA SINGLA
UDITA SINGLA

Reputation: 13

Comparing two CSV files in linux

I have two CSV files with me in the following format:

File1:

No.1, No.2
983264,72342349
763498,81243970
736493,83740940

File2:

No.1,No.2
"7938493","7364987"
"2153187","7387910"
"736493","83740940"

I need to compare the two files and output the matched,unmatched values. I did it through awk:

#!/bin/bash

awk 'BEGIN {
    FS = OFS = ","
}
if (FNR==1){next}
NR>1 && NR==FNR {
    a[$1];
    next
}
FNR>1 {
    print ($1 in a) ? $1 FS "Match" : $1 FS "In file2 but not in file1"
    delete a[$1]
}
END {
    for (x in a) {
        print x FS "In file1 but not in file2"
    }
}'file1 file2

But the output is:

"7938493",In file2 but not in file1
"2153187",In file2 but not in file1
"8172470",In file2 but not in file1
7938493,In file1 but not in file2
2153187,In file1 but not in file2
8172470,In file1 but not in file2

Can you please tell me where I am going wrong?

Upvotes: 1

Views: 3899

Answers (1)

Tom Fenech
Tom Fenech

Reputation: 74615

Here are some corrections to your script:

BEGIN {
    # FS = OFS = ","
    FS = "[,\"]+"
    OFS = ", "
}
# if (FNR==1){next}
FNR == 1 {next}

# NR>1 && NR==FNR {
NR==FNR {
    a[$1];
    next
}
# FNR>1 {
$2 in a {
    # print ($1 in a) ? $1 FS "Match" : $1 FS "In file2 but not in file1"
    print ($2 in a) ? $2 OFS "Match" : $2 "In file2 but not in file1"
    delete a[$2]
}
END {
    for (x in a) {
        print x, "In file1 but not in file2"
    }
}

This is an awk script, so you can run it like awk -f script.awk file1 file2. Doing so gives these results:

$ awk -f script.awk file1 file2
736493, Match
763498, In file1 but not in file2
983264, In file1 but not in file2

The main problem with your script was that it didn't correctly handle the double quotes around the numbers in file2. I changed the input field separator so that the double quotes are treated as part of the separator to deal with this. As a result, the first field $1 in the second file is empty (it is the bit between the start of the line and the first "), so you need to use $2 to refer to the first value you're interested in. Aside from that, I removed some redundant conditions from your other blocks and used OFS rather than FS in your first print statement.

Upvotes: 2

Related Questions