How do I compare two files in unix based on their columns

Question

I am fairly new to unix commands, but i have two .csv files where i would like to compare the first column either with diff or comm. Every line is different, if i were to compare the whole line, thats why i want to compare the first column in each file and then have the difference printed out in numbers where the landcode sould not be counted more than once. The first file has also has a header i want to skip when it compares.

sample from file1:

 iso_code,continent,location,date,total_cases 
 AND,Denver ,America,2020-07-26,897.0
 ABW,Copenhagen Denmark,,2020-03-13,2.0
 AFG,Oslo,Norway,2020-09-06,324.0
 AZE,Hamburg,Germany,2020-03-30,29.0

sample from file2:

AND,Denver ,America,2020-07-26,897.0
ABW,Copenhagen Denmark,,2020-03-13,5.0
ABW,Chil Ukrain,Aruba,2020-10-06,4449.0
ALB,Upsala,Sweden,2020-08-275.0,
AFG,Afghanistan,,2020-09-06,324.0

The expected output should be "2", as there are two occurrences of the same land code in the two files. Duplicates of the contry code sould only be counted one time. That is why expected out should be 2 and not 3

I have tried multiple solutions:

awk 'NR==FNR{c[$1]++;next};c[$1] == 0' owid-covid-data-filtered.csv owid-covid-data.csv | wc -l

with the awk i get output: 1

and

 diff owid-covid-data.csv owid-covid-data-filtered.csv |cut -d' ' -f1 owid-covid-data-filtered.csv| wc -l

overall i want the occurrences that are similar in both file1 and file2 column 1

Bodo · Accepted Answer

From the condition c[$1] == 0 in the awk script from the question I assumed you want to print lines from file2 that contain a code that is not present in file1.

As it is clarified now, that you want to count the codes that are present in both files, see below at the end of the answer for the reverse check.

Slight modifications to your script will fix the problems:

awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1]++ == 0' file1 file2

Option -F , specifies comma (,) as field separator.

The condition if(NR!=1)c[$1]++; skips the header line in file1.

The post-increment operator in c[$1]++ == 0 will make the condition fail for the second or later occurrence of the same code in file2.

I omit the trailing | wc -l here to show the output lines.

I modified file2 to contain two lines with the same code in column 1 that is not present in file1.

With file2 shown here

AND,Europe,Andorra,2020-07-26,897.0
ABW,North America,Aruba,2020-03-13,2.0
ABW,North America,Aruba,2020-10-06,4079.0
ALB,Europe,Albania,2020-08-23,8275.1
ALB,Europe,Albania,2020-08-23,8275.2
AFG,Asia,Afghanistan,2020-09-06,38324.0
AFG,Asia,Afghanistan,2020-09-06,38324.0

and file1 from the question I get this output:

AND,Europe,Andorra,2020-07-26,897.0
ALB,Europe,Albania,2020-08-23,8275.1

(Only the first line with ALB is printed`.)

You can also implemente the counting in awk instead of using wc -l.

awk -F , 'NR==FNR { if(NR!=1)c[$1]++; next } c[$1]++ == 0 {count++} END {print count}' file1 file2

If you want to print the lines from file2 that contain a code that is present in file1, the script can be modified like this:

awk -F, 'NR==FNR { if(NR!=1)c[$1]++; next} c[$1] { c[$1]=0; print}' file1 file2

This prints

ABW,North America,Aruba,2020-03-13,2.0
AFG,Asia,Afghanistan,2020-09-06,38324.0

(The first line with code ABW.)

Alternative solution as requested in a comment.

tail -n +2 file1|cut -f1 -d,|sort -u>code1
cut -f1 -d, file2|sort -u>code2
fgrep -vf code1 code2
rm code1 code2

Or combined in one command without using temporary files code1 and code2:

fgrep -f <(tail -n +2 file1|cut -f1 -d,|sort -u) <(cut -f1 -d, file2|sort -u)

Add | wc -l to count the lines instead of printing them.

Explanation:

tail -n +2 print everything starting from the 2nd line
cut -f1 -d, print the first field, delimited with ,
sort -u sort lines and remove duplicates
fgrep -f code1 code2 print all lines from code2 that contain any of the strings from code1

How do I compare two files in unix based on their columns

Answers (2)

Related Questions