Pyrodancer123

Reputation: 129

Count number of identical lines in two files using awk

I have a long series of files, some of which have lines in common. I am trying to use awk to count the lines that differ between two files and then store that number in a variable for use outside of awk.

Here is what my awk code currently looks like:

awk 'NR==FNR{a[$1FS$2]=$0;next}  {print  (!a[$1FS$2]?$0:"")}' C6H6_1651.com  C6H6_1652.com  | awk 'END { print NR }'

What I get out is 32 which is the number of lines in each of those files. I know from looking at those files that the desired output should be 2, as there are only two lines that are different between the two files.

Other arrangements of these awk commands that I have tried are:

awk 'NR==FNR{!a[$1FS$2]?$0:"";next}  END { print NR }' C6H6_1651.com C6H6_1652.com

which outputs 64

awk 'NR==FNR{a[$1FS$2]=$0;next}  {print  (!a[$1FS$2]?$0:"")} END { printf NR }' C6H6_1651.com C6H6_1652.com

which outputs one line for every line in the second file, but the only lines that contain text are the ones that don't match between the two files. The number 64 is then printed after this block of text.

Here are the contents of C6H6_1651.com

%chk=C6H6_1651.chk
%nproc=20
# mp2/cc-pVTZ

C6H6_1651

0 1
 C            0.000000000     1.394800000     0.000000000 
 C            0.000000000    -1.394800000     0.000000000 
 C            1.207900000     0.697400000     0.000000000 
 C           -1.207900000     0.697400000     0.000000000 
 C           -1.207900000    -0.697400000     0.000000000 
 C            1.207900000    -0.697400000     0.000000000 
 C            0.000000000     1.394800000     3.000000000 
 C            0.000000000    -1.394800000     3.000000000 
 C            1.207900000     0.697400000     3.000000000 
 C           -1.207900000     0.697400000     3.000000000 
 C           -1.207900000    -0.697400000     3.000000000 
 C            1.207900000    -0.697400000     3.000000000 
 H            0.000000000     2.482200000     0.000000000 
 H            2.149700000     1.241100000     0.000000000 
 H           -2.149700000     1.241100000     0.000000000 
 H           -2.149700000    -1.241100000     0.000000000 
 H            2.149700000    -1.241100000     0.000000000 
 H            0.000000000    -2.482200000     0.000000000 
 H            0.000000000     2.482200000     3.000000000 
 H            2.149700000     1.241100000     3.000000000 
 H           -2.149700000     1.241100000     3.000000000 
 H           -2.149700000    -1.241100000     3.000000000 
 H            2.149700000    -1.241100000     3.000000000 
 H            0.000000000    -2.482200000     3.000000000 

Here are the contents of C6H6_1652.com

%chk=C6H6_1652.chk
%nproc=20
# mp2/cc-pVTZ

C6H6_1652

0 1
 C            0.000000000     1.394800000     0.000000000 
 C            0.000000000    -1.394800000     0.000000000 
 C            1.207900000     0.697400000     0.000000000 
 C           -1.207900000     0.697400000     0.000000000 
 C           -1.207900000    -0.697400000     0.000000000 
 C            1.207900000    -0.697400000     0.000000000 
 C            0.000000000     1.394800000     3.000000000 
 C            0.000000000    -1.394800000     3.000000000 
 C            1.207900000     0.697400000     3.000000000 
 C           -1.207900000     0.697400000     3.000000000 
 C           -1.207900000    -0.697400000     3.000000000 
 C            1.207900000    -0.697400000     3.000000000 
 H            0.000000000     2.482200000     0.000000000 
 H            2.149700000     1.241100000     0.000000000 
 H           -2.149700000     1.241100000     0.000000000 
 H           -2.149700000    -1.241100000     0.000000000 
 H            2.149700000    -1.241100000     0.000000000 
 H            0.000000000    -2.482200000     0.000000000 
 H            0.000000000     2.482200000     3.000000000 
 H            2.149700000     1.241100000     3.000000000 
 H           -2.149700000     1.241100000     3.000000000 
 H           -2.149700000    -1.241100000     3.000000000 
 H            2.149700000    -1.241100000     3.000000000 
 H            0.000000000    -2.482200000     3.000000000 

Upvotes: 0

Views: 139

Answers (1)

RavinderSingh13

Reputation: 133710

If you want to do this in awk, try the following. It prints the lines of Input_file2 that are also present in Input_file1, i.e. the lines the two files have in common.

awk '
FNR==NR{
  array[$0]
  next
} 
($0 in array)
' Input_file1  Input_file2

Or, to get the number of matching lines within awk itself, try:

awk '
FNR==NR{
  array[$0]
  next
} 
($0 in array){
  count++
}
END{
  print "Total matching lines are: " (count+0)
}
' Input_file1  Input_file2

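To print the lines which are present in Input_file2 and not in Input_file1 (the array is built from the first file named on the command line, and the second file is checked against it), try: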

awk '
FNR==NR{
  array[$0]
  next
} 
!($0 in array)
' Input_file1  Input_file2

OR

awk '
FNR==NR{
  array[$0]
  next
} 
!($0 in array){
  count++
}
END{
  print "Total lines found in file2 and NOT in file1 are: " (count+0)
}
' Input_file1  Input_file2
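Since the question asks for the number in a variable outside of awk, here is a minimal sketch using shell command substitution. It counts the lines of the second file that do not occur in the first. The temporary directory, the sample file names file1.txt and file2.txt, and the variable name ndiff are all hypothetical and only there for illustration; substitute your own .com files.

```shell
# Hypothetical sample files standing in for the two real inputs.
tmpdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$tmpdir/file1.txt"
printf 'alpha\nBETA\ngamma\ndelta\n' > "$tmpdir/file2.txt"

# Count the lines of file2 that never occur in file1 and keep
# only the bare number, so it can be stored in a shell variable.
# count+0 forces a numeric 0 when nothing differs (an unset
# count would otherwise print as an empty string).
ndiff=$(awk '
FNR==NR{
  array[$0]
  next
}
!($0 in array){
  count++
}
END{
  print count+0
}
' "$tmpdir/file1.txt" "$tmpdir/file2.txt")

echo "$ndiff"
rm -rf "$tmpdir"
```

With these sample files this prints 2 (BETA and delta). Printing only the number, without a message, is what makes the value directly usable in later shell arithmetic or tests.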

To get the lines which are present in Input_file1 and not in Input_file2, swap the order of the files so that the array is built from Input_file2:

awk '
FNR==NR{
  array[$0]
  next
} 
!($0 in array)
' Input_file2  Input_file1

OR

awk '
FNR==NR{
  array[$0]
  next
} 
!($0 in array){
  count++
}
END{
  print "Total lines found in file1 and NOT in file2 are: " (count+0)
}
' Input_file2  Input_file1

The solutions above that have no END block print the lines themselves. If you only need the number of lines, append | wc -l to those commands.
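As a rough illustration of the wc -l approach, and of why the pipeline in the question reports the full line count: a print that emits an empty string for matching lines still writes one output line per input line, so the downstream counter sees every line of the second file. The sample files and the variable names nfiltered and nall below are hypothetical.

```shell
# Hypothetical sample files; exactly one line differs.
tmpdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$tmpdir/file1.txt"
printf 'alpha\nBETA\ngamma\n' > "$tmpdir/file2.txt"

# Only non-matching lines are printed, so wc -l counts just those.
# tr strips the padding that some wc implementations add.
nfiltered=$(awk 'FNR==NR{array[$0]; next} !($0 in array)' \
  "$tmpdir/file1.txt" "$tmpdir/file2.txt" | wc -l | tr -d ' ')

# Here print runs for every line of file2: matching lines become
# empty lines, but they are still lines, so all of them are counted.
nall=$(awk 'FNR==NR{array[$0]; next} {print (($0 in array) ? "" : $0)}' \
  "$tmpdir/file1.txt" "$tmpdir/file2.txt" | wc -l | tr -d ' ')

echo "$nfiltered $nall"
rm -rf "$tmpdir"
```

With these sample files this prints 1 3: the filtered form counts the single differing line, while the print-empty-string form counts all three lines of the second file.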

Upvotes: 2
