test1 1

Reputation: 47

Compare two files & output differences (including Line Number and content) from both files

I am trying to list the differences between two files, showing the line number and content of each differing line, either in another file or on stdout. My attempt below does not produce the exact output I want.

File contents:

File1:

Col1,Col2,Col3
Text1,text1,text1
Text2,text2,Rubbish

File2:

Col1,Col2,Col3
Text1,text1,text1
Text2,text2,text2
Text3,text3,text3

I tried the following command, but it does not produce the desired output: it shows only the difference in the first file and not the extra line in file2.

sort file1 file2 | uniq | awk 'FNR==NR{ a[$1]; next } !($1 in a) {print FNR": "$0}' file2 file1

Output

3: Text2,text2,Rubbish

Desired Output

3: Text2,text2,Rubbish (File1)
3: Text2,text2,text2 (File2)
4: Text3,text3,text3 (File2)

I do NOT wish to use diff/sdiff/comm, because I cannot add line numbers to their output or organise the data side by side for ease of reading. Real files would have more than 1000 lines, which makes the output of the diff/sdiff utilities even harder to read.

Upvotes: 3

Views: 576

Answers (2)

RavinderSingh13

Reputation: 133518

With your shown samples, please try the following awk code. Written and tested in GNU awk.

awk '
BEGIN { OFS=": " }
FNR==1{ next     }
FNR==NR{
  arr[$0]=FNR
  next
}
!($0 in arr){
  print FNR,$0" ("FILENAME")"
  next
}
{
  arr1[$0]
}
END{
  for(key in arr){
    if(!(key in arr1)){
      print arr[key],key" ("ARGV[1]")"
    }
  }
}
' file1 file2

Explanation: a detailed, line-by-line explanation of the above.

awk '                                   ##Start of the awk program.
BEGIN { OFS=": " }                      ##Set OFS to colon plus space in the BEGIN section.
FNR==1{ next     }                      ##Skip the first line (FNR==1) of both files.
FNR==NR{                                ##FNR==NR is true only while the first file is being read.
  arr[$0]=FNR                           ##Store the current line as an index of arr, with its line number as the value.
  next                                  ##Skip the remaining rules for this line.
}
!($0 in arr){                           ##The current line is NOT in arr, i.e. it is in file2 but not in file1.
  print FNR,$0" ("FILENAME")"           ##Print line number, line and file name, as the OP requested.
  next                                  ##Skip the remaining rules for this line.
}
{
  arr1[$0]                              ##Record the current line (present in both files) as an index of arr1.
}
END{                                    ##Start of the END section.
  for(key in arr){                      ##Traverse arr, which holds the lines of file1.
    if(!(key in arr1)){                 ##The key is NOT in arr1, i.e. the line is in file1 but not in file2.
      print arr[key],key" ("ARGV[1]")"  ##Print its line number, its content and the first file name.
    }
  }
}
' file1 file2                           ##Input file names.
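
Note that the program above matches lines by content, so a line that only moved to a different position is not reported, and the order in which the END loop prints is not guaranteed. As a rough sketch rather than part of the answer, a purely positional comparison (line N of one file against line N of the other) could look like the following. The function name linediff is hypothetical, and the sketch assumes that the first file is not empty and that it fits in memory.

```shell
# Hypothetical helper: report lines that differ at the same line number.
linediff() {
    awk '
    # First file: remember every line by its line number (n = last line seen).
    FNR == NR { a[FNR] = $0; n = FNR; next }

    # Second file: compare with the same-numbered line of the first file.
    {
        m = FNR
        if (FNR > n) {
            # Extra line that exists only in the second file.
            print FNR ": " $0 " (" FILENAME ")"
        } else if (($0 "") != a[FNR]) {
            # Appending "" forces a string comparison, not a numeric one.
            print FNR ": " a[FNR] " (" ARGV[1] ")"
            print FNR ": " $0 " (" FILENAME ")"
        }
    }

    # Lines that exist only at the end of the first file.
    END { for (i = m + 1; i <= n; i++) print i ": " a[i] " (" ARGV[1] ")" }
    ' "$1" "$2"
}
```

Called as linediff File1 File2 on the sample data from the question, this should print the three lines of the desired output in the requested order. Unlike diff, it does not realign after an inserted line, so every line following an insertion is reported as changed.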

Upvotes: 3

jhnc

Reputation: 16752

You can get your desired output using GNU diff + awk:

funkydiff(){
   # configure diff to output:
   #    - 1 or 2 as flag character to identify file
   #    - space
   #    - line number formatted as integer
   #    - colon, space
   #    - the line content including terminating newline
   diff -d \
       --old-line-format='1 %dn: %L' \
       --new-line-format='2 %dn: %L' \
       --unchanged-line-format='' \
       -- "$1" "$2" \
   | awk '
      # p starts off null
      # initialise filename lookup table
      !p { f[1]=a; f[2]=b }

      {
         # discard first 2 characters of input line
         # append bracketed filename if flag character changed
         # print the result
         print substr($0,3) ($1!=p?" ("f[$1]")":"")

         # update p ready for next line
         p=$1
      }
   ' a="$1" b="$2"
}

funkydiff File1 File2

GNU diff does most of the hard work.

awk checks (and deletes) the filename identifier prefix that diff adds at the start of each line and appends the appropriate filename in brackets where necessary.
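
To see what awk actually receives, the diff stage can be run on its own. This is only an illustrative sketch: it assumes GNU diff, and the name rawdiff is made up here.

```shell
# Hypothetical helper: the diff stage of funkydiff without the awk post-processing.
# Each output line starts with a flag (1 = old file, 2 = new file) and a space,
# followed by the line number, a colon, a space and the original line.
rawdiff() {
    diff -d \
        --old-line-format='1 %dn: %L' \
        --new-line-format='2 %dn: %L' \
        --unchanged-line-format='' \
        -- "$1" "$2"
}
```

With the sample files from the question, rawdiff File1 File2 should print:

1 3: Text2,text2,Rubbish
2 3: Text2,text2,text2
2 4: Text3,text3,text3

The awk script reads the flag from $1 and strips it, together with the following space, using substr($0,3).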


With your amended (simpler) requirement to print the filenames on every line and not just the first of each group, the awk can be simplified:

funkydiff2(){
   diff -d \
      --old-line-format='1 %dn: %L' \
      --new-line-format='2 %dn: %L' \
      --unchanged-line-format='' \
      -- "$1" "$2" \
   | awk '
      !p { f[1]=a; f[2]=b }

      # always append bracketed filename
      # update p as a side-effect (just for brevity)
      { print substr($0,3) " (" f[p=$1] ")" }
   ' a="$1" b="$2"
}

In fact, awk is not required at all if the filenames don't contain the special % character:

funkydiff3(){
    case "$1$2" in
        *%*)
            echo 1>&2 "ERROR: funky filename. Aborting"
            return 1
            ;;
    esac

    # now $1 and $2 cannot contain the % metacharacter
    # %c'\012' produces newline
    # alternatively you could embed literal newlines
    diff -d \
        --old-line-format="%dn: %l ($1)%c'\012'" \
        --new-line-format="%dn: %l ($2)%c'\012'" \
        --unchanged-line-format="" \
        -- "$1" "$2"
}

Because the filenames become embedded directly in the format strings passed to diff, if they contained any % the format string would either change to something unintended, become malformed, or cause the filenames to be mangled on output.
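
Instead of rejecting such filenames, the % characters could be escaped, since GNU diff treats %% in a line format as a literal %. The following is a hedged sketch of that idea, not something from the answer above; the name funkydiff4 is hypothetical, and the sketch assumes GNU diff and filenames that do not end in a newline.

```shell
# Hypothetical variant of funkydiff3: escape % as %% instead of aborting.
# The body runs in a subshell ( ... ) so that old and new do not leak out.
funkydiff4() (
    # Double every % so that diff prints it literally.
    old=$(printf '%s' "$1" | sed 's/%/%%/g')
    new=$(printf '%s' "$2" | sed 's/%/%%/g')

    # The escaped names go into the format strings;
    # the real names are still used to open the files.
    diff -d \
        --old-line-format="%dn: %l ($old)%c'\012'" \
        --new-line-format="%dn: %l ($new)%c'\012'" \
        --unchanged-line-format="" \
        -- "$1" "$2"
)
```

For example, funkydiff4 '100%.csv' 'new%.csv' would label each differing line with (100%.csv) or (new%.csv).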

The man-page for GNU diff contains details of the % sequences that it allows in format strings.

Upvotes: 2
