prasanta T

Reputation: 1

Comparing two files column by column in unix shell

I need to compare two files column by column in a Unix shell and store the differences in a result file.

For example, if column 1 of the first record in file 1 matches column 1 of the first record in file 2, the result file should contain '=' for that column. If the values differ, both values should be printed in the result file instead.

Below is the exact requirement.

File 1:

id  code name  place 
123 abc  Tom   phoenix
345 xyz  Harry seattle
675 kyt  Romil newyork

File 2:

id  code name  place
123 pkt  Rosy  phoenix
345 xyz  Harry seattle
421 uty  Romil Sanjose

Expected resulting file:

id_1 id_2 code_1 code_2 name_1 name_2 place_1 place_2
=    =    abc    pkt    Tom    Rosy   =       =
=    =    =      =      =      =      =       =
675  421  kyt    uty    =      =      newyork Sanjose

Columns are tab delimited.

Upvotes: 0

Views: 712

Answers (1)

sjnarv

Reputation: 2384

This is rather crudely coded, but shows a way to use awk to emit what you want, and can handle files of identical "schema" - not just the particular 4-field files you give as tests.

This approach uses pr to do a simple merge of the files: the same line of each input file is concatenated to present one line to the awk script.
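
To make the merge step concrete, here is a minimal sketch of what pr hands to awk. The two sample files and the tmpdir and merged variables are made up for illustration, and the behaviour described assumes GNU pr, as in the testing note for this answer.

```shell
#!/bin/sh
# Build two tiny tab-delimited sample files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'id\tcode\n123\tabc\n' > "$tmpdir/f1"
printf 'id\tcode\n123\tpkt\n' > "$tmpdir/f2"

# -m merges the files line by line, -t drops page headers and trailers,
# -s (with no character given) joins the columns with a single tab.
merged=$(pr -s -t -m "$tmpdir/f1" "$tmpdir/f2")
printf '%s\n' "$merged"

rm -rf "$tmpdir"
```

Each output line should hold the fields of file 1 followed by the fields of file 2, all tab separated, so the first half of the fields always belongs to the first file.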

The awk script assumes clean input, and uses the fact that if a variable n has the value 2, the value of $n in the script is the same as $2. So, the script walks through pairs of fields using the i and j variables: i indexes a field from the first file and j = i + offset indexes the matching field from the second file. For your test input, fields 1 and 5, then 2 and 6, etc., are processed.
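
The field pairing can be tried on its own. The sketch below feeds awk one hand-written merged line (two fields per file rather than four, purely to keep it short) and prints which fields get paired. The line and pairs variable names are only for this illustration.

```shell
#!/bin/sh
# One merged line: fields 1-2 come from file 1, fields 3-4 from file 2.
line=$(printf '123\tabc\t123\tpkt')

pairs=$(printf '%s\n' "$line" | awk '
  {
    offset = int(NF/2)
    for (i = 1; i <= offset; i++) {
      j = i + offset
      # $i and $j are looked up using the current values of i and j.
      printf "field %d (%s) vs field %d (%s)\n", i, $i, j, $j
    }
  }
')
printf '%s\n' "$pairs"
```

With two fields per file, offset is 2, so field 1 is paired with field 3 and field 2 with field 4.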

Only very limited testing of input is performed: mainly, that the implied schema of the two input files (the names of columns/fields) is the same.

#!/bin/sh

[ $# -eq 2 ] || { echo "Usage: ${0##*/} <file1> <file2>" 1>&2; exit 1; }
[ -r "$1" -a -r "$2" ] || { echo "$1 or $2: cannot read" 1>&2; exit 1; }

set -e

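# Merge the two files side by side: line N of each file becomes one
# tab-separated line of input for awk.
pr -s -t -m "$@" | \
awk '
  {
    # First half of the fields is from file 1, second half from file 2.
    offset = int(NF/2)
    tab = ""
    for (i = 1; i <= offset; i++) {
      j = i + offset
      if (NR == 1) {
        # Header line: column names must match between the two files.
        if ($i != $j) {
          printf "\nColumn name mismatch (%s/%s)\n", $i, $j > "/dev/stderr"
          exit 1
        }
        printf "%s%s_1\t%s_2", tab, $i, $j
      } else if ($i == $j) {
        # Same value in both files: emit "=" for both columns.
        printf "%s=\t=", tab
      } else {
        # Values differ: emit both of them.
        printf "%s%s\t%s", tab, $i, $j
      }
      tab = "\t"
    }
    printf "\n"
  }
'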

Tested on Linux: GNU Awk 4.1.0 and pr (GNU coreutils) 8.21.
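
As a rough worked run on the data from the question, the sketch below wraps the same pr and awk pipeline in a function and applies it to the two sample files. The function name compare_columns and the temporary file names are hypothetical. Unlike the script above, this variant sets the awk field separator to a tab, on the assumption that the input really is strictly tab delimited as stated in the question, and it leaves out the header check for brevity.

```shell
#!/bin/sh
# compare_columns is a hypothetical wrapper around the pr | awk pipeline.
# Assumes strictly tab-delimited input with the same columns in both files.
compare_columns() {
  pr -s -t -m "$1" "$2" | awk -F '\t' '
    {
      offset = int(NF/2)
      sep = ""
      for (i = 1; i <= offset; i++) {
        j = i + offset
        if (NR == 1)       printf "%s%s_1\t%s_2", sep, $i, $j
        else if ($i == $j) printf "%s=\t=", sep
        else               printf "%s%s\t%s", sep, $i, $j
        sep = "\t"
      }
      printf "\n"
    }
  '
}

# Recreate the two sample files from the question.
tmpdir=$(mktemp -d)
printf 'id\tcode\tname\tplace\n123\tabc\tTom\tphoenix\n345\txyz\tHarry\tseattle\n675\tkyt\tRomil\tnewyork\n' > "$tmpdir/file1"
printf 'id\tcode\tname\tplace\n123\tpkt\tRosy\tphoenix\n345\txyz\tHarry\tseattle\n421\tuty\tRomil\tSanjose\n' > "$tmpdir/file2"

result=$(compare_columns "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

Splitting on tabs keeps a value that contains spaces in a single field, which the default whitespace splitting would not. The trade-off is that stray trailing spaces in a column are then treated as part of the value.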

Upvotes: 1
