Patrick
Patrick

Reputation: 65

bash - compare two columns of one file with one column of second file and print matches

I have two different files with around 1000 lines each that are structured like this:

file1: (First Name; Last Name; Address)

Mike;Tyson;First Street 2
Tom;Boyden;Second Street 6
Tom;Cruise;Third Street 9
Mike;Myers;Second Street 4

file2: (First Name Last Name; E-Mail; ID) OR (Last Name First Name;E-Mail; ID)

Mike Tyson;[email protected];45753
Cruise Tom;[email protected];23562
Jennifer Lopez;[email protected];92746
Brady Tom;[email protected];27583

I would like to compare the first two columns of file1 with the ENTIRE first column of file2. If both entries of file1 are present in the first column of file2 (in either order) I want to print the matched line of file1. Then search for the second line of file1 and again compare it to the entire column of file2 and so on.

In file2 the order can be (First Name Last Name) OR (Last Name First Name) and I want to print the matched line in both cases.

Expected Output:

Mike;Tyson;First Street 2
Tom;Cruise;Third Street 9

I'm happy with a solution using awk, grep or anything else.

I've tried the solution of a similar question but the output is empty:

awk -F';' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' file1 file2

Thanks

Upvotes: 2

Views: 689

Answers (2)

Ed Morton
Ed Morton

Reputation: 203209

$ awk -F'[ ;]' '
    { key=($1 > $2 ? $1 FS $2 : $2 FS $1) }
    NR==FNR { a[key]; next }
    key in a
' file1 file2
Mike Tyson;[email protected];45753
Cruise Tom;[email protected];23562

The above uses the common, idiomatic approach to generating a consistent key regardless of the order in which the key components appear by sorting the components before concatenating them to create the key value. When there's only 2 components as in this case a simple comparison is the only sorting required.

Here's why sorting the components of the key is the right approach. Imagine you have 3 components, $1, $2, and $3, instead of just 2. With the approach of testing every combination you need this:

NR==FNR { a[$1,$2,$3]; next }
($1,$2,$3) in a || ($1,$3,$2) in a || ($2,$1,$3) in a ||
($2,$3,$1) in a || ($3,$1,$2) in a || ($3,$2,$1) in a

Try writing that condition for $1 through $4 :-).

By contrast if you use the approach of sorting the components you need this (using GNU awk for built in sort functions for convenience) which is MUCH harder to get wrong (e.g. by forgetting a combination in the comparison):

NR==FNR {
    split($1 FS $2 FS $3,flds)
    asort(flds)
    key = flds[1]
    for (i=2; i in flds; i++) {
        key = key FS flds[i]
    }
    a[key]
    next
}
key in a

Now imagine if you wanted to use $1 through $10 in any order. The "test every combination of components approach" becomes an untenable nightmare while the "sort the components to create the key" approach just means trivially adding fields to the list in the first split() argument.

Upvotes: 4

RavinderSingh13
RavinderSingh13

Reputation: 133428

Could you please try following.

awk '
FNR==NR{
  array[$1,$2]
  next
}
(($1,$2) in array) || (($2,$1) in array)
' FS="[ ;]"  Input_file2  FS=";" Input_file1

Explanation: Adding detailed explanation for above solution.

awk '                                       ##Starting awk program from here.
FNR==NR{                                    ##Checking condition if FNR==NR which will be true when file2 is being read.
  array[$1,$2]                              ##Creating array with index $1,$2 here.
  next                                      ##next will skip all further statement from here.
}
(($1,$2) in array) || (($2,$1) in array)    ##Checking condition if $1,$2 OR $2,$1 is present in array then it will print the line from Input_file1.
' FS="[ ;]"  file2  FS=";" file1            ##Set field separator space or semi-colon for file2 AND set field separator as ; for file1 here.

Upvotes: 2

Related Questions