James Brown
James Brown

Reputation: 37394

Using AWK to merge two files based on multiple columns

I have two CSV files, with ; (semicolon) as a separator, that I need to merge based on three columns on each file using AWK. The key columns are not consecutive. Idea is to get two columns from file B and print them after all the other columns from file A.

File A (keys are in A1, A3, and A5):

A1;A2;A3;A4;A5
K1;D1;K2;D2;K3
K4;D3;K5;D4;K6
K7;D5;K8;D6;K9
K1;D7;K2;D8;K3

File B (keys in B1, B2, B4):

B1;B2;B3;B4;B5
K1;K2;D9;K3;D0
K4;K5;DA;K6;DB
KA;KB;DC;KC;DD

Would produce:

A1;A2;A3;A4;A5;;
K1;D1;K2;D2;K3;D9;D0
K4;D3;K5;D4;K6;DA;DB
K7;D5;K8;D6;K9;;
K1;D7;K2;D8;K3;D9;D0

I have found several examples here in SO (for example How to merge two files based on the first three columns using awk and How to merge two files using AWK?) and elsewhere but I haven't been able to convert them to my needs, as they haven't been documented so well that an AWK n00b like myself would really understand how they work.

Closest I've gotten is:

awk -F \; -v OFS=\; 'FNR==NR{c[$1]=$3 FS $5;next}{ print $0, c[$1]}' B A

But it still leaves out one semicolon--or a column--from output lines 1 and 4:

A1;A2;A3;A4;A5;
K1;D1;K2;D2;K3;D9;D0
K4;D3;K5;D4;K6;DA;DB
K7;D5;K8;D6;K9;
K1;D7;K2;D8;K3;D9;D0

An how do I state which columns I want to use for comparing? Apparently now it's only using first column for comparing.

Upvotes: 0

Views: 3028

Answers (3)

karakfa
karakfa

Reputation: 67467

This will print without the extra ; on unmatched lines. You have to provide B file first.

 awk 'BEGIN {
          OFS=FS=";"
      } 

      FNR==NR {
          key[$1 FS $2 FS $4]=$3 OFS $5
      } 

      FNR!=NR {
          c=$1 FS $3 FS $5; 
          if(c in key) 
               print $0,key[c]; 
          else 
               print
      }'  fileB fileA

if you need the extra delimiters, change the last print to print $0 OFS OFS

Upvotes: 2

user000001
user000001

Reputation: 33317

Not sure if I understood the requirement properly, but this gives the expected output for the given input:

awk -F \; -v OFS=\; 'FNR==NR{c[$1]=$3 FS $5;next}{ print $0, $1 in c ? c[$1] : ";"}' B A
A1;A2;A3;A4;A5;;
K1;D1;K2;D2;K3;D9;D0
K4;D3;K5;D4;K6;DA;DB
K7;D5;K8;D6;K9;;
K1;D7;K2;D8;K3;D9;D0

From the code in the question, I changed the print statement from

print $0, c[$1]

to

print $0, $1 in c ? c[$1] : ";"

Upvotes: 0

FelixJN
FelixJN

Reputation: 570

join -j1 -a1 -t';' <(cut -d';' -f 1,3,5 A | sed -e 's/;//g' | paste -d';' - A | sort ) <(cut -d';' -f 1,2,4 B | sed -e 's/;//g' | paste -d';' - B | sort ) | cut -d';' -f2,3,4,5,6,9,11

#the commands on new lines for readability only
#join command, print all of file A, even if unmatching
join -j1 -a1 -t';'
#input from file A
<(cut -d';' -f 1,3,5 A | sed -e 's/;//g' | paste -d';' - A | sort )
#input from file B
<(cut -d';' -f 1,2,4 B | sed -e 's/;//g' | paste -d';' - B | sort )
#selecting the columns
| cut -d';' -f2,3,4,5,6,9,11

in each case:

1) create a dummy field from the desired columns of file A or B

2) then use paste to create each pseudo file as dummy comparison field; rest of file

3) sort the output for usability with join

4) use join on basis of the dummy field

5) cut the desired columns from the matches join produces

Upvotes: 1

Related Questions