Mishal Ahmed
Mishal Ahmed

Reputation: 191

Using AWK to merge two files based on multiple conditions

I know this question has been asked several times before. Here is one example:

Using AWK to merge two files based on multiple columns

My goal is to print out columns 2, 4, 5 and 7 of file_b and columns 17 and 18 of file_a if the following match occurs: Columns 2, 6 and 7 of file_a.csv matches with Columns 2, 4 and 5 of file_b.csv respectively.

But no matter how much I try, I can't get it to work for my case. Here are my two files:

file_a.csv

col2, col6, col7, col17, col18
a, b, c, 145, 88
e, f, g, 101, 96
x, y, z, 243, 222

file_b.csv

col2, col4, col5, col7
a, b, c, 4.5
e, f, g, 6.3
x, k, l, 12.9

Output should look like this:

col2, col4, col5, col7, col17, col18
a, b, c, 4.5, 145, 88
e, f, g, 6.3, 101, 96

I tried this:

awk -F, -v RS='\r\n' 'NR==FNR{key[$2 FS $6 FS $7]=$17 FS $18;next} {if($2 FS $4 FS $5 in key); print $2 FS $4 FS $5 FS $7 FS key[$2 FS $6 FS $7]}' file_a.csv file_b.csv > out.csv

Currently the output I am getting is:

col2, col4, col5, col7,
a, b, c, 4.5,
e, f, g, 6.3,

In other words, col17 and col18 from file_a is not showing up.

Yesterday I asked a related question where I was having issues with line breaks. That got answered and solved but now I think this problem is related to checking the if condition.

Update: I am sharing links to truncated copies of the actual data. The only difference between these files and the actual ones are that the real ones have millions of rows. These ones only have 10 each.

file_a.csv

file_b.csv

Upvotes: 2

Views: 410

Answers (3)

stack0114106
stack0114106

Reputation: 8791

Since you have mentioned that the file is huge, you can give a try to Perl, if that is an option.

The files are assumed to have "\r".

$ cat file_a.csv
col2, col6, col7, col17, col18
a, b, c, 145, 88
e, f, g, 101, 96
x, y, z, 243, 222
$ cat file_b.csv
col2, col4, col5, col7
a, b, c, 4.5
e, f, g, 6.3
x, k, l, 12.9
$ perl -F, -lane 'BEGIN { %kv=map{chomp;chop;@a=split(",");"$a[0],$a[1],$a[2]"=>"$a[3]"} qx(cat file_b.csv) } if($.>1){ $x="$F[0],$F[1],$F[2]";chomp($F[-1]);print "$x,$kv{$x}",join(",",@F[-2,-1]) if $kv{$x} } ' file_a.csv
a, b, c, 4.5 145, 88
e, f, g, 6.3 101, 96
$

Upvotes: 0

Tyl
Tyl

Reputation: 5262

Please try this (GNU sed):

awk 'BEGIN{RS="\r\n";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}'

This is the time BEGIN block kicks in. Also OFS kicks in.
When we are printing out many fields which separated by same thing, we can set OFS, and simply put comma between the things we want to print.

There's no need to check key in arr when you've assigned value for a key in the array,
by default, when arr[somekey] isn't assigned before, it's empty/"", and it evaluates to false in awk (0 in scalar context), and a non-empty string is evaluates to true (There's no literally true and false in awk).
(You used wrong array name, the $2,$6,$7 is the key in the array arr here. It's confusing to use key as array name.)

You can test some simple concept like this:

awk 'BEGIN{print arr["newkey"]}'

You don't need a input file to execute BEGIN block.

Also, you can use quotes sometimes, to avoid confusion and underlying problem.

Update: Your files actually ends in \n, if you can't be sure what the line ending is, use this:

awk 'BEGIN{RS="\r\n|\n|\r";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv

or this (This one will ignore empty lines):

awk 'BEGIN{RS="[\r\n]+";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv

Also, it's better to convert first to avoid such situations, by:

sed -i 's/\r//' files

Or you can use dos2unix command:

dos2unix file

It's a handy commandline tool do above thing only.
You can install it if you don't have it in your system yet.
Once converted, you don't need to assign RS in normal situations.

Upvotes: 1

karakfa
karakfa

Reputation: 67567

$ awk 'BEGIN      {RS="\r\n"; FS=OFS=","}
       NR==FNR    {a[$2,$6,$7]=$17 OFS $18; next} 
  ($2,$4,$5) in a {print $2,$4,$5,$7,a[$2,$4,$5]}' file1 file2 > output

Your main issue is, in the array lookup the index you should use is the second file key, not the first file key. Also the semicolon after the if condition is wrong. The rest is cosmetics only.

Not sure you want the output \r\n terminated, if so set ORS=RS as well, otherwise it's newline only.

Upvotes: 0

Related Questions