Reputation: 191
I know this question has been asked several times before. Here is one example:
Using AWK to merge two files based on multiple columns
My goal is to print out columns 2, 4, 5 and 7 of file_b and columns 17 and 18 of file_a if the following match occurs: Columns 2, 6 and 7 of file_a.csv matches with Columns 2, 4 and 5 of file_b.csv respectively.
But no matter how much I try, I can't get it to work for my case. Here are my two files:
file_a.csv
col2, col6, col7, col17, col18
a, b, c, 145, 88
e, f, g, 101, 96
x, y, z, 243, 222
file_b.csv
col2, col4, col5, col7
a, b, c, 4.5
e, f, g, 6.3
x, k, l, 12.9
Output should look like this:
col2, col4, col5, col7, col17, col18
a, b, c, 4.5, 145, 88
e, f, g, 6.3, 101, 96
I tried this:
awk -F, -v RS='\r\n' 'NR==FNR{key[$2 FS $6 FS $7]=$17 FS $18;next} {if($2 FS $4 FS $5 in key); print $2 FS $4 FS $5 FS $7 FS key[$2 FS $6 FS $7]}' file_a.csv file_b.csv > out.csv
Currently the output I am getting is:
col2, col4, col5, col7,
a, b, c, 4.5,
e, f, g, 6.3,
In other words, col17 and col18 from file_a is not showing up.
Yesterday I asked a related question where I was having issues with line breaks. That got answered and solved but now I think this problem is related to checking the if condition.
Update: I am sharing links to truncated copies of the actual data. The only difference between these files and the actual ones are that the real ones have millions of rows. These ones only have 10 each.
Upvotes: 2
Views: 410
Reputation: 8791
Since you have mentioned that the file is huge, you can give a try to Perl, if that is an option.
The files are assumed to have "\r".
$ cat file_a.csv
col2, col6, col7, col17, col18
a, b, c, 145, 88
e, f, g, 101, 96
x, y, z, 243, 222
$ cat file_b.csv
col2, col4, col5, col7
a, b, c, 4.5
e, f, g, 6.3
x, k, l, 12.9
$ perl -F, -lane 'BEGIN { %kv=map{chomp;chop;@a=split(",");"$a[0],$a[1],$a[2]"=>"$a[3]"} qx(cat file_b.csv) } if($.>1){ $x="$F[0],$F[1],$F[2]";chomp($F[-1]);print "$x,$kv{$x}",join(",",@F[-2,-1]) if $kv{$x} } ' file_a.csv
a, b, c, 4.5 145, 88
e, f, g, 6.3 101, 96
$
Upvotes: 0
Reputation: 5262
Please try this (GNU sed):
awk 'BEGIN{RS="\r\n";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}'
This is the time BEGIN
block kicks in. Also OFS
kicks in.
When we are printing out many fields which separated by same thing, we can set OFS
, and simply put comma between the things we want to print.
There's no need to check key in arr
when you've assigned value for a key in the array,
by default, when arr[somekey]
isn't assigned before, it's empty
/""
, and it evaluates to false
in awk (0
in scalar context), and a non-empty string is evaluates to true
(There's no literally true
and false
in awk
).
(You used wrong array
name, the $2,$6,$7
is the key in the array arr
here. It's confusing to use key
as array name.)
You can test some simple concept like this:
awk 'BEGIN{print arr["newkey"]}'
You don't need a input file to execute BEGIN
block.
Also, you can use quotes sometimes, to avoid confusion and underlying problem.
Update:
Your files actually ends in \n
, if you can't be sure what the line ending is, use this:
awk 'BEGIN{RS="\r\n|\n|\r";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv
or this (This one will ignore empty lines):
awk 'BEGIN{RS="[\r\n]+";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv
Also, it's better to convert first to avoid such situations, by:
sed -i 's/\r//' files
Or you can use dos2unix
command:
dos2unix file
It's a handy commandline tool do above thing only.
You can install it if you don't have it in your system yet.
Once converted, you don't need to assign RS
in normal situations.
Upvotes: 1
Reputation: 67567
$ awk 'BEGIN {RS="\r\n"; FS=OFS=","}
NR==FNR {a[$2,$6,$7]=$17 OFS $18; next}
($2,$4,$5) in a {print $2,$4,$5,$7,a[$2,$4,$5]}' file1 file2 > output
Your main issue is, in the array lookup the index you should use is the second file key, not the first file key. Also the semicolon after the if condition is wrong. The rest is cosmetics only.
Not sure you want the output \r\n
terminated, if so set ORS=RS
as well, otherwise it's newline only.
Upvotes: 0