geomarine

Reputation: 11

Why does awk matching on two fields across two files return extra rows?

I have two test files, t.xyz and a.xyz, with three columns each. a.xyz has more rows than t.xyz. I would like to output the rows of a.xyz whose $1 and $2 match $1 and $2 of a row in t.xyz. The total number of output rows should equal the number of rows in t.xyz. This works on the test files, but when I apply it to large files the output has more rows than t.xyz. Any help fixing this would be appreciated.

I use the following:

awk 'FNR==NR{a[$1];b[$2];next} $1 in a && $2 in b' t.xyz a.xyz > out.xyz

t.xyz
1907.05604682 2983.53399456 -5435.67749023
1908.05607621 2983.53399456 -3593.08154297
1910.05613499 2983.53399456 -1238.71289063
1911.05616438 2983.53399456 -4244.93823242
1912.05619377 2983.53399456 -3595.24414063
1913.05622316 2983.53399456 -2454.96728516
1923.05651706 2983.53399456 NaN

a.xyz
1907.05604682 2983.53399456 35.67749023
1908.05607621 2983.53399456 93.08154297
1910.05613499 2983.53399456 38.71289063
1911.05616438 2983.53399456 44.93823242
1912.05619377 2983.53399456 95.24414063
1913.05622316 2983.53399456 54.96728516
1923.05651706 2983.53399456 NaN
631.018545121 2646.58662319 24.715881348
635.018662681 2646.58662319 27.13696289

expected out.xyz
1907.05604682 2983.53399456 35.67749023
1908.05607621 2983.53399456 93.08154297
1910.05613499 2983.53399456 38.71289063
1911.05616438 2983.53399456 44.93823242
1912.05619377 2983.53399456 95.24414063
1913.05622316 2983.53399456 54.96728516
1923.05651706 2983.53399456 NaN

Upvotes: 0

Views: 24

Answers (1)

karakfa

Reputation: 67467

$ awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' file1 file2

1907.05604682 2983.53399456 35.67749023
1908.05607621 2983.53399456 93.08154297
1910.05613499 2983.53399456 38.71289063
1911.05616438 2983.53399456 44.93823242
1912.05619377 2983.53399456 95.24414063
1913.05622316 2983.53399456 54.96728516
1923.05651706 2983.53399456 NaN
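
As for why the original command over-matches: it stores $1 and $2 in two separate arrays, so a row of the second file passes whenever its $1 occurs on any row of the first file and its $2 occurs on any row, not necessarily the same one. A minimal sketch of the difference follows, using made-up miniature inputs (the file names t_demo.xyz and a_demo.xyz and their contents are assumptions for illustration, not the asker's data):

```shell
# Hypothetical demo files in a scratch directory (not the asker's data).
tmp=$(mktemp -d)
printf '1 10 x\n2 20 y\n'         > "$tmp/t_demo.xyz"   # key rows
printf '1 10 A\n1 20 B\n2 20 C\n' > "$tmp/a_demo.xyz"   # candidate rows

# Separate arrays: $1 and $2 are tested independently, so "1 20 B" passes
# even though the pair (1, 20) never occurs together in t_demo.xyz.
separate=$(awk 'FNR==NR{a[$1];b[$2];next} $1 in a && $2 in b' \
           "$tmp/t_demo.xyz" "$tmp/a_demo.xyz")

# Composite key: the pair ($1,$2) must occur on one and the same key row.
paired=$(awk 'NR==FNR{a[$1,$2];next} ($1,$2) in a' \
         "$tmp/t_demo.xyz" "$tmp/a_demo.xyz")

printf 'separate arrays:\n%s\n\ncomposite key:\n%s\n' "$separate" "$paired"
```

On a small test file the two approaches can agree by coincidence; with many rows the chance of such cross-row combinations grows, which would be consistent with the extra output seen on the large file.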

If the ($1,$2) pairs in file2 are not guaranteed to be unique, however, every matching row is printed, which can also give more output rows than file1 has. To print only the first matching row for each pair:

$ awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a{print; delete a[$1,$2]}' file1 file2

Alternatively, print all matches but flag the duplicates by appending an occurrence count to the second and later matches of the same pair:

$ awk 'NR==FNR      {a[$1,$2]; next} 
       ($1,$2) in a {c=++a[$1,$2]; print $0, ((c>1)?c:"") }' file1 file2

The same duplicate check can also be run afterwards, on the generated output file.
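
A rough sketch of such an after-the-fact check, assuming the output is a whitespace-separated file keyed on its first two columns (out_demo.xyz and its contents are hypothetical stand-ins for the real out.xyz):

```shell
# Hypothetical output file in a scratch directory; "1 10" appears twice.
tmp=$(mktemp -d)
printf '1 10 A\n2 20 C\n1 10 D\n' > "$tmp/out_demo.xyz"

# seen[...]++ == 1 is true only on the second occurrence of a pair,
# so each repeated ($1,$2) key is reported exactly once.
dups=$(awk 'seen[$1,$2]++ == 1 {print $1, $2}' "$tmp/out_demo.xyz")

printf 'repeated keys:\n%s\n' "$dups"
```

If nothing is reported and the row count still differs from file1, the cause is more likely unmatched keys than duplicates.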

Upvotes: 2
