AWK: keep appending information to the same row in a new column to file1 from file2 by matching a particular column

Question

Both of my files looks like this

file1

NC_000001.11:g.100007038C>T
NC_000001.11:g.100007039C>A

file2

NC_000001.11:g.100007038C>T     NM_001271684.2:c.347C>T     NP_001258613.1:p.Thr116Met
NC_000001.11:g.100007038C>T     NM_001271685.2:c.473C>T     NP_001258614.1:p.Thr158Met
NC_000001.11:g.100007038C>T     NM_012243.3:c.347C>T        NP_036375.1:p.Thr116Met
NC_000001.11:g.100007039G>A     NM_001271684.2:c.348G>A     NP_001258613.1:p.Thr116%3D
NC_000001.11:g.100007039G>A     NM_001271685.2:c.474G>A     NP_001258614.1:p.Thr158%3D

My desired output:

I want to match the first column from file2 to the first column of my file1. If the match is true, then I want to add to append the third column of file2 to a new column in file1 such as to obtain:

file1

NC_000001.11:g.100007038C>T     NP_001258613.1:p.Thr116Met, NP_001258614.1:p.Thr158Met, NP_036375.1:p.Thr116Met

Here's my try:

awk 'BEGIN{ FS=OFS="	" }
NR==FNR {a[$0]; next;}
{
        for (k in a) {
            if ($1 == k) {
                print $0 "	" a[$3]
            }
        }
}' file1.txt file2.txt

But will not produce my desired output:

NC_000001.11:g.100007038C>T     NM_001271684.2:c.347C>T     NP_001258613.1:p.Thr116Met
NC_000001.11:g.100007038C>T     NM_001271685.2:c.473C>T     NP_001258614.1:p.Thr158Met
NC_000001.11:g.100007038C>T     NM_012243.3:c.347C>T        NP_036375.1:p.Thr116Met
NC_000001.11:g.100007039G>A     NM_001271684.2:c.348G>A     NP_001258613.1:p.Thr116%3D
NC_000001.11:g.100007039G>A     NM_001271685.2:c.474G>A     NP_001258614.1:p.Thr158%3D

Thank you in advance.

PS: file1 contains unique entries. file2 is sorted, tab-separated and contains more than 3 million entries.

Edit:

What I meant by tab-separated is that the new column is appended as tab-separated but the values inside that column are comma separated.

RavinderSingh13 · Accepted Answer

Could you please try following, written and tested with shown samples.

awk '
BEGIN{
  OFS=", "
}
FNR==NR{
  array[$1]=(array[$1]?array[$1] OFS:"")$NF
  next
}
($1 in array){
  print $1"	"array[$1]
}
'  Input_file2   Input_file1

In case you want to save output into Input_file1 then try following:

awk '
BEGIN{
  OFS=", "
}
FNR==NR{
  array[$1]=(array[$1]?array[$1] OFS:"")$NF
  next
}
($1 in array){
  print $1"	"array[$1]
}
'  Input_file2   Input_file1 > temp && mv temp Input_file1

Explanation: Adding detailed explanation for above code here.

awk '                                                ##Starting awk program from here.
BEGIN{                                               ##Starting BEGIN section of this awk program from here.
  OFS=", "                                           ##Setting OFS as comma space here.
}
FNR==NR{                                             ##Checking condition FNR==NR which will be true when Input_file2 is being read.
  array[$1]=(array[$1]?array[$1] OFS:"")$NF          ##Creating array with index $1 and value is last field of line.
  next                                               ##next will skip all further statements from here.
}
($1 in array){                                       ##Checking condition if 1st field of current line is present in array then do following.
  print $1"	"array[$1]                              ##Printing first column TAB and then value of array with index $1 here.
}
'  Input_file2  Input_file1                          ##Mentioning Input_file names here.

AWK: keep appending information to the same row in a new column to file1 from file2 by matching a particular column

file1

file2

file1

Edit:

Answers (2)

Related Questions