user324810
user324810

Reputation: 606

AWK: keep appending information to the same row in a new column to file1 from file2 by matching a particular column

Both of my files looks like this

file1

NC_000001.11:g.100007038C>T
NC_000001.11:g.100007039C>A

file2

NC_000001.11:g.100007038C>T     NM_001271684.2:c.347C>T     NP_001258613.1:p.Thr116Met
NC_000001.11:g.100007038C>T     NM_001271685.2:c.473C>T     NP_001258614.1:p.Thr158Met
NC_000001.11:g.100007038C>T     NM_012243.3:c.347C>T        NP_036375.1:p.Thr116Met
NC_000001.11:g.100007039G>A     NM_001271684.2:c.348G>A     NP_001258613.1:p.Thr116%3D
NC_000001.11:g.100007039G>A     NM_001271685.2:c.474G>A     NP_001258614.1:p.Thr158%3D

My desired output:

I want to match the first column from file2 to the first column of my file1. If the match is true, then I want to add to append the third column of file2 to a new column in file1 such as to obtain:

file1

NC_000001.11:g.100007038C>T     NP_001258613.1:p.Thr116Met, NP_001258614.1:p.Thr158Met, NP_036375.1:p.Thr116Met

Here's my try:

awk 'BEGIN{ FS=OFS="\t" }
NR==FNR {a[$0]; next;}
{
        for (k in a) {
            if ($1 == k) {
                print $0 "\t" a[$3]
            }
        }
}' file1.txt file2.txt

But will not produce my desired output:

NC_000001.11:g.100007038C>T     NM_001271684.2:c.347C>T     NP_001258613.1:p.Thr116Met
NC_000001.11:g.100007038C>T     NM_001271685.2:c.473C>T     NP_001258614.1:p.Thr158Met
NC_000001.11:g.100007038C>T     NM_012243.3:c.347C>T        NP_036375.1:p.Thr116Met
NC_000001.11:g.100007039G>A     NM_001271684.2:c.348G>A     NP_001258613.1:p.Thr116%3D
NC_000001.11:g.100007039G>A     NM_001271685.2:c.474G>A     NP_001258614.1:p.Thr158%3D

Thank you in advance.

PS: file1 contains unique entries. file2 is sorted, tab-separated and contains more than 3 million entries.

Edit:

What I meant by tab-separated is that the new column is appended as tab-separated but the values inside that column are comma separated.

Upvotes: 1

Views: 287

Answers (2)

James Brown
James Brown

Reputation: 37404

Another awk. Your sample code said FS=OFS="\t" but output ", " so I used the former latter. Also, your expected output is not met as only one item in file1 is matched in file2.

$ awk '
BEGIN {
    FS=OFS="\t"                         # delims
}
NR==FNR {                               # process file1
    a[$0]                               # hash key only to see if values in file2
    next
}
($1 in a) {                             # if found in file1
    a[$1]=a[$1] (a[$1]==""?"":", ") $3  # append to the corresponding item
}
END {                                   # in the end
    for(i in a) 
        if(a[i]!="")                    # print all non-empty ones
            print i,a[i]
}' file1 file2

Output with your data:

NC_000001.11:g.100007038C>T     NP_001258613.1:p.Thr116Met      NP_001258614.1:p.Thr158Met      NP_036375.1:p.Thr116Met

Upvotes: 2

RavinderSingh13
RavinderSingh13

Reputation: 133518

Could you please try following, written and tested with shown samples.

awk '
BEGIN{
  OFS=", "
}
FNR==NR{
  array[$1]=(array[$1]?array[$1] OFS:"")$NF
  next
}
($1 in array){
  print $1"\t"array[$1]
}
'  Input_file2   Input_file1

In case you want to save output into Input_file1 then try following:

awk '
BEGIN{
  OFS=", "
}
FNR==NR{
  array[$1]=(array[$1]?array[$1] OFS:"")$NF
  next
}
($1 in array){
  print $1"\t"array[$1]
}
'  Input_file2   Input_file1 > temp && mv temp Input_file1

Explanation: Adding detailed explanation for above code here.

awk '                                                ##Starting awk program from here.
BEGIN{                                               ##Starting BEGIN section of this awk program from here.
  OFS=", "                                           ##Setting OFS as comma space here.
}
FNR==NR{                                             ##Checking condition FNR==NR which will be true when Input_file2 is being read.
  array[$1]=(array[$1]?array[$1] OFS:"")$NF          ##Creating array with index $1 and value is last field of line.
  next                                               ##next will skip all further statements from here.
}
($1 in array){                                       ##Checking condition if 1st field of current line is present in array then do following.
  print $1"\t"array[$1]                              ##Printing first column TAB and then value of array with index $1 here.
}
'  Input_file2  Input_file1                          ##Mentioning Input_file names here.

Upvotes: 2

Related Questions