Reputation: 606
Both of my files looks like this
NC_000001.11:g.100007038C>T
NC_000001.11:g.100007039C>A
NC_000001.11:g.100007038C>T NM_001271684.2:c.347C>T NP_001258613.1:p.Thr116Met
NC_000001.11:g.100007038C>T NM_001271685.2:c.473C>T NP_001258614.1:p.Thr158Met
NC_000001.11:g.100007038C>T NM_012243.3:c.347C>T NP_036375.1:p.Thr116Met
NC_000001.11:g.100007039G>A NM_001271684.2:c.348G>A NP_001258613.1:p.Thr116%3D
NC_000001.11:g.100007039G>A NM_001271685.2:c.474G>A NP_001258614.1:p.Thr158%3D
My desired output:
I want to match the first column from file2
to the first column of my file1
. If the match is true, then I want to add to append the third column of file2
to a new column in file1
such as to obtain:
NC_000001.11:g.100007038C>T NP_001258613.1:p.Thr116Met, NP_001258614.1:p.Thr158Met, NP_036375.1:p.Thr116Met
Here's my try:
awk 'BEGIN{ FS=OFS="\t" }
NR==FNR {a[$0]; next;}
{
for (k in a) {
if ($1 == k) {
print $0 "\t" a[$3]
}
}
}' file1.txt file2.txt
But will not produce my desired output:
NC_000001.11:g.100007038C>T NM_001271684.2:c.347C>T NP_001258613.1:p.Thr116Met
NC_000001.11:g.100007038C>T NM_001271685.2:c.473C>T NP_001258614.1:p.Thr158Met
NC_000001.11:g.100007038C>T NM_012243.3:c.347C>T NP_036375.1:p.Thr116Met
NC_000001.11:g.100007039G>A NM_001271684.2:c.348G>A NP_001258613.1:p.Thr116%3D
NC_000001.11:g.100007039G>A NM_001271685.2:c.474G>A NP_001258614.1:p.Thr158%3D
Thank you in advance.
PS: file1
contains unique entries. file2
is sorted, tab-separated and contains more than 3 million entries.
What I meant by tab-separated is that the new column is appended as tab-separated but the values inside that column are comma separated.
Upvotes: 1
Views: 287
Reputation: 37404
Another awk. Your sample code said FS=OFS="\t"
but output ", "
so I used the former latter. Also, your expected output is not met as only one item in file1
is matched in file2
.
$ awk '
BEGIN {
FS=OFS="\t" # delims
}
NR==FNR { # process file1
a[$0] # hash key only to see if values in file2
next
}
($1 in a) { # if found in file1
a[$1]=a[$1] (a[$1]==""?"":", ") $3 # append to the corresponding item
}
END { # in the end
for(i in a)
if(a[i]!="") # print all non-empty ones
print i,a[i]
}' file1 file2
Output with your data:
NC_000001.11:g.100007038C>T NP_001258613.1:p.Thr116Met NP_001258614.1:p.Thr158Met NP_036375.1:p.Thr116Met
Upvotes: 2
Reputation: 133518
Could you please try following, written and tested with shown samples.
awk '
BEGIN{
OFS=", "
}
FNR==NR{
array[$1]=(array[$1]?array[$1] OFS:"")$NF
next
}
($1 in array){
print $1"\t"array[$1]
}
' Input_file2 Input_file1
In case you want to save output into Input_file1 then try following:
awk '
BEGIN{
OFS=", "
}
FNR==NR{
array[$1]=(array[$1]?array[$1] OFS:"")$NF
next
}
($1 in array){
print $1"\t"array[$1]
}
' Input_file2 Input_file1 > temp && mv temp Input_file1
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this awk program from here.
OFS=", " ##Setting OFS as comma space here.
}
FNR==NR{ ##Checking condition FNR==NR which will be true when Input_file2 is being read.
array[$1]=(array[$1]?array[$1] OFS:"")$NF ##Creating array with index $1 and value is last field of line.
next ##next will skip all further statements from here.
}
($1 in array){ ##Checking condition if 1st field of current line is present in array then do following.
print $1"\t"array[$1] ##Printing first column TAB and then value of array with index $1 here.
}
' Input_file2 Input_file1 ##Mentioning Input_file names here.
Upvotes: 2