Reputation: 530
I am trying to replace values in column 1 of file1 using a lookup table. A sample (tab separated):
chr1 1243 A T 0.14
chr5 1432 G C 0.0006
chr10 731 T C 0.9421
chr11 98234 T G .000032
chr12 1284 A T 0.93428
chr17 941 G T 0.1111
chr19 134325 T C 0.00001
chr21 9824 T C 0.9
Lookup table:
chr1 NC_000001.11
chr2 NC_000002.12
chr3 NC_000003.12
chr4 NC_000004.12
chr5 NC_000005.10
chr6 NC_000006.12
chr7 NC_000007.14
chr8 NC_000008.11
chr9 NC_000009.12
chr10 NC_000010.11
chr11 NC_000011.10
chr12 NC_000012.12
chr13 NC_000013.11
chr14 NC_000014.9
chr15 NC_000015.10
chr16 NC_000016.10
chr17 NC_000017.11
chr18 NC_000018.10
chr19 NC_000019.10
chr20 NC_000020.11
chr21 NC_000021.9
chr22 NC_000022.11
script being used:
awk 'FNR==NR{a[$1]=$2;next} {for (i in a)sub(i,a[i]);print' lookup.txt file1 > new_table.txt
output with comment on which line is correct/incorrect (with right answer in brackets):
NC_000001.11 1243 A T 0.14 #correct
NC_000005.10 1432 G C 0.0006 #correct
NC_000001.110 731 T C 0.9421 #incorrect (NC_000010.11)
NC_000001.111 98234 T G .000032 #incorrect (NC_000011.10)
NC_000012.12 1284 A T 0.93428 #correct
NC_000001.117 941 G T 0.1111 #incorrect (NC_000017.11)
NC_000001.119 134325 T C 0.00001 #incorrect (NC_000019.10)
NC_000021.9 9824 T C 0.9 #correct
I don't understand the pattern of why it isn't working and would welcome any help with the awk script. I thought it was just those with double digits e.g. chr17 but then chr21 seems to work fine.
Many thanks
Upvotes: 1
Views: 552
Reputation: 157967
Shouldn't it be:
awk 'FNR==NR{a[$1]=$2;next}{$1=a[$1]}1' lookup.txt file1
?
Output:
NC_000001.11 1243 A T 0.14
NC_000005.10 1432 G C 0.0006
NC_000010.11 731 T C 0.9421
NC_000011.10 98234 T G .000032
NC_000012.12 1284 A T 0.93428
NC_000017.11 941 G T 0.1111
NC_000019.10 134325 T C 0.00001
NC_000021.9 9824 T C 0.9
Explanation:
# true as long as we are reading the first file, lookup.txt
FNR==NR {
# create a lookup array 'a' indexed by field 1 of lookup txt
a[$1]=$2
# don't process further actions
next
}
# because of the 'next' statement above, this will be only executed
# when we are processing the second file, file1
{
# translate field 1. use the value from the lookup array
$1=a[$1]
}
# always true. print the line
1
PS: If there's the possibility that entries can't be found in the lookup table, you could use a special text for them:
awk 'FNR==NR{a[$1]=$2;next}{$1=($1 in a)?a[$1]:"NOT FOUND "$1}1' lookup.txt file1
Upvotes: 4
Reputation: 12877
Looks like sub is causing the issue and so simply prepend the value for the index specified by $1 to the line with a space and print the line with short hand 1 and so:
awk 'FNR==NR{a[$1]=$2;next} {$0=a[$1]" "$0 }1' lookup.txt file1 > new_table.txt
Upvotes: 1
Reputation: 133518
I believe sub
could be the problem in OP's attempt, not checked thoroughly, this could be done simply by:
awk 'FNR==NR{arr[$1]=$2;next} ($1 in arr){first=$1;$1="";print arr[first],$0}' lookup_table Input_file
Problem with OP's attempt(Only for understanding purposes NOT to be run to get shown samples results): Though OP's code shown one doesn't look like complete one to dig it out why its giving wrong output as per OP's question, I have written it as follows.
awk 'FNR==NR{a[$1]=$2;next} {for (i in a){line=$0;if(sub(i,a[i])){print (Previous line)line">>>(array key)"i"....(array value)"a[i]"............(new line)"$0}}}' lookup_table Input_file
So whenever a proper substitution happens then only its printing the line as follows, where we could see whats going wrong with OP's code.
chr1 1243 A T 0.14 chr1 1243 A T 0.14 >>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.11 1243 A T 0.14
chr5 1432 G C 0.0006chr5 1432 G C 0.0006>>>(array key)chr5....(array value)NC_000005.10............(new line)NC_000005.10 1432 G C 0.0006
chr10 731 T C 0.9421chr10 731 T C 0.9421>>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.110 731 T C 0.9421
chr11 98234 T G .000032chr11 98234 T G .000032>>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.111 98234 T G .000032
chr12 1284 A T 0.93428chr12 1284 A T 0.93428>>>(array key)chr12....(array value)NC_000012.12............(new line)NC_000012.12 1284 A T 0.93428
chr17 941 G T 0.1111chr17 941 G T 0.1111>>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.117 941 G T 0.1111
chr19 134325 T C 0.00001chr19 134325 T C 0.00001>>>(array key)chr1....(array value)NC_000001.11............(new line)NC_000001.119 134325 T C 0.00001
chr21 9824 T C 0.9chr21 9824 T C 0.9>>>(array key)chr21....(array value)NC_000021.9............(new line)NC_000021.9 9824 T C 0.9
Where we could easily see
old line
from chr1 1243 A T 0.14 chr1 1243 A T 0.14
to becomes like NC_000001.11 1243 A T 0.14
that's because array key(chr1) get substituted with array value (NC_000001.11) If you see output shown above for understanding purposes.
Upvotes: 2