Reputation: 49
I want to replace multiple strings (more than a thousand) in File-1 with the matching strings from File-2.
File-1:
Geneid Length s1 s2
1_1 6571 7 8
1_2 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
1_5 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
2_4 1056 5 1
File-2 (map):
1_1
1_2 k0002
1_3
1_4
1_5 k0006
2_1
2_2
2_3
2_4 k0528
Expected output:
Geneid Length s1 s2
1_1 6571 7 8
k0002 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
k0006 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
k0528 1056 5 1
I used the following awk command:
awk '
NR==FNR {
    a[$1]=$2
    next
}
{
    print (($1 in a)?a[$1]:$1, $2, $3, $4)
}' File-2 File-1 > File-3
which gives me this:
Geneid Length s1 s2
6571 7 8
k0002 5041 3 0
1032 7 3
1212 3 5
k0006 1071 3 5
7171 2 7
1038 1 1
9361 0 6
k0528 1056 5 1
How can I modify this awk command to keep the unmatched strings?
Sorry, I am a newbie to Linux and awk (trying to learn).
Upvotes: 2
Views: 293
Reputation: 203655
$ awk '
NR==FNR { if (NF>1) a[$1]=$2; next }
$1 in a { $1=a[$1] }
1' file2 file1
Geneid Length s1 s2
1_1 6571 7 8
k0002 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
k0006 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
k0528 1056 5 1
if (NF>1) efficiently ensures you only populate a[] with the values from file2 that you need, i.e. those that have a 2nd field.
$1 in a ensures you only change $1 from file1 when an associated entry existed in file2. Do not test a[$1]=="" or anything similar instead, as that will populate a[] for every $1 in file1 and so use up memory and increase execution time.
The 1 at the end causes the current, possibly just-modified, line from file1 to be printed.
Upvotes: 1
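The warning above about testing a[$1]=="" can be checked in isolation. The following is only a small sketch, not part of the original answer: the array contents and the helper function count() are made up for illustration. It counts the array entries after a membership test and again after a value test.

```shell
# Hypothetical demo: start with one entry and count entries after each kind of test.
out=$(awk '
function count(arr,    k, n) {      # portable way to count array entries
    n = 0
    for (k in arr) n++
    return n
}
BEGIN {
    a["1_2"] = "k0002"
    found = ("1_1" in a)            # membership test: does not create a["1_1"]
    before = count(a)
    empty = (a["1_1"] == "")        # value test: referencing a["1_1"] creates it
    after = count(a)
    print before, after
}')
printf '%s\n' "$out"                # prints "1 2"
```

The second number is larger because merely referencing a["1_1"] adds that key with an empty value, which is the extra memory use the answer refers to.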
Reputation: 189427
The expression ($1 in a)?a[$1]:$1 prints either a[$1] or $1 depending on whether $1 is a key in a. But all your keys are in a, so, for example, for the key 1_1, it prints the empty string, which is the value of a["1_1"]. The solution is to only populate a when there is a value to add for the key in $1.
awk 'NR==FNR { if (NF > 1) a[$1]=$2; next }
{ print (($1 in a)?a[$1]:$1, $2, $3, $4) }' File-2 File-1
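To see the behaviour described above in isolation, here is a minimal sketch (the two-line map is a cut-down, hypothetical version of File-2). It uses the unconditional assignment from the question and then reports whether each key is present and what value it holds.

```shell
# Hypothetical two-line map: the first key has no value, the second does.
out=$(printf '1_1\n1_2 k0002\n' | awk '
{ a[$1] = $2 }                      # unconditional assignment, as in the question
END {
    # "in" only tests for the key, so a key with an empty value still counts as present
    printf "%s [%s]\n", (("1_1" in a) ? "present" : "absent"), a["1_1"]
    printf "%s [%s]\n", (("1_2" in a) ? "present" : "absent"), a["1_2"]
}')
printf '%s\n' "$out"
# prints:
# present []
# present [k0002]
```

Both keys are reported as present, so the ternary always picks a[$1], which is empty for 1_1.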
For debugging a script like yours, it helps to add print statements at various points to see what the script is doing. Here's what I ended up doing to figure out what was wrong with your script.
# STILL BUGGY, DEBUGGING RUN
awk 'NR==FNR { print("a[" $1 "]=" $2); a[$1]=$2; next; }
{ print ($1 in a ? a[$1] : $1), $2, $3, $4, ($1 in a), a[$1], $1, ($1 in a ? "yes" : "no"), "end" }' File-2 File-1
Upvotes: 2
Reputation: 5252
Given that File-2 won't be empty:
awk 'NR==FNR{a[$1]=$2;next}a[$1]!=""{$1=a[$1]}1' File-2 File-1
Geneid Length s1 s2
1_1 6571 7 8
k0002 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
k0006 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
k0528 1056 5 1
If it can be empty, you can replace NR==FNR with ARGIND==1 (GNU awk only) or with FILENAME=="File-2" (which works in any awk).
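As a rough illustration of why an empty map file matters, the sketch below runs the same program twice against hypothetical temporary files (the paths and the two-line data file are made up). It uses FILENAME==ARGV[1], a variant of the FILENAME test that avoids hard-coding the file name; this is a swapped-in idiom, not something from the answer itself.

```shell
# Hypothetical temp files: an empty map and a two-line data file.
tmp=$(mktemp -d)
: > "$tmp/map"
printf 'Geneid Length\n1_2 5041\n' > "$tmp/data"

# With an empty first file, NR==FNR is also true while reading the second file,
# so every data line is consumed by the first block and nothing is printed.
nr_out=$(awk 'NR==FNR { if (NF>1) a[$1]=$2; next }
              $1 in a { $1=a[$1] }
              1' "$tmp/map" "$tmp/data")

# Comparing FILENAME with ARGV[1] identifies the map file even when it is empty.
fn_out=$(awk 'FILENAME==ARGV[1] { if (NF>1) a[$1]=$2; next }
              $1 in a { $1=a[$1] }
              1' "$tmp/map" "$tmp/data")

printf 'NR==FNR:  [%s]\nFILENAME: [%s]\n' "$nr_out" "$fn_out"
rm -rf "$tmp"
```

The first run prints nothing because the data lines are mistaken for map lines; the second passes the data file through unchanged, as expected for an empty map.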
Upvotes: 0