mike

Reputation: 49

awk command for string replacement and printing matched and unmatched strings

I want to replace multiple strings (more than a thousand) in File-1 with the matching strings from File-2.

File-1:

Geneid Length s1 s2
1_1 6571 7 8
1_2 5041 3 0
1_3 1032 7 3    
1_4 1212 3 5    
1_5 1071 3 5    
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
2_4 1056 5 1

File-2 (map):

1_1
1_2 k0002
1_3
1_4
1_5 k0006   
2_1
2_2
2_3
2_4 k0528

Expected output:

Geneid Length s1 s2
1_1 6571 7 8
k0002 5041 3 0
1_3 1032 7 3    
1_4 1212 3 5    
k0006 1071 3 5  
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
k0528 1056 5 1

I used the following awk command:

awk '
    NR==FNR {
        a[$1]=$2
        next
    }
    {
        print (($1 in a)?a[$1]:$1, $2, $3, $4)
    }' File-2 File-1 > File-3

which gives me this:

Geneid  Length  s1  s2
 6571 7 8
k0002 5041 3 0
 1032 7 3   
 1212 3 5   
k0006 1071 3 5  
 7171 2 7
 1038 1 1
 9361 0 6
k0528 1056 5 1

How can I modify this awk command to keep the unmatched strings?
Sorry, I am new to Linux and awk (still learning).

Upvotes: 2

Views: 293

Answers (3)

Ed Morton

Reputation: 203655

$ awk '
    NR==FNR { if (NF>1) a[$1]=$2; next }
    $1 in a { $1=a[$1] }
1' file2 file1
Geneid Length s1 s2
1_1 6571 7 8
k0002 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
k0006 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
k0528 1056 5 1
  1. if (NF>1) ensures you only populate a[] with the values from file2 that you need, i.e. those that have a 2nd field.
  2. $1 in a ensures you only change $1 from file1 when an associated entry exists in file2. Do not test a[$1]=="" or anything similar instead, as merely referencing a[$1] creates that entry, so it would populate a[] for every $1 in file1, using up memory and increasing execution time.
  3. 1 at the end causes the current, possibly just-modified, line from file1 to be printed.
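
To illustrate point 2, here is a minimal sketch (the key "x" and the variable names with_in and with_ref are just placeholders for this demo). It counts the entries in a[] after each kind of test on a key that was never assigned:

```shell
# "x" is never assigned; count how many keys a[] holds after each test.
with_in=$(awk 'BEGIN { if ("x" in a) hit = 1; n = 0; for (k in a) n++; print n }')
with_ref=$(awk 'BEGIN { if (a["x"] == "") hit = 1; n = 0; for (k in a) n++; print n }')

echo "keys after in-test: $with_in"          # 0, the in operator does not create the entry
echo "keys after reference-test: $with_ref"  # 1, referencing a["x"] created it
```

With a thousand-plus lines in file1, the second form would quietly add one empty entry per distinct $1.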

Upvotes: 1

tripleee

Reputation: 189427

The expression ($1 in a)?a[$1]:$1 prints either a[$1] or $1 depending on whether $1 is a key in a. But every key from File-2 ends up in a, because a[$1]=$2 stores an entry even when $2 is empty. So for the key 1_1, for example, it prints the empty string, which is the value of a["1_1"]. The solution is to populate a only when there is a value to add for the key in $1.

awk 'NR==FNR { if (NF > 1) a[$1]=$2; next }
{ print (($1 in a)?a[$1]:$1, $2, $3, $4) }' File-2 File-1
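
You can see the underlying problem in isolation with a small sketch like this one (the two map lines are taken from the question; the variable name result is only for the demo). It loads the map the way the original script does and then inspects the entry for 1_1:

```shell
# Load a two-line map the way the original script does, then inspect key 1_1.
result=$(printf '1_1\n1_2 k0002\n' | awk '
    { a[$1] = $2 }      # unconditional: also stores keys whose $2 is empty
    END {
        print (("1_1" in a) ? "1_1 is a key" : "1_1 is not a key")
        printf "value=[%s]\n", a["1_1"]
    }')
printf '%s\n' "$result"
```

This prints "1_1 is a key" followed by "value=[]", which is why the ternary picks the empty string instead of falling back to $1.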

For debugging a script like yours, it helps to add print statements at various points to see what the script is doing. Here's what I ended up doing to figure out what was wrong with your script.

# STILL BUGGY, DEBUGGING RUN
awk 'NR==FNR { print("a[" $1 "]=" $2); a[$1]=$2; next; }
{ print ($1 in a ? a[$1] : $1), $2, $3, $4, ($1 in a), a[$1], $1, ($1 in a ? "yes" : "no"), "end" }' File-2 File-1
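
A variation on the same idea, as a sketch only: send the debug lines to standard error so they don't end up mixed into the real output. The temporary directory, the file names map.txt, data.txt and debug.log, and the shortened sample data are assumptions for this demo; "/dev/stderr" is understood by GNU awk and mawk and exists as a real file on Linux.

```shell
tmpdir=$(mktemp -d)
printf '1_1\n1_2 k0002\n' > "$tmpdir/map.txt"
printf '1_1 6571 7 8\n1_2 5041 3 0\n' > "$tmpdir/data.txt"

result=$(awk '
    NR==FNR {
        # Debug line goes to stderr, not to the normal output.
        printf "map: a[%s]=[%s] NF=%d\n", $1, $2, NF > "/dev/stderr"
        if (NF > 1) a[$1] = $2
        next
    }
    $1 in a { $1 = a[$1] }
    1
' "$tmpdir/map.txt" "$tmpdir/data.txt" 2>"$tmpdir/debug.log")

printf '%s\n' "$result"      # the real output
cat "$tmpdir/debug.log"      # what the script saw while reading the map
```

The debug log shows NF=1 for 1_1, which is the hint that an unconditional a[$1]=$2 would store an empty value there.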

Upvotes: 2

Tyl

Reputation: 5252

Given that File-2 won't be empty:

awk 'NR==FNR{a[$1]=$2;next}a[$1]!=""{$1=a[$1]}1' File-2 File-1
Geneid Length s1 s2
1_1 6571 7 8
k0002 5041 3 0
1_3 1032 7 3
1_4 1212 3 5
k0006 1071 3 5
2_1 7171 2 7
2_2 1038 1 1
2_3 9361 0 6
k0528 1056 5 1

If File-2 can be empty, NR==FNR is also true for every line of File-1, so nothing gets printed. In that case replace NR==FNR with ARGIND==1 (GNU awk only) or FILENAME=="File-2".
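
A small sketch of the empty-map case, under some assumptions: the temporary file names map.txt and data.txt are made up for the demo, and it uses FILENAME==ARGV[1] (a common portable idiom, not something from the answer above) so the file name isn't hard-coded.

```shell
tmpdir=$(mktemp -d)
: > "$tmpdir/map.txt"                                   # empty map file
printf 'Geneid Length s1 s2\n1_2 5041 3 0\n' > "$tmpdir/data.txt"

# NR==FNR: with an empty first file, every data line looks like "first file".
by_nr=$(awk 'NR==FNR { if (NF>1) a[$1]=$2; next }
             $1 in a { $1=a[$1] } 1' "$tmpdir/map.txt" "$tmpdir/data.txt")

# FILENAME==ARGV[1]: only lines that really come from the map file match.
by_name=$(awk 'FILENAME==ARGV[1] { if (NF>1) a[$1]=$2; next }
               $1 in a { $1=a[$1] } 1' "$tmpdir/map.txt" "$tmpdir/data.txt")

printf 'NR==FNR output: [%s]\n' "$by_nr"
printf 'FILENAME output:\n%s\n' "$by_name"
```

The first command prints nothing at all, while the second passes the data through unchanged, which is what you would want when there is nothing to map.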

Upvotes: 0
