Reputation: 107
I am trying to remove the string after the second underscore in the second column with AWK.
Here is my input data:
OTU10015 uncultured_Ascomycota_C31_F02_Lineage=Root Fungi
OTU10071 Fusarium_sp._NRRL_52720_Lineage=Root Fungi
OTU10082 Colletotrichum_dematium_BBA_62147_Lineage=Root Fungi
The expected output is:
OTU10015 uncultured_Ascomycota Fungi
OTU10071 Fusarium_sp. Fungi
OTU10082 Colletotrichum_dematium Fungi
I tried this code:
awk '{sub(/([^_]).*/,"",$2);print $1,$2,$3}' file1> file2
I found this code in another post and tried to modify it, but it removes the entire second column.
How can I further modify the code? Thanks in advance!
Upvotes: 1
Views: 494
Reputation: 85580
Using a regex-based approach with sub() seems like the wrong approach when a function like split() can tackle the problem easily. Your pattern /([^_]).*/ matches from the first non-underscore character through the end of the field, which is why the whole second column disappears.
You just use the split() function to split on the _ and keep only the first two words. This is as minimal as you can get without disturbing the rest of the fields in the file.
awk '{ split($2, arr, "_"); $2=arr[1]"_"arr[2] }1' file
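As a quick check, the one-liner can be run against the sample rows from the question. This is only a sketch: the function name trim_second_field is made up for illustration, and it reads from standard input rather than from a file.

```shell
# Hypothetical wrapper (name is an assumption): reads rows on stdin and keeps
# only the first two "_"-separated pieces of the second field.
trim_second_field() {
    awk '{ split($2, arr, "_"); $2 = arr[1] "_" arr[2] } 1'
}

# Sample rows copied from the question.
printf '%s\n' \
    'OTU10015 uncultured_Ascomycota_C31_F02_Lineage=Root Fungi' \
    'OTU10071 Fusarium_sp._NRRL_52720_Lineage=Root Fungi' \
    'OTU10082 Colletotrichum_dematium_BBA_62147_Lineage=Root Fungi' |
    trim_second_field
```

One caveat: if the second field contains no underscore at all, arr[2] is empty and the result gains a trailing underscore. If that can happen in your data, check the return value of split() (the number of pieces) before reassigning $2.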
Printing the fields manually with print is rarely needed when you are modifying just one of the fields. The trailing 1 in { .. }1 is an always-true pattern whose default action is to print the record, and assigning to any field makes awk rebuild the whole line from its fields, joined by the output field separator (a single space by default). Because only $2 is assigned, the other fields are printed unchanged.
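Because the line is rebuilt with the output field separator, the original spacing between fields is not preserved by default. The sample data in the question appears to be space-separated, so that does not matter there. If the real file were tab-separated (an assumption, the question does not say), a sketch like the following would keep the tabs. The function name trim_second_field_tsv is hypothetical.

```shell
# Hypothetical variant for tab-separated input (the delimiter is an assumption):
# FS makes awk split on tabs, OFS makes it rejoin the rebuilt line with tabs.
trim_second_field_tsv() {
    awk 'BEGIN { FS = OFS = "\t" }
         { split($2, arr, "_"); $2 = arr[1] "_" arr[2] } 1'
}

# One tab-separated row adapted from the question's sample data.
printf 'OTU10015\tuncultured_Ascomycota_C31_F02_Lineage=Root\tFungi\n' |
    trim_second_field_tsv
```

Without the BEGIN block, the same row would still be trimmed correctly, but the tabs in the output would be replaced by single spaces.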
Upvotes: 4