Helen

Reputation: 107

Remove string after second underscore with AWK

I am trying to remove the string after the second underscore in the second column with AWK.

Here is my input data:

OTU10015    uncultured_Ascomycota_C31_F02_Lineage=Root  Fungi
OTU10071    Fusarium_sp._NRRL_52720_Lineage=Root    Fungi
OTU10082    Colletotrichum_dematium_BBA_62147_Lineage=Root  Fungi

The expected output is:

OTU10015    uncultured_Ascomycota   Fungi
OTU10071    Fusarium_sp.    Fungi
OTU10082    Colletotrichum_dematium   Fungi

I tried this code:

awk '{sub(/([^_]).*/,"",$2);print $1,$2,$3}' file1> file2

I found this code in another post and tried to modify it, but it removes the entire second column.

How can I modify the code further? Thanks in advance!

Upvotes: 1

Views: 494

Answers (1)

Inian

Reputation: 85580

A regex-based approach with sub() seems like the wrong tool here when a function like split() handles the problem easily.

Use split() to break the second field on _ and keep only the first two pieces. This is about as minimal as it gets, and it leaves the values of the other fields untouched.

awk '{ split($2, arr, "_"); $2=arr[1]"_"arr[2] }1' file
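As a quick check, here is a sketch that feeds the three sample rows from the question through the same command. The variable name out is only for illustration, and the rows are assumed to be separated by runs of spaces, as they appear in the post.

```shell
# Sample rows copied from the question; columns separated by runs of spaces.
out=$(printf '%s\n' \
  'OTU10015    uncultured_Ascomycota_C31_F02_Lineage=Root  Fungi' \
  'OTU10071    Fusarium_sp._NRRL_52720_Lineage=Root    Fungi' \
  'OTU10082    Colletotrichum_dematium_BBA_62147_Lineage=Root  Fungi' |
  awk '{ split($2, arr, "_"); $2 = arr[1] "_" arr[2] } 1')

printf '%s\n' "$out"
# OTU10015 uncultured_Ascomycota Fungi
# OTU10071 Fusarium_sp. Fungi
# OTU10082 Colletotrichum_dematium Fungi
```

Note that the columns come out separated by a single space, which is awk's default output field separator.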

Printing the fields manually with print is rarely needed when you are modifying just one field. The trailing 1 in { .. }1 is an always-true pattern whose default action prints the current record. Because $2 was assigned, awk rebuilds the whole line from its fields, joined by OFS (a single space by default), so the printed line carries the modification.
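If the real file is tab-delimited (an assumption, since the post only shows whitespace), rebuilding the line would replace the tabs with single spaces. A minimal sketch of a variant that keeps tabs and leaves fields with fewer than three _-separated parts alone might look like this. The variable name out_tab and the tab-separated sample row are made up for illustration.

```shell
# Assumption: columns are tab-separated. This row is a tab-separated
# copy of the first sample row from the question.
out_tab=$(printf 'OTU10015\tuncultured_Ascomycota_C31_F02_Lineage=Root\tFungi\n' |
  awk 'BEGIN { FS = OFS = "\t" }   # split on tabs and rebuild with tabs
       {
         n = split($2, arr, "_")   # n is the number of pieces
         if (n > 2)                # only trim when there is something to drop
           $2 = arr[1] "_" arr[2]
       } 1')

printf '%s\n' "$out_tab"
```

The n > 2 guard avoids appending a stray trailing underscore when the second column has no underscore at all.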

Upvotes: 4
