Reputation: 3022
I have a file on which I am trying to use awk to remove the text before the (), but keep the text in the (). I am also trying to remove the whitespace and text after the _# and then output the entire line. Maybe sed is a better choice, but I am not certain how.
file
chr4 100009839 100009851 426_1201_128(ADH5)_1 0 -
chr4 100006265 100006367 426_1202_128(ADH5)_2 0 -
chr4 100003125 100003267 426_1203_128(ADH5)_3 0 -
desired output
chr4 100009839 100009851 ADH5_1
chr4 100006265 100006367 ADH5_2
chr4 100003125 100003267 ADH5_3
awk
awk -F'()_*' '{print $1,$2,$3,$4}' file
Upvotes: 0
Views: 285
Reputation: 52112
Using sed with a substitution:
$ sed 's/[^ ]*(\([^)]*\))\(_[^ ]*\).*$/\1\2/' infile
chr4 100009839 100009851 ADH5_1
chr4 100006265 100006367 ADH5_2
chr4 100003125 100003267 ADH5_3
Taking apart the regex:
[^ ]*(      # Non-spaces up to and including opening parenthesis
\(          # Start first capture group
    [^)]*   # Content between parentheses: everything but a closing parenthesis
\)          # End of first capture group
)           # Closing parenthesis, not captured
\(          # Start second capture group
    _[^ ]*  # Underscore and non-spaces, '_1' etc.
\)          # End of second capture group
.*$         # Rest of line, not captured
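For comparison, here is a sketch of the same substitution in extended regex syntax, which some find easier to read. With `-E` (supported by GNU and BSD sed), the escaping is reversed: bare `( )` group and `\( \)` match literal parentheses. The variable names `line` and `result` are only for illustration, and the sample line is assumed to be space-separated as shown in the question.

```shell
# Sample input line from the question (assumed space-separated).
line='chr4 100009839 100009851 426_1201_128(ADH5)_1 0 -'

# -E: ( ) are capture groups, \( \) are literal parentheses.
result=$(printf '%s\n' "$line" | sed -E 's/[^ ]*\(([^)]*)\)(_[^ ]*).*$/\1\2/')

printf '%s\n' "$result"
# prints: chr4 100009839 100009851 ADH5_1
```

Like the basic-regex version, this relies on `[^ ]*` stopping at a space, so it would need `[^[:space:]]*` instead if the columns were separated by tabs.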
Upvotes: 1
Reputation: 88563
awk -F'[\t()]' -v OFS='\t' '{print $1, $2, $3, $5 $6}' file
Output:
chr4 100009839 100009851 ADH5_1
chr4 100006265 100006367 ADH5_2
chr4 100003125 100003267 ADH5_3
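The command above splits on tabs, so it assumes a tab-separated file. If it is unclear whether the columns are separated by spaces or tabs, a possible alternative is to keep awk's default whitespace splitting and edit the fourth field in place. This is only a sketch, not the answer's method; the variables `input` and `result` are illustrative, and the output is joined with single spaces.

```shell
# Two sample lines: the first space-separated, the second tab-separated.
input=$(printf 'chr4 100009839 100009851 426_1201_128(ADH5)_1 0 -\nchr4\t100006265\t100006367\t426_1202_128(ADH5)_2\t0\t-\n')

result=$(printf '%s\n' "$input" | awk '{
    sub(/^[^(]*\(/, "", $4)   # drop everything up to and including "("
    sub(/\)/, "", $4)         # drop the closing ")"
    print $1, $2, $3, $4      # fields joined by OFS (a single space by default)
}')

printf '%s\n' "$result"
# prints:
# chr4 100009839 100009851 ADH5_1
# chr4 100006265 100006367 ADH5_2
```

Adding `-v OFS='\t'` would give tab-separated output instead.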
Upvotes: 1