justaguy

Reputation: 3022

awk or sed to remove text in file before character and then after character

I have a file in which I am trying to use awk to remove the text before the parentheses in the fourth column while keeping the text inside them. I also want to remove the whitespace and text after the _# suffix, then print the rest of the line. Maybe sed is a better choice, but I am not sure how to do it.

file

chr4    100009839   100009851   426_1201_128(ADH5)_1    0   -
chr4    100006265   100006367   426_1202_128(ADH5)_2    0   -
chr4    100003125   100003267   426_1203_128(ADH5)_3    0   -

desired output

chr4    100009839   100009851   ADH5_1  
chr4    100006265   100006367   ADH5_2  
chr4    100003125   100003267   ADH5_3

awk

awk -F'()_*' '{print $1,$2,$3,$4}' file

Upvotes: 0

Views: 285

Answers (2)

Benjamin W.

Reputation: 52112

Using sed with a substitution:

$ sed 's/[^ ]*(\([^)]*\))\(_[^ ]*\).*$/\1\2/' infile
chr4    100009839   100009851   ADH5_1
chr4    100006265   100006367   ADH5_2
chr4    100003125   100003267   ADH5_3

Taking apart the regex:

[^ ]*(       # Non-spaces up to and including opening parenthesis
\(           # Start first capture group
    [^)]*    # Content between parentheses: everything but a closing parenthesis
\)           # End of first capture group
)            # Closing parenthesis, not captured
\(           # Start second capture group
    _[^ ]*   # Underscore and non-spaces, '_1' etc.
\)           # End of second capture group
.*$          # Rest of line, not captured
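One caveat: `[^ ]` only excludes the space character, so a tab counts as an ordinary character. If the real file is tab-separated (an assumption, since the sample in the question is shown with spaces), the command above would also swallow the leading columns and keep the trailing ones. A minimal sketch of a variant that should handle both cases uses the POSIX class `[^[:space:]]` instead; the sample line is built with `printf` purely for illustration:

```shell
# One line from the question, written with tabs (assumed, not confirmed by the question).
line=$(printf 'chr4\t100009839\t100009851\t426_1201_128(ADH5)_1\t0\t-')

# [^[:space:]] excludes tabs as well as spaces, so the match
# cannot start before the fourth column or run past the "_1" suffix.
out=$(printf '%s\n' "$line" |
  sed 's/[^[:space:]]*(\([^)]*\))\(_[^[:space:]]*\).*$/\1\2/')

printf '%s\n' "$out"
```

The first three columns and their tab separators are left alone, and everything after the `_1` suffix is dropped.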

Upvotes: 1

Cyrus

Reputation: 88563

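With awk, assuming the file is tab-separated, split each line on tabs and parentheses, then print the first three fields followed by the fifth and sixth joined together:

awk -F'[\t()]' 'BEGIN {OFS="\t"} {print $1, $2, $3, $5 $6}' file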

Output:

chr4    100009839       100009851       ADH5_1
chr4    100006265       100006367       ADH5_2
chr4    100003125       100003267       ADH5_3
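To see why `$5 $6` is the right pair, it can help to dump every field that this separator produces. The following is only a sketch on one sample line, again assuming tab-separated input; the variable name `fields` is made up for the example:

```shell
# Print each field with its number, using the same field separator as above.
fields=$(printf 'chr4\t100009839\t100009851\t426_1201_128(ADH5)_1\t0\t-\n' |
  awk -F'[\t()]' '{ for (i = 1; i <= NF; i++) printf "$%d = %s\n", i, $i }')

printf '%s\n' "$fields"
```

The opening parenthesis ends `$4` (the unwanted `426_1201_128` prefix), so `$5` holds `ADH5` and `$6` holds `_1`; concatenating them without a comma gives `ADH5_1`.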

Upvotes: 1
