Reputation: 61
I have a large tab-delimited file in which I'd like to keep only a certain string (GO:#######) that appears multiple (and variable) times in each line, as well as the otherwise blank lines that contain only a period. When I use sed to replace all the non-GO strings, it removes the entire middle of the line. How do I prevent this?
SED command I'm using and other permutations
sed -r 's/\t`.+`\t//g' file1.txt > file2.txt
What I have
GO:1234567 `text1`moretext` GO:5373845 `diff`text` GO:5438534 `text`text
.
GO:3333333 `txt`text` GO:5553535 `misc`text
.
.
What I'd like
GO:1234567 GO:5373845 GO:5438534
.
GO:3333333 GO:5553535
.
.
What I get
GO:1234567 GO:5438534 `text`text
.
GO:3333333 GO:5553535 `misc`text
.
.
Upvotes: 3
Views: 256
Reputation: 163352
The pattern \t`.+`\t matches from the first tab followed by a backtick up to the last backtick followed by a tab on the line, because .+ is greedy. That matches too much.
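A minimal sketch of the difference, assuming a POSIX shell and a sed that supports -E. The sample line and the variable names are made up for illustration, and the tab is passed in through a shell variable so the example does not depend on sed understanding \t.

```shell
# Hypothetical sample line: three GO terms, fields separated by single tabs.
tab=$(printf '\t')
line=$(printf 'GO:1\t`a`b`\tGO:2\t`c`d`\tGO:3\t`e`f')

# Greedy: .+ runs on to the last "backtick, tab" pair, so GO:2 is swallowed.
greedy=$(printf '%s\n' "$line" | sed -E "s/${tab}\`.+\`${tab}//g")

# Bounded: [^tab]* cannot cross a field boundary, so only backtick fields go.
bounded=$(printf '%s\n' "$line" | sed -E "s/${tab}\`[^${tab}]*//g")

printf '%s\n' "$greedy" "$bounded"
```

The first line printed is GO:1GO:3 followed by a tab and `e`f. The second is the three GO terms separated by tabs.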
There don't seem to be any spaces in the parts that start with a backtick that you want to remove.
I think awk is better suited for this task, but with sed you can remove all strings that start with a backtick ` followed by non-whitespace characters.
If you remove multiple consecutive fields, or a field at the start or end, leading or trailing tabs and runs of multiple tabs can be left behind. The second and third substitutions remove the former and collapse the latter into a single tab.
sed -E 's/(\t|^)`[^[:space:]]*//g;s/^\t+|\t+$//g;s/\t{2,}/\t/g' file
The tab delimited content of file
GO:1234567 `text1`moretext` GO:5373845 `diff`text` GO:5438534 `text`text
.
GO:3333333 `txt`text` GO:5553535 `misc`text
..
`txt`text` GO:3333333 `txt`text` `txt`text` `txt`text` GO:5553535 `misc`text `misc`text
Output
GO:1234567 GO:5373845 GO:5438534
.
GO:3333333 GO:5553535
..
GO:3333333 GO:5553535
Upvotes: 0
Reputation: 785156
This awk solution would work with any version of awk:
awk '
BEGIN {
FS=OFS="\t"
}
{
for (i=1; i<=NF; ++i)
if ($i ~ /^GO:/)
s = (s ? s OFS : "") $i
print s
s = ""
}' file
GO:1234567 GO:5373845 GO:5438534
GO:3333333 GO:5553535
GO:3333333
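The question's desired output keeps the lines that contain only a period. A possible variation for that case, sketched under the assumption that lines made up only of periods should pass through unchanged (the function name keep_go is made up for illustration):

```shell
# keep_go is a hypothetical helper; it reads tab-delimited text on stdin.
keep_go() {
    awk '
    BEGIN { FS = OFS = "\t" }
    # Assumption: placeholder lines made up only of periods pass through as is.
    /^\.+$/ { print; next }
    {
        s = ""
        # Collect every field that starts with GO:, joined by tabs.
        for (i = 1; i <= NF; ++i)
            if ($i ~ /^GO:/)
                s = (s == "" ? $i : s OFS $i)
        print s
    }'
}

# Made-up sample input: one data line and one placeholder line.
printf 'GO:1\t`a`b`\tGO:2\n.\n' | keep_go
```

On real data this would be called as keep_go < file.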
Upvotes: 2
Reputation: 117298
sed -E 's/\t`[^\t]*//g'
\t - tab
` - a literal backtick
[^\t]* - any non-tab character 0 or more times
Alternative:
sed -E 's/\t(`[^`]*){2}`?//g'
\t - tab
( - start of group
` - a literal backtick
[^`]* - any non-backticks 0 or more times
) - end of group
{2} - repeat group twice
`? - an optional backtick (since the last column only has 2 instead of 3)
... and substitute with an empty string.
Output:
GO:1234567 GO:5373845 GO:5438534
.
GO:3333333 GO:5553535
.
.
Note: These examples assume that there is exactly one tab between columns. It's hard to see here.
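One way to check that assumption is to make the tabs visible. A small sketch using sed's l command, which prints each line with tabs shown as \t and the end of the line marked with $. The sample input is made up; on real data the input would come from the file instead.

```shell
# Hypothetical two-column sample with exactly one tab between the columns.
visible=$(printf 'GO:1\t`a`b`\n' | sed -n 'l')

# Tabs show up as \t, and the end of the line is marked with $.
printf '%s\n' "$visible"
```

Two adjacent \t\t in that output would indicate an empty column. Long lines are folded by the l command, with a trailing backslash marking each fold.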
Upvotes: 2
Reputation: 88626
With GNU awk:
awk 'BEGIN{FPAT="GO:[0-9]+"; OFS="\t"} {$1=$1; print}' file
Output is tab delimited:
GO:1234567 GO:5373845 GO:5438534 GO:3333333 GO:5553535
From man awk:
FPAT: A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of FS as the field separator.
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
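FPAT is specific to gawk. Where only a POSIX awk is available, a similar effect can be sketched with match() in a loop. This is an illustrative alternative rather than part of the answer above, and the function name extract_go is made up:

```shell
# extract_go is a hypothetical helper; it reads text on stdin.
extract_go() {
    awk '
    {
        rest = $0
        out = ""
        # Pull out every GO:<digits> token, left to right.
        while (match(rest, /GO:[0-9]+/)) {
            tok = substr(rest, RSTART, RLENGTH)
            out = (out == "" ? tok : out "\t" tok)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out
    }'
}

# Made-up sample input: one data line and one placeholder line.
printf 'GO:12\t`a`b`\tGO:34\n.\n' | extract_go
```

Lines without a GO term come out empty in this sketch.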
Upvotes: 3
Reputation: 141020
I would explicitly match non-backtick characters:
s/`[^`]*`[^`]*`//
Regexes are greedy: `.+` matches anything, from the first backtick up to the last backtick.
Upvotes: 0