timtimbruno

Reputation: 61

removing multiple instances of a string in a line with sed

I have a large tab-delimited file from which I'd like to keep only a certain string (GO:#######) that appears multiple (and variable) times in each line, as well as the placeholder lines that contain only a period. When I use sed to remove all the non-GO strings, it removes the entire middle of the line. How do I prevent this?

The sed command I'm using (I have also tried other permutations):

sed -r 's/\t`.+`\t//g' file1.txt > file2.txt

What I have

GO:1234567    `text1`moretext`    GO:5373845    `diff`text`     GO:5438534     `text`text
.
GO:3333333     `txt`text`    GO:5553535    `misc`text
.
.

What I'd like

GO:1234567    GO:5373845    GO:5438534
.
GO:3333333    GO:5553535
.
.

What I get

GO:1234567    GO:5438534     `text`text
.
GO:3333333    GO:5553535    `misc`text
.
.

Upvotes: 3

Views: 256

Answers (5)

The fourth bird

Reputation: 163352

The pattern \t`.+`\t matches too much: .+ is greedy, so the match runs from the first tab followed by a backtick up to the last backtick followed by a tab on the line.
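
To see the overmatch in isolation, the snippet below runs one sample line through the question's pattern and through a pattern that cannot cross whitespace. It is only a sketch: it assumes GNU sed (for \t inside the regex), and the function names are made up for illustration.

```shell
# Assumes GNU sed, which understands \t inside a regex.
# greedy_strip and bounded_strip are illustrative names, not from the question.

# The question's pattern: .+ is greedy and runs to the LAST backtick+tab on the line.
greedy_strip() {
  sed -E 's/\t`.+`\t//g'
}

# [^[:space:]]* cannot cross a tab, so each match stays inside one field.
bounded_strip() {
  sed -E 's/\t`[^[:space:]]*//g'
}

line='GO:1234567\t`text1`moretext`\tGO:5373845\t`diff`text`\tGO:5438534\t`text`text\n'

printf '%b' "$line" | greedy_strip    # loses GO:5373845 along with the text around it
printf '%b' "$line" | bounded_strip   # keeps all three GO terms
```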

There don't seem to be any spaces in the parts that start with a backtick that you want to remove.

I think awk is better suited to this task, but with sed you can remove every string that starts with a backtick ` followed by non-whitespace characters.

If you remove multiple consecutive fields, or a field at the start or end of the line, runs of tabs can be left behind. Leading and trailing tabs can then be replaced with an empty string, and repeated tabs with a single tab.

sed -E 's/(\t|^)`[^[:space:]]*//g;s/^\t+|\t+$//g;s/\t{2,}/\t/g' file

The tab delimited content of file

GO:1234567  `text1`moretext`    GO:5373845  `diff`text` GO:5438534  `text`text
.
GO:3333333  `txt`text`  GO:5553535  `misc`text
..
`txt`text`  GO:3333333  `txt`text`  `txt`text`  `txt`text`  GO:5553535  `misc`text  `misc`text

Output

GO:1234567      GO:5373845      GO:5438534
.
GO:3333333      GO:5553535
..
GO:3333333      GO:5553535

Upvotes: 0

anubhava

Reputation: 785156

This awk solution would work with any version of awk:

awk '
BEGIN {
   FS=OFS="\t"
}
{
   for (i=1; i<=NF; ++i)
      if ($i ~ /^GO:/)
         s = (s ? s OFS : "") $i
   print s
   s = ""
}' file

GO:1234567  GO:5373845  GO:5438534
GO:3333333  GO:5553535
GO:3333333
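
The question also asks to keep the placeholder lines that contain only a period, which the loop above prints as empty lines. A possible variant, assuming such lines consist of exactly one period and nothing else, passes them through unchanged (keep_go is just an illustrative wrapper name):

```shell
# keep_go is a hypothetical wrapper name; the "." test assumes placeholder
# lines contain a single period and nothing else.
keep_go() {
  awk '
  BEGIN { FS = OFS = "\t" }
  $0 == "." { print; next }          # pass placeholder lines through unchanged
  {
    s = ""
    for (i = 1; i <= NF; ++i)
      if ($i ~ /^GO:/)               # keep only fields that start with GO:
        s = (s == "" ? "" : s OFS) $i
    print s
  }'
}

printf 'GO:3333333\t`txt`text`\tGO:5553535\t`misc`text\n.\n' | keep_go
```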

Upvotes: 2

Ted Lyngmo

Reputation: 117298

sed -E 's/\t`[^\t]*//g'
  • \t - tab
  • ` - a literal backtick
  • [^\t]* - any non-tab character 0 or more times

Alternative:

sed -E 's/\t(`[^`]*){2}`?//g'
  • \t - tab
  • ( - start of group
    • ` - a literal backtick
    • [^`]* - any non-backticks 0 or more times
  • ) - end of group
  • {2} - repeat group twice
  • `? - an optional backtick (since the last column only has 2 instead of 3)

... and substitute with an empty string.

Output:

GO:1234567      GO:5373845      GO:5438534
.
GO:3333333      GO:5553535
.
.

Note: These examples assume that there is exactly one tab between columns, which is hard to see here.
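
Since tabs and spaces look alike on screen, it can help to check the separators before running either command. One way, sketched here with a made-up sample line, is to let awk count the tab-separated fields or to dump the bytes with od -c:

```shell
# The sample line is invented for illustration; real input would come from the file.
sample='GO:1234567\t`text1`moretext`\tGO:5373845\n'

# Number of tab-separated fields per line (3 here, since single tabs are used).
printf '%b' "$sample" | awk -F '\t' '{ print NF }'

# Byte-level view: tabs are shown as \t, so stray spaces become visible.
printf '%b' "$sample" | od -c
```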

Upvotes: 2

Cyrus

Reputation: 88626

With GNU awk:

awk 'BEGIN{FPAT="GO:[0-9]+"; OFS="\t"} {$1=$1; print}' file

Output is tab delimited:

GO:1234567  GO:5373845  GO:5438534

GO:3333333  GO:5553535

From man awk:

FPAT: A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of FS as the field separator.

See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
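
FPAT is a gawk extension and has no special meaning in other awk implementations. A rough portable equivalent, assuming only POSIX awk features (match, RSTART, RLENGTH), walks over the matches by hand. Unlike the FPAT version, this sketch prints lines without any GO term (such as a lone period) unchanged; extract_go is an illustrative name.

```shell
# extract_go is a hypothetical helper; it relies only on POSIX awk features.
extract_go() {
  awk '{
    out = ""
    rest = $0
    # Repeatedly find the next GO:<digits> token and cut it off the front.
    while (match(rest, /GO:[0-9]+/)) {
      out = (out == "" ? "" : out "\t") substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)
    }
    # Lines without any GO term (for example ".") are printed unchanged.
    print (out == "" ? $0 : out)
  }'
}

printf 'GO:1234567\t`text1`moretext`\tGO:5373845\n.\n' | extract_go
```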

Upvotes: 3

KamilCuk

Reputation: 141020

I would explicitly match non-backtick characters:

s/`[^`]*`[^`]*`//

Regex quantifiers are greedy: `.+` matches anything, from the first backtick up to the last backtick on the line.
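
As a rough illustration, the expression can be applied globally and followed by a second substitution that collapses the tabs left behind. This assumes GNU sed (for \t) and a made-up sample line in which every backtick field contains exactly three backticks; strip_backtick_fields is an illustrative name.

```shell
# Assumes GNU sed for \t; strip_backtick_fields is an illustrative name.
strip_backtick_fields() {
  # 1st expression: remove each three-backtick run without crossing another backtick.
  # 2nd expression: squeeze the runs of tabs that remain into a single tab.
  sed 's/`[^`]*`[^`]*`//g; s/\t\t*/\t/g'
}

printf 'GO:1234567\t`text1`moretext`\tGO:5373845\t`diff`text`\tGO:5438534\n' |
  strip_backtick_fields
```

A trailing field with only two backticks, like the last column in the question's data, is not matched by this expression and would need separate handling.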

Upvotes: 0
