Billy Mills

Reputation: 99

Inserting text to make every row start the same way

I have a file that contains rows with various bits of information. Unfortunately, the file was compiled from many different sources, so some of the rows have more information than others:

1_10090 Mus musculus (house mouse) Gene 1
2_10090 Mus musculus Gene 2
3_10090 (house mouse) Gene 3
4_10090 Gene 4

My desired output is:

1_10090 Mus musculus (house mouse) Gene 1
2_10090 Mus musculus (house mouse) Gene 2
3_10090 Mus musculus (house mouse) Gene 3
4_10090 Mus musculus (house mouse) Gene 4

My idea was to use sed to insert 'Mus musculus' wherever _10090 is followed by a character other than 'M' (and then do the same for the lines that don't contain (house mouse)):

sed 's/_10090 [^M]/_10090 Mus musculus /g'

But this, of course, also replaces the character after _10090, which I want to retain:

1_10090 Mus musculus (house mouse) Gene 1
2_10090 Mus musculus Gene 2
3_10090 Mus musculus house mouse) Gene 3
4_10090 Mus musculus ene 4

How might I go about ensuring that every line starts with

#_10090 Mus musculus (house mouse)

where # differs from row to row, and some lines already contain this string while others contain only parts of it?

Thanks in advance!

Upvotes: 1

Views: 88

Answers (1)

Wiktor Stribiżew

Reputation: 627044

You can use

sed -E 's/_10090( Mus musculus)?( \(house mouse\))? /_10090 Mus musculus (house mouse) /g' file

The approach is slightly different, since the logic you describe does not produce the expected result. Instead, I match an optional ' Mus musculus' substring followed by an optional ' (house mouse)' substring after _10090 (requiring a space after them), and then write both substrings in the replacement, so they are added where missing and put back where they were already present.

See an online demo:

#!/bin/bash
s='1_10090 Mus musculus (house mouse) Gene 1
2_10090 Mus musculus Gene 2
3_10090 (house mouse) Gene 3
4_10090 Gene 4'
sed -E 's/_10090( Mus musculus)?( \(house mouse\))? /_10090 Mus musculus (house mouse) /g' <<< "$s"

Output:

1_10090 Mus musculus (house mouse) Gene 1
2_10090 Mus musculus (house mouse) Gene 2
3_10090 Mus musculus (house mouse) Gene 3
4_10090 Mus musculus (house mouse) Gene 4
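
For comparison, the two-step idea from the question can also be made to work by capturing the character that follows and putting it back with a backreference. This is only a sketch: the function name fix_two_pass is made up for illustration, and it assumes that no other field right after _10090 starts with 'M' and that nothing right after 'Mus musculus' other than (house mouse) starts with '('.

```shell
#!/bin/bash
# fix_two_pass is a hypothetical helper name, not part of sed.
# Pass 1: if "_10090 " is followed by a character other than M, capture that
#         character and re-emit it (\1) after the inserted "Mus musculus ".
# Pass 2: if "Mus musculus " is followed by a character other than "(",
#         capture it and re-emit it after the inserted "(house mouse) ".
fix_two_pass() {
  sed -E \
    -e 's/_10090 ([^M])/_10090 Mus musculus \1/' \
    -e 's/Mus musculus ([^(])/Mus musculus (house mouse) \1/'
}

# Sample rows taken from the question.
s='1_10090 Mus musculus (house mouse) Gene 1
2_10090 Mus musculus Gene 2
3_10090 (house mouse) Gene 3
4_10090 Gene 4'

fix_two_pass <<< "$s"
```

On the four sample rows this prints the same output as above. The single-pattern command with optional groups is still the safer choice, because it does not depend on the first letter of whatever field comes next.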

Upvotes: 2
