Reputation: 99
I have a file whose rows contain various bits of information. Unfortunately, the file was compiled from many different sources, so some rows have more information than others:
1_10090 Mus musculus (house mouse) Gene 1
2_10090 Mus musculus Gene 2
3_10090 (house mouse) Gene 3
4_10090 Gene 4
My desired output is:
1_10090 Mus musculus (house mouse) Gene 1
2_10090 Mus musculus (house mouse) Gene 2
3_10090 Mus musculus (house mouse) Gene 3
4_10090 Mus musculus (house mouse) Gene 4
My idea was to use sed to insert 'Mus musculus' on every line where _10090 is followed by a character other than 'M' (and then do the same for lines that lack (house mouse)):
sed 's/_10090 [^M]/_10090 Mus musculus /g'
But this, of course, also replaces the character after _10090, which I want to retain:
1_10090 Mus musculus (house mouse) Gene 1
2_10090 Mus musculus Gene 2
3_10090 Mus musculus house mouse) Gene 3
4_10090 Mus musculus ene 4
How might I go about ensuring that every line starts with
#_10090 Mus musculus (house mouse)
where # is different between every row and some lines already contain this string and others contain parts of it?
Thanks in advance!
Upvotes: 1
Views: 88
Reputation: 627044
You can use
sed -E 's/_10090( Mus musculus)?( \(house mouse\))? /_10090 Mus musculus (house mouse) /g' file
The approach is slightly different from the logic you described, since that logic does not produce the expected results. Namely, I match an optional ' Mus musculus' and then an optional ' (house mouse)' substring after a _10090 substring (followed by a space), and the replacement writes both substrings back whether or not they were present.
See an online demo:
#!/bin/bash
s='1_10090 Mus musculus (house mouse) Gene 1
2_10090 Mus musculus Gene 2
3_10090 (house mouse) Gene 3
4_10090 Gene 4'
sed -E 's/_10090( Mus musculus)?( \(house mouse\))? /_10090 Mus musculus (house mouse) /g' <<< "$s"
Output:
1_10090 Mus musculus (house mouse) Gene 1
2_10090 Mus musculus (house mouse) Gene 2
3_10090 Mus musculus (house mouse) Gene 3
4_10090 Mus musculus (house mouse) Gene 4
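For comparison, the two-pass idea from the question can also be made to keep the following character by capturing it in a group and writing it back with a backreference. This is only a minimal sketch: the variable names are illustrative, and it assumes the text after the species fields never starts with 'M' or '(' (a gene name such as "Myc" would be skipped), which is why the optional-group version above is the more robust choice.

```shell
#!/bin/bash
# Sample rows taken from the question.
s='1_10090 Mus musculus (house mouse) Gene 1
2_10090 Mus musculus Gene 2
3_10090 (house mouse) Gene 3
4_10090 Gene 4'

# Pass 1: if _10090 is followed by something other than "M",
#         insert "Mus musculus" and put the captured character back (\2).
# Pass 2: if "Mus musculus" is followed by something other than "(",
#         insert "(house mouse)" and put the captured character back (\2).
# Assumption: the remaining text never begins with "M" or "(".
result=$(sed -E \
  -e 's/(_10090) ([^M])/\1 Mus musculus \2/' \
  -e 's/(_10090 Mus musculus) ([^(])/\1 (house mouse) \2/' <<< "$s")

printf '%s\n' "$result"
```

Both expressions run on each line in order, so a line that gains 'Mus musculus' in the first pass is still checked for '(house mouse)' in the second.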
Upvotes: 2