Reputation: 331
I am trying to convert an input file with this content:
NP_418770.2: 257-296 344-415 503-543 556-592 642-707
YP_026226.4: 741-779 811-890 896-979 1043-1077
to this:
NP_418770.2: 257-296, 344-415, 503-543, 556-592, 642-707
YP_026226.4: 741-779, 811-890, 896-979, 1043-1077
i.e., replace each space with a comma and a space (excluding the newline).
For that, I have tried:
perl -pi.bak -e "s/[^\S\n]+/, /g" input.txt
but it gives:
NP_418770.2:, 257-296, 344-415, 503-543, 556-592, 642-707
YP_026226.4:, 741-779, 811-890, 896-979, 1043-1077
How can I avoid the extra comma that appears after ":" (I want ":" followed by a single space) without writing another regex?
Thanks
Upvotes: 8
Views: 2570
Reputation: 89584
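You can use a word boundary to skip the space that follows the colon. There is a word boundary between a digit and the following space, but none between ":" and a space, since neither is a word character: s/\b\h+/, /g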
It can be done with perl:
perl -pe's/\b\h+/, /g' file
but also with sed:
sed -E 's/\b[ \t]+/, /g' file
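Another approach splits each line on the same pattern and joins the fields with the output field separator:
perl -F'\b\h+' -ane 'BEGIN{$,=", "} print @F' file
or do the same with GNU awk, which spells the word boundary \y and needs $1=$1 to rebuild the record with OFS:
awk -F'\\y[ \t]+' -v OFS=', ' '{$1=$1} 1' file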
Upvotes: 4
Reputation: 7672
Try using a negative lookbehind. It checks whether the character before the whitespace is a colon (:) and, if so, does not match that whitespace.
s/(?<!:)[^\S\n]+/, /g
Upvotes: 10
Reputation: 1030
You were close. That should do the trick:
s/(\d+-\d+)[^\S\n]+/$1, /g
The idea is to match only the parts that should get a comma after them, that is, the pattern "digits, a dash, more digits, then whitespace that is not a newline". The "whitespace that is not a newline" part is written [^\S\n]+, which reads as "not a non-whitespace character and not a newline" (\S is everything that is not \s, and we want to exclude the newline too). If you have trailing whitespace, you can trim it with s/\s+$// before running the regex above; just don't forget to add the newline character back afterwards.
Upvotes: 2