Reputation: 103
I have a bunch of XML files containing texts (transcriptions of a diary). At the end of sentences, the requirement is that there be two whitespaces after the period. At the moment, this is partially done, but not in all cases: sometimes there is only a single whitespace after the period before the first character of the next sentence.
I'm using Git Bash for Windows and think that sed is the command to use, but I don't know the correct regular expression. I think I need to find:
period whitespace [some other character]
and replace with
period whitespace whitespace [the same next character]
For example, right now we have this:
<p>The spacing after this sentence (two whitespaces) is what is required.  By contrast, this sentence has only a single space after the period. This is the next sentence, the last in a paragraph, which correctly has no whitespace at all after the period.</p>
What I need is this, where each period is followed by two whitespaces, apart from the last in the paragraph.
<p>The double whitespace after this sentence is what is required.  This sentence now also has a double space after the period.  This is the next sentence, the last in a paragraph, which correctly has no whitespace at all after the period.</p>
Upvotes: 1
Views: 82
Reputation: 6272
sed is a little bit limited (can you use grep or perl?). Anyway, you can use a regex like this (GNU sed specific):
sed -i -r 's/\. ([^ ])/.  \1/g' <file>
Legend
-i # sed switch: edit in place the file passed as parameter
-r # use extended regex
/\. ([^ ])/ # match a single dot followed by one space and then a non-space character
/.  \1/ # replace with a dot followed by 2 spaces and the captured non-space character
g # apply to every match on the line, not just the first
The regex could be refined if needed with more test cases.
As @ghoti pointed out, the answer above was GNU sed specific. I think a more portable approach (without extended regex and without in-place editing) could be:
sed 's/\. \([^ ]\)/.  \1/g' input.file > output.file
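To check how the portable form behaves on mixed input, here is a small sketch. The function name fix_spacing and the sample paragraph are made up for illustration; only the sed expression comes from the answer above.

```shell
# fix_spacing is a hypothetical helper name, not part of sed.
fix_spacing() {
  # ". " followed by a non-space becomes ".  " plus that same character.
  # A period already followed by two spaces is left alone, because the
  # character after the first space is a space and [^ ] rejects it.
  sed 's/\. \([^ ]\)/.  \1/g'
}

printf '%s\n' '<p>One.  Two. Three.</p>' | fix_spacing
# prints: <p>One.  Two.  Three.</p>
```

The final period is untouched because it is followed directly by </p> rather than by a space.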
Upvotes: 1
Reputation: 29178
With sed you can do this (\< and \b are GNU extensions that match at a word boundary):
sed -e "s/\. \</.  /g"
Here are the changes:
$ sed -e "s/\. \b/.  /g" test.txt > fixed.txt
$ diff test.txt fixed.txt
1c1
< <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas vehicula placerat nisl, bibendum blandit tortor pharetra ut. Morbi nec tellus ultrices, porta felis et, dapibus diam. Phasellus vehicula ante ac urna elementum lacinia.</p>
---
> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.  Maecenas vehicula placerat nisl, bibendum blandit tortor pharetra ut.  Morbi nec tellus ultrices, porta felis et, dapibus diam.  Phasellus vehicula ante ac urna elementum lacinia.</p>
Upvotes: -1
Reputation: 349
You want to find every run of whitespace after a dot and remember the next character. Then replace it with ".  " (dot, two spaces) and whatever the remembered character was. The remembering part is called a "tagged expression" (also known as a capture group).
So, search for \. +([^ ])
which means "dot, one or more spaces, [tagged expression]something that isn't a space[end tagged expression]"
Replace it with .  \1
Here's a sed example:
$ echo '>zzz. xxx. yyy.<' | sed -r -e 's/\. +([^ ])/.  \1/g'
>zzz.  xxx.  yyy.<
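Because the pattern uses " +" rather than a single space, it also collapses runs of three or more spaces down to two. The sketch below illustrates that. The name normalize_spacing is hypothetical, and -E is used in place of -r on the assumption that your sed accepts it (GNU and BSD sed both do).

```shell
# normalize_spacing is a hypothetical helper name.
normalize_spacing() {
  # A dot, then one or more spaces, then a non-space character:
  # rewrite as a dot, exactly two spaces, and the captured character.
  sed -E 's/\. +([^ ])/.  \1/g'
}

printf '%s\n' 'a. b.  c.   d.' | normalize_spacing
# prints: a.  b.  c.  d.
```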
Upvotes: 1
Reputation: 96018
You can use perl:
perl -pe 's-\. (?! )-.  -g' test
Example:
$ cat test
This is. A simple. Test to check. That it works!
$ perl -pe 's-\. (?! )-\. -g' test
This is.  A simple.  Test to check.  That it works!
The regex \. (?! ) matches a period followed by a single space that is not itself followed by another space.
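Since the question is about a whole set of XML files, the same substitution can be applied in place with perl's -i switch. This is only a sketch: the function name fix_files and the .bak backup suffix are my own choices, and it is worth trying on a copy of the files first.

```shell
# fix_files is a hypothetical helper name; pass it the files to edit.
fix_files() {
  # -i.bak edits each file in place and keeps the original as <name>.bak
  # -p loops over the input lines and prints each one after the substitution
  perl -i.bak -pe 's/\. (?! )/.  /g' "$@"
}

# Example call, left commented out so nothing is modified by accident:
# fix_files *.xml
```

One difference from the sed answers: the lookahead only checks that the next character is not a space, so a period followed by a single trailing space at the end of a line would also gain a second space.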
Upvotes: 0