Reputation: 2813
I'm working on a news aggregator, and my local news RSS feed has some issues with how content is sent. For example, any time there is a quote, the space between a full stop and the first letter of the next sentence is removed: like.This.
I tried using str.replace('.', '. '); however, because there is sometimes already a space, you end up with two spaces in some sentences. How can I normalise the number of spaces?
Another issue is that this is for a long article, so ideally it needs to be quite fast (either that, or I'll just have to implement it async).
Upvotes: 2
Views: 3212
Reputation: 42109
Look for a period (\.) followed by 0 or more spaces ([ ]*):
str = str.replace( /\.(?=[^\d])[ ]*/g , '. ')
\. matches a literal period; it must be escaped with a backslash (\), otherwise it is a pattern for any character.
(?=[^\d]) looks ahead, without consuming it, at the character following the period; in this case we want to make sure the next character is not a number, in order to avoid putting a space in the middle of a number (e.g., 3.4 or just .5).
[ ] matches any character inside the square brackets, in this case a space. I put it in square brackets because some devices might use different encodings, resulting in different space characters to match. They may look the same on the screen but have different matching values; for instance, a Unicode space character. When that occurs, you'll have to add the new space character to the brackets with a copy/paste.
* (asterisk) denotes a 0-or-more match; a plus (+) would mean 1-or-more (must exist at least once). This is important in case you have 2 or 3 spaces following a period. We use the asterisk instead of the plus to cover cases where there is no space after the sentence (e.g., "Sentence.Next sentence.").
g at the end means match globally: search the whole string and apply the replacement to every match found; otherwise it stops at the first period.
'. ' at the end is the replacement, in this case one period (or full stop) followed by a space.
Regarding your requirement, fast is a relative term. It used to be that completing a task within a day was fast, then hours was fast, etc. It all depends on what you consider fast. In this case, the amount of material, memory, and processing power will affect the time to process; but I'd say, in general, it is fast.
var demo = document.getElementById('demo'),
out = document.getElementById('out');
out.textContent = demo.textContent.replace(/\.(?=[^\d])[ ]*/g, '. ');
<div><pre id="demo" style="white-space:pre-wrap">This is a sentence followed by multiple spaces. Followed by no spaces.That contains the number 1.0, which we don't want to separate.With no space before it.</pre></div>
<div><pre id="out" style="white-space:pre-wrap"></pre></div>
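The point about look-alike space characters can be sketched as follows. This is only an illustration: the helper name normaliseSpaces and the choice of the non-breaking space (U+00A0) as the extra character are assumptions, not something taken from the feed in question.

```javascript
// Hypothetical helper: the same pattern as in this answer, with the
// non-breaking space (U+00A0) added to the bracket as an assumed example
// of an extra space character that might appear in the feed.
function normaliseSpaces(text) {
  return text.replace(/\.(?=[^\d])[ \u00A0]*/g, '. ');
}

// Sample text is made up for the demonstration.
var sample = 'One.Two.\u00A0Three.   Four costs 1.5 units.';
console.log(normaliseSpaces(sample));
// One. Two. Three. Four costs 1.5 units.
```

The final period is left alone because the lookahead needs a following character, and 1.5 is untouched because the character after its period is a digit.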
Upvotes: 4
Reputation: 253318
My own suggestion would be:
.replace(/\.(\S)/g, '. $1')
var input = document.querySelector('p.input'),
output = document.querySelector('p.output');
output.textContent = input.textContent.replace(/\.(\S)/g, '. $1');
<p class="input">Donec malesuada rhoncus massa, eu imperdiet tellus rhoncus ac.Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.</p>
<p class="output"></p>
This looks for, and captures (using the parentheses), a non-whitespace (\S) character following a period (\., escaped because the . character is special in regular expressions, representing 'any character'). It searches globally (g) throughout the supplied string and replaces each match with a period, a space, and the captured character ($1).
In the event that you might have to deal with decimal numbers in these strings, though, I'd amend the above to:
.replace(/\.([^\s\d])/g, '. $1')
var input = document.querySelector('p.input'),
output = document.querySelector('p.output');
output.textContent = input.textContent.replace(/\.([^\s\d])/g, '. $1');
<p class="input">Donec malesuada rhoncus massa, eu imperdiet tellus rhoncus ac.Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.</p>
<p class="output"></p>
This does exactly the same, but instead captures a character that is neither whitespace nor a digit ([^\s\d]) following the period character.
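To make the difference between the two patterns concrete, here is a small comparison on a made-up string containing a decimal number. The sample text and variable names are assumptions for illustration only.

```javascript
// Hypothetical sample string; the two patterns are the ones from this answer.
var text = 'Pi is about 3.14.Next sentence.';

// \S also matches digits, so the decimal gets split.
var anyNonSpace = text.replace(/\.(\S)/g, '. $1');

// [^\s\d] skips digits, so the decimal is preserved.
var nonSpaceNonDigit = text.replace(/\.([^\s\d])/g, '. $1');

console.log(anyNonSpace);      // Pi is about 3. 14. Next sentence.
console.log(nonSpaceNonDigit); // Pi is about 3.14. Next sentence.
```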
Upvotes: 4
Reputation: 12079
Using a regular expression:
str = str.replace(/\. ?/g, '. ');
The space after the (escaped) period is made optional by use of the question mark following it.
For a quick test, open this jsfiddle, open your browser console, and run:
http://jsfiddle.net/BloodyKnuckles/7wdv9csL/
If the number of spaces can be zero to any number then use:
str = str.replace(/\. */g, '. ');
In this case the asterisk (*) means there can be zero or more spaces.
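As a rough check of the zero-or-more pattern, the sketch below runs it on two made-up inputs. The function name spaceAfterPeriods and the sample strings are assumptions. It also shows two side effects that follow from the pattern as written: a decimal number is split, and a period at the very end of the string gains a trailing space.

```javascript
// Hypothetical wrapper around the pattern from this answer.
function spaceAfterPeriods(str) {
  return str.replace(/\. */g, '. ');
}

console.log(spaceAfterPeriods('One.Two.   Three. Four'));
// One. Two. Three. Four

console.log(JSON.stringify(spaceAfterPeriods('Version 3.4 is out.')));
// "Version 3. 4 is out. "
```

If either side effect matters for the feed, a pattern that excludes digits after the period may be a better fit.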
Upvotes: 0
Reputation: 6145
Try using this code to loop through each character in the string:
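for (var i = 0, len = str.length; i < len; i++) {
    // Only act on a period that is not the last character,
    // so no space is appended at the end of the string
    if (str[i] == '.' && i + 1 < len && str[i + 1] != ' ') {
        // Insert a single space right after the period
        str = str.slice(0, i + 1) + ' ' + str.slice(i + 1);
        len++; // the string grew by one character
    }
}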
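Since the question mentions long articles, one possible variation on the loop above is to walk the string once and collect the pieces in an array, rather than rebuilding the whole string on every insertion. This is a sketch under assumptions, not a measured optimisation: the function name addSpaceAfterPeriods is made up, and like the loop above it neither collapses runs of spaces nor protects decimal numbers.

```javascript
// Hypothetical helper: single pass over the input, joining the pieces at the end.
function addSpaceAfterPeriods(str) {
  var parts = [];
  for (var i = 0; i < str.length; i++) {
    parts.push(str[i]);
    // Add a space after a period only when another character follows
    // and that character is not already a space.
    if (str[i] === '.' && i + 1 < str.length && str[i + 1] !== ' ') {
      parts.push(' ');
    }
  }
  return parts.join('');
}

console.log(addSpaceAfterPeriods('First.Second. Third.'));
// First. Second. Third.
```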
Upvotes: -1