Azevedo

Reputation: 2189

RegEx consecutive matches

I have this regex in JavaScript to remove words of 3 letters or fewer:

srcText = srcText.replace(/\s[a-z]{1,3}\s/gi,'');

It works, but when two consecutive matches are found, the second isn't affected:

Ex.:

"... this is one sample of a text ... "

' one ' and ' a ' won't be affected unless I run the code one more time:

srcText = srcText.replace(/\s[a-z]{1,3}\s/gi,'');

So I'd have to run the code n times, n being the number of consecutive matches in srcText.

For testing purposes:

http://regexpal.com/

Sample text:

http://www.gutenberg.org/files/521/521-0.txt (say, 4th paragraph)

Is my regex missing something, or does JavaScript not allow this kind of recursion?

Upvotes: 3

Views: 729

Answers (2)

Dave

Reputation: 46359

JavaScript's regular expressions (and most others too) support the \b escape sequence, which matches (zero-width) word boundaries. In your expression, simply replace the two \s with \b and it will work.

Note that "word boundary" also applies around dashes, dots, etc. So this-test - more. will have boundaries at: |this|-|test| - |more|. Usually this is desirable, but it is a difference in behaviour from \s which is worth knowing about.
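One way to see where those boundaries fall is to replace every \b position with a visible marker. This is only a small sketch; the variable names are made up for illustration:

```javascript
// Mark every zero-width word boundary with "|" to make it visible.
const sample = "this-test - more.";
const marked = sample.replace(/\b/g, "|");
console.log(marked); // |this|-|test| - |more|.
```

Nothing from the original string is lost, because \b matches a position rather than a character.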

As noted by Sam in the comments, a word boundary is identified as:

(^\w|\w\W|\W\w|\w$)

That is, a non-word character followed by a word character, or a word character followed by a non-word character, where the start and end of the string are treated as non-word characters. Note, however, that \b is zero-width, so it isn't just a shorthand for that expression.

Upvotes: 6

Evan Kennedy

Reputation: 4185

The regular expression is failing because it requires a whitespace character on both sides of each word, and regex matches cannot overlap. The expression looks for a space, a word of 1-3 letters, then another space. It finds the first match at ' is '. Since the space after 'is' is consumed by that match, ' one ' cannot match because there is no space left in front of it. The regex matches like this:

... this[ is ]one sample[ of ]a text ...   (matches shown in brackets)
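To check this on the sample string, the matches can be listed together with their start positions. A minimal sketch, assuming an environment that supports String.prototype.matchAll; the names found and firstPass are illustrative only:

```javascript
const srcText = "... this is one sample of a text ... ";

// Collect every match of the original pattern with its start index.
const found = [...srcText.matchAll(/\s[a-z]{1,3}\s/gi)].map(m => [m.index, m[0]]);
console.log(found); // [[8, " is "], [22, " of "]]

// A single replace pass therefore removes only those two matches.
const firstPass = srcText.replace(/\s[a-z]{1,3}\s/gi, "");
console.log(firstPass); // "... thisone samplea text ... "
```

Note that replacing with an empty string also removes both surrounding spaces, so the neighbouring words end up glued together.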

An easy way to fix this is to change \s to \b. \b matches at a word boundary, such as the position between a space and a letter, but it is zero-width, so it does not consume the neighbouring character. The regular expression \b[a-z]{1,3}\b would therefore match like this:

... this [is] [one] sample [of] [a] text ...   (matches shown in brackets)

This now finds every word of one to three letters and can be used like this to remove them all:

> var srcText = "... this is one sample of a text ... ";
> srcText = srcText.replace(/\b[a-z]{1,3}\b/gi,'');
  "... this   sample   text ... "

However...

This leaves extra spaces where words have been removed. If you want those spaces removed too, and you are certain the text has no irregular spacing, use something that matches the whitespace after each word but not before it. That way one space is removed for every word removed. The regex would look like: \b[a-z]{1,3}\s
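Here is a short sketch of that variant on the sample string. The second input, "go to it.", is a made-up example showing one limitation: a short word followed directly by punctuation or the end of the string is left alone, because no whitespace follows it.

```javascript
const srcText = "... this is one sample of a text ... ";

// Remove each short word together with the single whitespace character after it.
const cleaned = srcText.replace(/\b[a-z]{1,3}\s/gi, "");
console.log(cleaned); // "... this sample text ... "

// Limitation: "it" is followed by "." rather than whitespace, so it survives.
const edgeCase = "go to it.".replace(/\b[a-z]{1,3}\s/gi, "");
console.log(edgeCase); // "it."
```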

If you need something more complex, let me know.

Upvotes: 1
