Reputation: 29869

Match all characters up until a word boundary

Based off Regex Until But Not Including, I'm trying to match all characters up until a word boundary.

For example - matching apple in the following string:

apple<

I'm doing that using:

a negated set [^]
with a word boundary \b
and a plus + repeater

Like this:

/a[^\b]+/

Which should look for an "a" and then grab one or more matches for any character that is not a word boundary. So I would expect it to stop before <, which is at the end of the word.

Demo in Regexr

Demo in StackSnippets

var input = [ "apple<", "apple/" ];
var myRegex = /a[^\b]+/;

for (var i = 0; i < input.length; i++) {
  console.log(myRegex.exec(input[i]));  
}

A couple of other regex strings I tried:

I can use a negated word boundary or a negated set with a regular word boundary:

/a[\B]+/
/a[^\b]+/

I can specify several possible word ending characters and use them in a negated set:

/a[^|"<>\-\\\/;:,.]+/

I can also look for a positive set and just restrict it to return regular letters:

/a[\w]+/
/a[a-zA-Z]+/

But I'd like to know how to do it for a word boundary if that's possible.

Here's MDN's listing of the word boundary and the characters that constitute it

Upvotes: 0

Answers (3)

Jason Cust

Reputation: 10899

If this rewording of the question is accurate (match all words beginning with 'a'), then you might have begun the search with existing SO answers like this one. Distilling that down, you could use the word character class \w and, to make it a bit more bulletproof, precede it with a word boundary \b so that you don't match partial words that merely contain an 'a', such as 'baggage': /\ba\w+/gi

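var input = [ "apple<", "apple/", "baggage;" ];
// Anchored at a word boundary: matches only words that start with "a".
var myRegexWord = /\ba\w+/i;
// Not anchored: also matches from an "a" in the middle of a word.
var myRegexPartial = /a\w+/;

for (var i = 0; i < input.length; i++) {
  console.log(myRegexWord.exec(input[i]));    // null for "baggage;"
  console.log(myRegexPartial.exec(input[i])); // matches "aggage" in "baggage;"
}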

Upvotes: 1

Touffy

Reputation: 6561

Word boundaries (\b) are not characters, but the empty string between a word character (\w: a letter, digit or underscore) and any non-word character, or the start or end of the string. Moreover, since \w and \b in JavaScript only know about ASCII, "letter" means only ASCII letters.
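A quick way to see the ASCII-only behavior is the sketch below. The sample string "café" is an assumption chosen for illustration and does not come from the question.

```javascript
// \w only covers ASCII letters, digits and underscore, so "é" is a non-word character.
const asciiWord = /\w+/.exec("café")[0];

// Because "é" is a non-word character, \b sees a boundary between "f" and "é".
const boundaryBeforeAccent = /caf\b/.test("café");

console.log(asciiWord);            // "caf"
console.log(boundaryBeforeAccent); // true
```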

Because of that, you

generally shouldn't use \b unless your data is some kind of computer language that can't possibly include Unicode
can't apply quantifiers to \b (an empty string times 10 is still one empty string)
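can't negate \b with a negated set (it's not a character set, so it has no complement; \B is a separate assertion that matches wherever \b does not)
can't include \b in a character set (in square brackets) since, again, it's not a character or character set; inside square brackets, \b means the backspace character instead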
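To illustrate the last point, here is a minimal sketch of what \b does inside square brackets in JavaScript, using the pattern from the question. The variable names are only for illustration.

```javascript
// Inside square brackets, \b is the backspace character (U+0008),
// not the word-boundary assertion.
const backspaceClass = /[\b]/;
const matchesBackspace = backspaceClass.test("\u0008");
const matchesWordEdge = backspaceClass.test("apple<");

// So [^\b]+ reads as "one or more characters that are not a backspace",
// and the pattern from the question runs straight past the end of the word.
const questionPattern = /a[^\b]+/;
const questionMatch = questionPattern.exec("apple<")[0];

console.log(matchesBackspace); // true
console.log(matchesWordEdge);  // false
console.log(questionMatch);    // "apple<"
```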

Since \b doesn't actually add any characters to the match, you can safely append it to your regex:

/.+?\b/

will match all characters up until the first word boundary. It's in fact a superset of:

/\w+/

which is probably what you want, since you're interested only in the words, not the stuff in between.
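As a rough comparison of the two patterns, the sketch below runs both on the question's string and on a variant where the punctuation comes first. The second sample is an assumption added for contrast.

```javascript
const lazyUntilBoundary = /.+?\b/; // any characters, as few as possible, up to a boundary
const wordOnly = /\w+/;            // word characters only

const samples = ["apple<", "<apple"];
const results = samples.map(function (s) {
  return {
    input: s,
    lazy: lazyUntilBoundary.exec(s)[0],
    word: wordOnly.exec(s)[0]
  };
});

console.log(results[0]); // { input: 'apple<', lazy: 'apple', word: 'apple' }
console.log(results[1]); // { input: '<apple', lazy: '<', word: 'apple' }
```

The second sample shows where the patterns differ: the lazy pattern happily returns the "stuff in between" when that comes first, while /\w+/ skips it.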

Upvotes: 6

Federico Piazza

Reputation: 30985

You have to include the word boundary as part of your regex like this:

/[A-Za-z]+\b/

Working demo

You could also use:

\w+\b

Although this will also include digits and the underscore as part of your word.
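A small sketch of how the two patterns differ once an underscore is involved. The input "snake_case;" is a made-up example, not taken from the question.

```javascript
const lettersOnly = /[A-Za-z]+\b/;
const wordChars = /\w+\b/;

// On the question's input both patterns agree.
const plainLetters = lettersOnly.exec("apple<")[0];
const plainWord = wordChars.exec("apple<")[0];

// With an underscore, \w+ keeps going, while the letters-only pattern cannot
// end at "snake" (there is no boundary between "e" and "_") and instead
// finds "case" later in the string.
const underscoreLetters = lettersOnly.exec("snake_case;")[0];
const underscoreWord = wordChars.exec("snake_case;")[0];

console.log(plainLetters, plainWord);           // apple apple
console.log(underscoreLetters, underscoreWord); // case snake_case
```

If a leading part such as "snake" should not be skipped silently, a plain /[A-Za-z]+/ without the trailing \b may be closer to what is wanted.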

Upvotes: 1
