KyleMit

Reputation: 29869

Match all characters up until a word boundary

Based off Regex Until But Not Including, I'm trying to match all characters up until a word boundary.

For example, matching apple in the following string:

apple<

I'm doing that like this:

/a[^\b]+/

This should look for an "a" and then grab one or more characters that are not a word boundary, so I would expect it to stop before the <, which is at the end of the word.

Demo in Regexr

Demo in StackSnippets

var input = [ "apple<", "apple/" ];
var myRegex = /a[^\b]+/;

for (var i = 0; i < input.length; i++) {
  console.log(myRegex.exec(input[i]));  
}

A couple of other regex strings I tried:

I can use a negated word boundary or a negated set with a regular word boundary:

I can specify several possible word ending characters and use them in a negated set:

I can also look for a positive set and just restrict it to match regular letters:

But I'd like to know how to do it for a word boundary if that's possible.

Here's MDN's listing for the word boundary and the characters that constitute it.

Upvotes: 0

Views: 3098

Answers (3)

Jason Cust

Reputation: 10899

If this rewording of the question is accurate (match all words beginning with 'a'), then existing SO answers like this one would have been a good place to start. Distilling that down, you could use the word character class \w and, to make it a bit more bulletproof, a preceding word boundary \b so that partial words containing an 'a', such as 'baggage', do not match: /\ba\w+/gi

var input = [ "apple<", "apple/", "baggage;" ];
var myRegexWord = /\ba\w+/i;
var myRegexPartial = /a\w+/;

for (var i = 0; i < input.length; i++) {
  console.log(myRegexWord.exec(input[i]));  
  console.log(myRegexPartial.exec(input[i]));  
}
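The snippet above calls exec without the g flag, so it returns only the first hit per string. To collect every word beginning with 'a' in a longer text, the g flag from /\ba\w+/gi can be combined with String.prototype.match. A minimal sketch, using a made-up sample sentence:

```javascript
// Hypothetical sample text, not taken from the question.
var sentence = "An apple and a banana are available";

// g returns all matches, i makes the leading "a" case-insensitive.
var wordsStartingWithA = sentence.match(/\ba\w+/gi);

console.log(wordsStartingWithA);
// [ "An", "apple", "and", "are", "available" ]
```

The lone "a" is skipped because \w+ needs at least one more word character after it, and "banana" is skipped because none of its a's sit at a word boundary.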

Upvotes: 1

Touffy

Reputation: 6561

Word boundaries (\b) are not characters, but the empty string between a word character and a non-word character (or the start or end of the string). Moreover, since Unicode support is still lacking in JavaScript, "word character" means only ASCII letters, digits and the underscore.

Because of that, you

  • generally shouldn't use \b unless your data is some kind of computer language that can't possibly include Unicode
  • can't apply quantifiers to \b (an empty string times 10 is still one empty string)
  • can't negate \b (it's not a character set, so it has no complement)
  • can't include \b in a character set (in square brackets) since, again, it's not a character or character set; inside brackets, \b is read as the backspace character (U+0008) instead
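A quick check of how this plays out with the pattern from the question: within brackets the engine treats \b as a backspace, so the negated set excludes only that one character. A minimal sketch (the variable names are illustrative):

```javascript
// Inside a character class, \b is the backspace character (U+0008),
// not a word boundary.
var backspaceOnly = /^[\b]$/;
console.log(backspaceOnly.test("\u0008")); // true
console.log(backspaceOnly.test("a"));      // false

// So the question's pattern consumes every non-backspace character,
// including the trailing "<".
var questionRegex = /a[^\b]+/;
var questionMatch = questionRegex.exec("apple<")[0];
console.log(questionMatch); // "apple<"
```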

Since \b doesn't actually add any characters to the match, you can safely append it to your regex:

/.+?\b/

will match all characters up until the first word boundary. It's in fact a superset of:

/\w+/

which is probably what you want, since you're interested only in the words, not the stuff in between.
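The difference between the two patterns only shows when the input starts on a non-word character. A small sketch comparing them (the sample strings and variable names are made up for illustration):

```javascript
var lazyToBoundary = /.+?\b/;
var wordOnly = /\w+/;

// Starting on a word character, both stop at the same boundary.
var lazyOnWord = lazyToBoundary.exec("apple<")[0];
var wordOnWord = wordOnly.exec("apple<")[0];
console.log(lazyOnWord); // "apple"
console.log(wordOnWord); // "apple"

// Starting on a non-word character, .+?\b stops at the first boundary
// it reaches (between "<" and "a"), while \w+ skips ahead to the word.
var lazyOnTag = lazyToBoundary.exec("<apple>")[0];
var wordOnTag = wordOnly.exec("<apple>")[0];
console.log(lazyOnTag); // "<"
console.log(wordOnTag); // "apple"
```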

Upvotes: 6

Federico Piazza

Reputation: 30985

You have to include the word boundary as part of your regex like this:

/[A-Za-z]+\b/

Working demo

You could also use:

\w+\b

Although this will also treat digits and the underscore as part of your word.
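The underscore difference shows up on identifiers such as snake_case. A small sketch of both patterns side by side, with made-up sample input:

```javascript
var lettersOnly = /[A-Za-z]+\b/;
var wordChars = /\w+\b/;

// On a plain word both patterns agree.
var lettersPlain = lettersOnly.exec("apple<")[0];
var wordPlain = wordChars.exec("apple<")[0];
console.log(lettersPlain); // "apple"
console.log(wordPlain);    // "apple"

// "_" is a word character, so there is no boundary between "snake" and "_".
// The letters-only pattern therefore only succeeds on the trailing "case".
var lettersSnake = lettersOnly.exec("snake_case<")[0];
var wordSnake = wordChars.exec("snake_case<")[0];
console.log(lettersSnake); // "case"
console.log(wordSnake);    // "snake_case"
```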

Upvotes: 1
