Reputation: 171
So for matching all words in a page of text, I'm using this:
new RegExp("([a-zA-Z0-9\-]+)","ig");
The issue is, some of the things I need to match might be two words, like "green tea" for instance. So I tried this:
var pattern = new RegExp("([a-zA-Z0-9\-?]+\\s[a-zA-Z0-9\-_]+)","ig");
but the issue is that it doesn't match every possible two-word combination, so it might match "in green" and "tea leaves" instead. I think that's how it works, at least; all I know is that it doesn't match "green tea".
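To make that concrete, here is a minimal check. The sample sentence is made up for illustration, and the pattern is the same one as above written as a regex literal. It suggests that matches are found left to right and never overlap, so once "in green" is consumed, "green" is no longer available to start "green tea":

```javascript
// Same two-word pattern as above, written as a regex literal.
const pairPattern = /([a-zA-Z0-9\-?]+\s[a-zA-Z0-9\-_]+)/ig;

// Hypothetical sample text, chosen so "green" lands as the second word of a pair.
const sample = "Soak it in green tea leaves";

// Global matches are found left to right and never overlap.
const pairs = sample.match(pairPattern);
console.log(pairs); // [ 'Soak it', 'in green', 'tea leaves' ]
```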
Upvotes: 0
Views: 69
Reputation: 1074138
There's no magic pill here, because there's no way for a regex engine to know that the words "green tea" go together but "in green" don't, so you'll need to list all of the word combinations you want it to treat as a unit — or do something before-or-after-the-fact instead.
For instance, this will match words in text but treat "green tea" as a single match:
var rex = /(green tea)|([a-zA-Z0-9\-']+)/ig;
var str = "I like green tea, don't you?";
console.log(str.match(rex));
The | is an alternation meaning "try to match any of these alternatives" (earlier alternatives are preferred to later ones).
Obviously that would get cumbersome really quickly, though, so you may need to look beyond regex, either pre-processing or post-processing to handle your list of desired two-word "words."
Note: I added ' to the character class in the second half of that, since otherwise "don't" was read as "don" and "t".
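One way to keep the list manageable is to generate the alternation from an array of phrases rather than typing it by hand. The following is only a sketch: escapeRegExp and buildTokenizer are hypothetical helper names, the phrase list is made up, and it assumes every phrase starts and ends with a letter or digit so that \b behaves as expected.

```javascript
// Escape regex metacharacters so each phrase is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a tokenizer from a list of multi-word phrases.
function buildTokenizer(phrases) {
  const word = "[a-zA-Z0-9\\-']+";
  if (phrases.length === 0) {
    return new RegExp(word, "ig"); // no phrases: plain word matching
  }
  const alternatives = phrases
    .slice()
    // Longest first, since earlier alternatives are preferred.
    .sort((a, b) => b.length - a.length)
    // Allow any run of whitespace between the words of a phrase.
    .map((p) => p.trim().split(/\s+/).map(escapeRegExp).join("\\s+"));
  // \b keeps "green teapot" from matching the phrase "green tea".
  return new RegExp("\\b(?:" + alternatives.join("|") + ")\\b|" + word, "ig");
}

const tokenizer = buildTokenizer(["green tea", "black tea"]);
const tokens = "I like green tea, don't you?".match(tokenizer);
console.log(tokens); // [ 'I', 'like', 'green tea', "don't", 'you' ]
```

The phrases are tried before the single-word fallback, so a listed phrase wins whenever it is present and everything else falls through to ordinary word matching.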
Upvotes: 2
Reputation: 35159
First, as always, regex101 is your friend! :)
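Second, [a-zA-Z0-9] is nearly equivalent to \w (\w also matches the underscore). And if you want to add dashes and question marks to your definition of a 'word' (as it appears you do), you can use [\w?-]; putting the hyphen last keeps it from being read as a range.
Finally, you probably want a non-capturing group like this:
/((?:[\w?-]+(?:\s[\w?-]+)*))/g
which says "find a word, followed by zero or more 'whitespace character + word' groups". If you build it from a string with new RegExp instead, double the backslashes (\\w, \\s).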
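As a quick check of what that pattern does in JavaScript, here is a small sketch. The sample sentence is borrowed from the other answer, and the hyphen is written last in the class so it is unambiguously literal. Note that the pattern greedily groups every whitespace-separated run of words, so you get whole phrases rather than pairs and would still need to narrow it down to the combinations you care about:

```javascript
// A word, followed by zero or more "whitespace character + word" groups.
const runPattern = /((?:[\w?-]+(?:\s[\w?-]+)*))/g;

const runs = "I like green tea, don't you?".match(runPattern);
// The comma and the apostrophe are not in the class, so they end a run.
console.log(runs); // [ 'I like green tea', 'don', 't you?' ]
```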
Tweak in regex101 to taste.
Hope this helps!
Upvotes: 1