Regular expression to match a character set but negate a sequence

Question

I'm trying to match a sequence of separators, but skip a hyphen whenever it has a letter character both before and after it.

For example, [\u002D\u0020] will match all spaces and hyphens in:

I have wi-fi

However, I don't want the hyphen in wi-fi to match, since it has a letter character before and after it (i.e. it fits \w+\u002D\w+).

How do I negate a sequence while matching a character set? Also, is \w limited to Latin letter characters? Is the engine aware of other scripts and cultures, Arabic and Turkish for example?

EDIT: To explain further what I'm trying to achieve: I want to collect all punctuation and specific characters (e.g. -+#$%, etc.) from a sentence and ignore all words.

Whenever there's a hyphenated word (e.g. state-of-the-art), I wish to ignore the whole word. For "this is# a %state-of-the-art design", I intend to get the following collection: "#, %".

Wiktor Stribiżew · Accepted Answer

Try matching all hyphenated words first, and match and capture non-word chars in all other contexts. Using XRegExp:

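// Requires the XRegExp library with Unicode (\p{...}) support loaded.
var s = "this is# a %statè-òf-thè-árt or state-of-the-art design";
// Backslashes must be doubled inside a string literal, otherwise "\p" collapses to "p".
var rx = XRegExp("\\p{L}+(?:-\\p{L}+)+|([^\\p{L}\\p{N}_ ])", "g");
var res = [];
XRegExp.forEach(s, rx, function(match, i) {
    // Group 1 is only set when the second alternative matched.
    if (match[1]) res.push(match[1]);
});
console.log(res);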

The pattern matches:

\p{L}+(?:-\p{L}+)+ - one or more letters (\p{L}+) followed by one or more sequences of a - and one or more letters
| - or
([^\p{L}\p{N}_ ]) - Group 1 capturing one char other than space, _, letters (\p{L}) and digits (\p{N}).
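
On the \w part of the question: in JavaScript, \w without extra flags covers only [A-Za-z0-9_], which is why the pattern uses \p{L} and \p{N} instead. The check below is a small sketch of the difference, assuming an engine that supports Unicode property escapes (ES2018 or later). The sample words are illustrative and not from the question. Matching here is by Unicode property, not by locale.

```javascript
// \w without extra flags is ASCII-only: [A-Za-z0-9_].
const asciiWord = /^\w+$/;
// \p{L} needs the "u" flag and matches a letter from any script.
const anyLetters = /^\p{L}+$/u;

// Illustrative samples: plain ASCII, accented Latin, Arabic, Turkish.
const samples = ["state", "stat\u00E8", "\u0633\u0644\u0627\u0645", "\u0131\u015F\u0131k"];
for (const word of samples) {
  // Prints the word, then whether \w+ and \p{L}+ match it in full.
  console.log(word, asciiWord.test(word), anyLetters.test(word));
}
```

Only the plain ASCII sample should pass the \w test, while all four should pass the \p{L} test.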

Only the contents of Group 1 should be pushed to the resulting array. When the first alternative matches a hyphenated word, Group 1 is empty and the match is skipped.
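
If XRegExp is not available, the same idea can probably be expressed with a native regex. This is a sketch that assumes an ES2018+ engine (for \p{...} with the "u" flag) and String.prototype.matchAll (ES2020). The function name collectSeparators is made up for illustration.

```javascript
// Hypothetical helper: returns the non-word characters of `text`,
// skipping anything that is part of a hyphenated word.
function collectSeparators(text) {
  // Alternative 1 consumes whole hyphenated words, so their hyphens
  // are never offered to alternative 2, which captures into Group 1.
  const rx = /\p{L}+(?:-\p{L}+)+|([^\p{L}\p{N}_ ])/gu;
  const found = [];
  for (const m of text.matchAll(rx)) {
    // Group 1 is undefined when a hyphenated word was matched.
    if (m[1] !== undefined) found.push(m[1]);
  }
  return found;
}

console.log(collectSeparators("this is# a %state-of-the-art design"));
```

For the sentence from the question this should leave only # and % in the array. A hyphen that is not between letters, as in "a - b", is still collected.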
