pelican_george

Reputation: 971

Regular expression to match a character set but negate a sequence

I'm trying to match a sequence of separators, but skip a hyphen whenever it has a character both before and after it.

For example [\u002D\u0020] will match all spaces and hyphens.

I have wi-fi

However, I want wi-fi to not be a match, since the hyphen has a letter character before and after it (e.g. \w+\u002D\w+).

How do I negate a sequence while matching a character set? Also, is \w limited to Latin letter characters? Is the engine aware of other cultures, Arabic and Turkish for example?

EDIT: Just to explain further what I'm trying to achieve. I want to collect all punctuation and specific characters from a sentence and ignore all words (e.g. -+#$%, etc).

Whenever there's a hyphenated word (e.g. state-of-the-art), I wish to ignore the whole word. For "this is# a %state-of-the-art design", I intend to get the following collection: "#, %".

Upvotes: 2

Views: 701

Answers (1)

Wiktor Stribiżew

Reputation: 626950

Match all hyphenated words without capturing them, and match and capture non-word chars in all other contexts. With XRegExp:

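var s = "this is# a %statè-òf-thè-árt or state-of-the-art design";
// Alternative 1 consumes hyphenated words; alternative 2 captures a single separator char
var rx = XRegExp("\\p{L}+(?:-\\p{L}+)+|([^\\p{L}\\p{N}_ ])","g");
var res = [];
XRegExp.forEach(s, rx, function(match, i) {
    // Group 1 is only set when the second alternative matched
    if (match[1]) res.push(match[1]);
});
console.log(res);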
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/2.0.0/xregexp-all-min.js"></script>

The pattern matches:

  • \\p{L}+(?:-\\p{L}+)+ - one or more letters (\\p{L}+) followed by one or more sequences of - and one or more letters again
  • | - or
  • ([^\\p{L}\\p{N}_ ]) - Group 1, capturing one char other than a space, _, letters (\\p{L}) and digits (\\p{N}).
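
On the \w part of the question: in JavaScript, \w only covers the ASCII set [A-Za-z0-9_], which is why the pattern relies on \p{L} for letters. The sketch below is only an illustration of that difference. It uses native Unicode property escapes with the u flag (ES2018+) rather than XRegExp, and the helper names and sample words are made up for the demo:

```javascript
// Hypothetical helpers, just to compare the two character classes.
function isAsciiWord(text) {
  // \w is [A-Za-z0-9_] in JavaScript
  return /^\w+$/.test(text);
}

function isUnicodeLetters(text) {
  // \p{L} matches a letter from any script; it requires the u flag natively
  return /^\p{L}+$/u.test(text);
}

// Sample words: ASCII, accented Latin, Turkish, Arabic
const samples = ["wifi", "\u00e1rt", "\u015fehir", "\u0634\u0628\u0643\u0629"];
for (const word of samples) {
  console.log(word, isAsciiWord(word), isUnicodeLetters(word));
}
```

Note that \p{L} does not include combining marks, so text in a decomposed form (a base letter plus a separate accent) may need normalizing first, for example with String.prototype.normalize("NFC").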

Only the contents of Group 1 should be pushed to the resulting array.
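
If loading XRegExp is not an option, the same approach can probably be written with a native regex, assuming an engine that supports Unicode property escapes and String.prototype.matchAll (ES2018/ES2020). This is a minimal sketch, and collectSeparators is a hypothetical helper name, not part of any library:

```javascript
// Hypothetical helper: returns every separator char that is not part of a hyphenated word.
function collectSeparators(text) {
  // Alternative 1 consumes hyphenated words so their hyphens are never captured.
  // Alternative 2 captures one char that is not a letter, digit, underscore or space.
  const rx = /\p{L}+(?:-\p{L}+)+|([^\p{L}\p{N}_ ])/gu;
  const result = [];
  for (const match of text.matchAll(rx)) {
    // match[1] is undefined when the hyphenated-word alternative matched
    if (match[1] !== undefined) result.push(match[1]);
  }
  return result;
}

console.log(collectSeparators("this is# a %state-of-the-art design"));
```

A hyphen that stands alone (as in "wi-fi - really") is still captured, because only hyphens between letters are consumed by the first alternative.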

Upvotes: 1
