pitamer

Reputation: 1064

Match words that consist of specific characters, excluding between special brackets

I'm trying to match words that consist only of characters in this character class: [A-z'\\/%], excluding words that sit between [ and ], { and }, or < and >.

So, say I've got this funny string:

[beginning]<start>How's {the} /weather (\\today%?)[end]

I need to match the following strings:

[ "How's", "/weather", "\\today%" ]

I've tried using this pattern:

/[A-z'/\\%]*(?![^{]*})(?![^\[]*\])(?![^<]*>)/gm

But for some reason, it matches:

[ "[beginning]", "", "How's", "", "", "", "/weather", "", "", "\\today%", "", "", "[end]", "" ]

I'm not sure why my pattern allows stuff between [ and ], since I used (?![^\[]*\]), and a similar approach seems to work for not matching {these cases} and <these cases>. I'm also not sure why it matches all the empty strings.

Any wisdom? :)

Upvotes: 8

Views: 314

Answers (3)

The fourth bird

Reputation: 163207

You can match all the cases that you don't want using an alternation and place the character class in a capturing group to capture what you want to keep.

A character class that starts with [^ is a negated character class: it matches any single character except the ones listed inside it.

(?:\[[^\][]*]|<[^<>]*>|{[^{}]*})|([A-Za-z'/\\%]+)

Explanation

  • (?: Non capture group
    • \[[^\][]*] Match from opening till closing []
    • | Or
    • <[^<>]*> Match from opening till closing <>
    • | Or
    • {[^{}]*} Match from opening till closing {}
  • ) Close non capture group
  • | Or
  • ([A-Za-z'/\\%]+) Repeat the character class 1+ times to prevent empty matches and capture in group 1

const regex = /(?:\[[^\][]*]|<[^<>]*>|{[^{}]*})|([A-Za-z'/\\%]+)/g;
const str = `[beginning]<start>How's {the} /weather (\\\\today%?)[end]`;
let m;

while ((m = regex.exec(str)) !== null) {
  if (m[1] !== undefined) console.log(m[1]);
}

Upvotes: 1

Taufik Nurrohman

Reputation: 3409

Split it with a regular expression:

let data = "[beginning]<start>How's {the} /weather (\\today%?)[end]";
let matches = data.split(/\s*(?:<[^>]+>|\[[^\]]+\]|\{[^\}]+\}|[()])\s*/);

console.log(matches.filter(v => "" !== v));
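Note that splitting on the brackets alone leaves the trailing ? attached to the last piece, because ? is not one of the separators. A minimal sketch of one way to tidy that up, assuming the allowed characters are the ones from the question (the names text, pieces, allowed and cleaned are made up for this example):

```javascript
// Same input as above; in a double-quoted literal, "\\" is a single backslash.
const text = "[beginning]<start>How's {the} /weather (\\today%?)[end]";

// Split on bracketed groups and parentheses, as in the snippet above.
const pieces = text.split(/\s*(?:<[^>]+>|\[[^\]]+\]|\{[^\}]+\}|[()])\s*/);

// From each piece, keep only runs of the allowed characters.
// match() returns null when nothing matches, so fall back to an empty array.
const allowed = /[A-Za-z'/\\%]+/g;
const cleaned = pieces.flatMap(piece => piece.match(allowed) ?? []);

console.log(cleaned);
```

This also separates words that are divided only by whitespace, which the plain split would leave joined in one piece.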

Upvotes: 1

41686d6564

Reputation: 19641

There are essentially two problems with your pattern:

  1. Never use A-z in a character class if you intend to match only letters, because it will match more than just letters (see footnote 1). Instead, use a-zA-Z (or A-Za-z).

  2. Using the * quantifier after the character class will allow empty matches. Use the + quantifier instead.
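
To see the second point in isolation, here is a small sketch on a made-up input (the string "ab cd" and the variable names are purely illustrative, not from the question):

```javascript
// Illustrative input, not taken from the question.
const sample = "ab cd";

// With *, the pattern also succeeds with a zero-length match
// at every position where no letter is available.
const withStar = sample.match(/[a-z]*/g);

// With +, at least one character is required, so empty matches disappear.
const withPlus = sample.match(/[a-z]+/g);

console.log(withStar); // [ 'ab', '', 'cd', '' ]
console.log(withPlus); // [ 'ab', 'cd' ]
```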

So, the fixed pattern should be:

[A-Za-z'/\\%]+(?![^{]*})(?![^\[]*\])(?![^<]*>)
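
A quick way to try the fixed pattern against the string from the question. This is only a sketch: the variable names are made up, and String.raw is used so that the two backslashes before "today" stay exactly as typed:

```javascript
// The fixed pattern, with the g flag so that all matches are returned.
const fixedPattern = /[A-Za-z'/\\%]+(?![^{]*})(?![^\[]*\])(?![^<]*>)/g;

// String.raw keeps backslashes literal, so this contains two backslashes.
const input = String.raw`[beginning]<start>How's {the} /weather (\\today%?)[end]`;

const words = input.match(fixedPattern);
console.log(words);
```

Keep in mind that the lookaheads only check whether a closing bracket follows before the next opening one, so this approach assumes the brackets are balanced and not nested. A stray ] later in the string would suppress matches that come before it.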


1. The [A-z] character class means "match any character with an ASCII code between 65 and 122". The problem with that is that the codes from 91 to 96 are not letters; they are [, \, ], ^, _ and `. That's why the original pattern matches characters like '[' and ']'.
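
The extra characters can be listed with a short loop. This is just an illustrative check (the variable names are made up), comparing [A-z] against [A-Za-z] over the ASCII range:

```javascript
// Collect every ASCII character that [A-z] matches but [A-Za-z] does not.
const extras = [];
for (let code = 0; code < 128; code++) {
  const ch = String.fromCharCode(code);
  if (/[A-z]/.test(ch) && !/[A-Za-z]/.test(ch)) {
    extras.push(ch);
  }
}

console.log(extras.join(" ")); // [ \ ] ^ _ `
```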

Upvotes: 4
