duck degen

Reputation: 1223

How to convert a PCRE2 regex to JavaScript?

This is the PCRE2 regexp:

(?<=hello )(?:[^_]\w++)++

Its intended use is against strings like the following:

Hello Bob (Marius) Smith. -> Match "Bob"

Hello Bob Jr. (Joseph) White -> Match "Bob Jr."

Hello Bob Jr. IInd (Paul) Jobs -> Match "Bob Jr. IInd"

You get the point.

Essentially there is a magic word, in this case "hello", followed by a first name, followed by a second name which is always between parens. First names could be anything really: a single word, a list of words followed by punctuation, and so on. Heck, look at the name of Elon Musk's kid (X Æ A-Xii) to see how weird names can get :)

Let's only assume ASCII, though. Æ is not in my targets :)

I'm at a loss on how to convert this regex to JS. The only viable solution I found was to use PCRE2-wasm on Node, which spins up a wasm virtual machine and sucks up 1 GB of resources just for that. That's insane.

Upvotes: 2

Views: 1461

Answers (3)

The fourth bird

Reputation: 163287

The ++ does not work because JavaScript does not support possessive quantifiers.
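If the possessive behaviour itself matters, a common workaround is to emulate it with a lookahead plus a backreference. The snippet below is only a sketch of that idea, not something taken from the question: the patterns and the sample string "abc1" are made up for illustration. It relies on lookaheads being atomic in JavaScript, so once the lookahead has captured a run of word characters the engine cannot go back and shorten it.

```javascript
// Backtracking version: \w+ first grabs "abc1", then gives back the final "1"
// so that \d can match.
const greedy = /^\w+\d/;

// Emulated possessive \w++ (illustrative pattern, not from the question):
// the lookahead captures the longest run of word characters, and \1 consumes
// exactly that run. Nothing is given back afterwards.
const possessive = /^(?=(\w+))\1\d/;

const sample = "abc1"; // made-up sample string
const greedyMatch = greedy.exec(sample);
const possessiveMatch = possessive.exec(sample);

console.log(greedyMatch && greedyMatch[0]); // "abc1"
console.log(possessiveMatch);               // null
```

For the names in the question the possessive behaviour does not change what is matched, which is why simply dropping the extra + is usually enough.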

Since the first name is followed by a second name that is always between parens, you might also use a capture group with a match instead of a lookbehind.

\b[Hh]ello (\w+.*?)\s*\([^()\s]+\)
  • \b[Hh]ello Match hello or Hello at a word boundary
  • ( Capture group 1
    • \w+.*? Match 1+ word chars followed by any char, as few as possible
  • ) Close group 1
  • \s*\([^()\s]+\) Match optional whitespace chars followed by ( till ), with 1+ chars other than parens or whitespace in between


const regex = /\b[Hh]ello (\w+.*?)\s*\([^()\s]+\)/;
["Hello Bob (Marius) Smith.",
  "Hello Bob Jr. (Joseph) White",
  "Hello Bob Jr. IInd (Paul) Jobs"
].forEach(s => {
  const m = s.match(regex);
  if (m) {
    console.log(m[1]);
  }
})

With the lookbehind, you might also match word characters followed by an optionally repeated non-capturing group matching whitespace chars followed by word characters or dots.

(?<=[Hh]ello )\w+(?:\s+[\w.]+)*
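A quick way to try the lookbehind variant against the strings from the question is sketched below. The variable names are just illustrative, and it assumes a runtime with lookbehind support (recent Node and current browsers).

```javascript
// Lookbehind variant: the whole match is the first name, no capture group needed.
const lookbehindRegex = /(?<=[Hh]ello )\w+(?:\s+[\w.]+)*/;

const samples = [
  "Hello Bob (Marius) Smith.",
  "Hello Bob Jr. (Joseph) White",
  "Hello Bob Jr. IInd (Paul) Jobs",
];

// m[0] is the full match; null means the string did not match at all.
const firstNames = samples.map((s) => {
  const m = s.match(lookbehindRegex);
  return m ? m[0] : null;
});

console.log(firstNames); // [ 'Bob', 'Bob Jr.', 'Bob Jr. IInd' ]
```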


Upvotes: 0

D M

Reputation: 7179

@Nils has the correct answer.

If you do need to expand your acceptable character set, you can use the following regex. The g, m, and i flags are set.

(?<=hello ).*(?=\([^\)]*?\))

Hello Bob (Marius) Smith.
Hello Bob Jr. (Joseph) White
Hello Bob Jr. IInd (Paul) Jobs
Hello X Æ A-Xii (Not Elon) Musk
Hello Bob ()) Jr. ( (Darrell) Black

Match Number   Characters   Matched Text
Match 1        6-10         Bob
Match 2        32-40        Bob Jr.
Match 3        61-74        Bob Jr. IInd
Match 4        92-102       X Æ A-Xii
Match 5        124-138      Bob ()) Jr. (

The idea is pretty simple:

  1. Look behind for your keyword: (?<=hello ).
  2. Look ahead for your middle name: (?=\([^\)]*?\)) (anything inside a set of parentheses that is not a closing parenthesis, lazily so you don't take part of the first name).
  3. Take everything between as your first name: .*.
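The three steps above can be sketched in JavaScript roughly like this. The variable names are illustrative, and the trim() call is an addition of mine: the greedy .* also picks up the space before the opening parenthesis, so the raw match is "Bob " rather than "Bob".

```javascript
// Same pattern as above, with the g, m, and i flags set.
const nameRegex = /(?<=hello ).*(?=\([^\)]*?\))/gim;

const input = [
  "Hello Bob (Marius) Smith.",
  "Hello Bob Jr. (Joseph) White",
  "Hello Bob Jr. IInd (Paul) Jobs",
  "Hello Bob ()) Jr. ( (Darrell) Black",
].join("\n");

// matchAll needs the g flag; each raw match ends with a space, so trim it.
const names = [...input.matchAll(nameRegex)].map((m) => m[0].trim());

console.log(names); // [ 'Bob', 'Bob Jr.', 'Bob Jr. IInd', 'Bob ()) Jr. (' ]
```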

Upvotes: 1

Nils Kähler

Reputation: 3001

This would match your cases in ECMAScript.

(?<=[Hh]ello )(?:[^_][\w.]+)+

You need to allow a capital H by using [Hh] instead of h, since your test cases start with a capital H, and each ++ needs to become a single + to work in ECMAScript. You also need to include a . alongside \w, since it appears in some names.
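As a rough check, the pattern can be run against the three strings from the question like this (variable names are illustrative; lookbehind support in the runtime is assumed):

```javascript
// The pattern from this answer: [Hh] for the capital H, single + quantifiers,
// and . added to the character class.
const ecmaRegex = /(?<=[Hh]ello )(?:[^_][\w.]+)+/;

const testCases = [
  "Hello Bob (Marius) Smith.",
  "Hello Bob Jr. (Joseph) White",
  "Hello Bob Jr. IInd (Paul) Jobs",
];

// exec returns null when nothing matches, so guard before reading index 0.
const matched = testCases.map((s) => {
  const m = ecmaRegex.exec(s);
  return m ? m[0] : null;
});

console.log(matched); // [ 'Bob', 'Bob Jr.', 'Bob Jr. IInd' ]
```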

https://regex101.com/r/lkZK7w/1

-- thanks "D M" for pointing out the missing . in the testcase.

Upvotes: 3
