keytrap

Reputation: 480

Regex match all words enclosed by parentheses and separated by a pipe

I think an image is better than words sometimes.

enter image description here

My problem, as you can see, is that it only matches the words two at a time. How can I match all of the words?

My current regex (PCRE): ([^\|\(\)\|]+)\|([^\|\(\)\|]+)

The goal: retrieve all the words, each in its own separate group.

Upvotes: 2

Views: 414

Answers (2)

The fourth bird

Reputation: 163277

In C#, you can also repeat a single named capture group and read every value it captured from the group's Captures collection.

The words end up as the captures of the named group word:

\((?<word>\w+)(?:\|(?<word>\w+))*\)
  • \( Match (
  • (?<word>\w+) Match 1+ word chars in group word
  • (?: Non capture group
    • \| Match |
    • (?<word>\w+) Match 1+ word chars, again in group word
  • )* Close the non capture group and optionally repeat to get all occurrences
  • \) Match the closing parenthesis

Code example provided by Wiktor Stribiżew in the comments:

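using System;
using System.Linq;
using System.Text.RegularExpressions;

var line = "I love (chocolate|fish|honey|more)";
// Cast<Capture>() keeps this compiling on runtimes where CaptureCollection is non-generic
var output = Regex.Matches(line, @"\((?<word>\w+)(?:\|(?<word>\w+))*\)")
    .Cast<Match>()
    .SelectMany(x => x.Groups["word"].Captures.Cast<Capture>())
    .Select(c => c.Value);
foreach (var s in output)
    Console.WriteLine(s);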

Output

chocolate
fish
honey
more
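
If the input holds several parenthesized lists and you need to know which words belong together, the same pattern can be read one match at a time instead of being flattened. The following is only a minimal sketch: the sample string and the variable names are made up for illustration, and it assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample input with two parenthesized lists.
var sample = "I love (chocolate|fish) and (tea|coffee|milk)";

// One Match per parenthesized list; the Captures of group "word"
// hold every word that the repeated group matched inside that list.
var lists = Regex.Matches(sample, @"\((?<word>\w+)(?:\|(?<word>\w+))*\)")
    .Cast<Match>()
    .Select(m => m.Groups["word"].Captures
        .Cast<Capture>()
        .Select(c => c.Value)
        .ToArray())
    .ToList();

foreach (var words in lists)
    Console.WriteLine(string.Join(", ", words));
```

This should print one line per list: chocolate, fish and then tea, coffee, milk.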


Regex demo

enter image description here

Upvotes: 1

Wiktor Stribiżew

Reputation: 626748

You can use an infinite length lookbehind in C# (with a lookahead):

(?<=\([^()]*)\w+(?=[^()]*\))

To match any strings inside parentheses that contain no (, ) or | characters, replace \w+ with [^()|]+:

(?<=\([^()]*)[^()|]+(?=[^()]*\))
//            ^^^^^^

See the regex demo (and regex demo #2). Details:

  • (?<=\([^()]*) - a positive lookbehind that matches a location that is immediately preceded with ( and then zero or more chars other than ( and )
  • \w+ - one or more word chars
  • (?=[^()]*\)) - a positive lookahead that matches a location that is immediately followed with zero or more chars other than ( and ) and then a ) char.
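
To illustrate the difference the [^()|]+ variant makes, here is a small sketch with items that contain spaces and hyphens, which \w+ would split into separate matches. The input string and variable names are assumptions chosen for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical input: items contain spaces and a hyphen.
var phrase = "Pick one (dark chocolate|smoked fish|wild-flower honey)";

// Each match is one whole item: the lookbehind requires an opening "("
// earlier in the same group, the lookahead requires the closing ")".
var items = Regex.Matches(phrase, @"(?<=\([^()]*)[^()|]+(?=[^()]*\))")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();

foreach (var item in items)
    Console.WriteLine(item);
```

Note that any whitespace around an item is part of the match, so trimming may be needed if the lists are written like ( a | b ).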

Another way to capture these words is by using

(?:\G(?!^)\||\()(\w+)(?=[^()]*\))     // words as units consisting of letters/digits/diacritics/connector punctuation
(?:\G(?!^)\||\()([^()|]+)(?=[^()]*\)) // "words" that consist of any chars other than (, ) and |

See this regex demo. The words you need are now in Group 1. Details:

  • (?:\G(?!^)\||\() - a position after the previous match (\G(?!^)) and a | char (\|), or (|) a ( char (\()
  • (\w+) - Group 1: one or more word chars
  • (?=[^()]*\)) - a positive lookahead that makes sure there is a ) char after any zero or more chars other than ( and ) to the right of the current position.

Extracting the matches in C# can be done with

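// Option 1: lookbehind + lookahead, each word is a whole match
var lookaroundMatches = Regex.Matches(text, @"(?<=\([^()]*)\w+(?=[^()]*\))")
    .Cast<Match>()
    .Select(x => x.Value);

// Option 2: \G-based pattern, each word is in Group 1
var groupMatches = Regex.Matches(text, @"(?:\G(?!^)\||\()(\w+)(?=[^()]*\))")
    .Cast<Match>()
    .Select(x => x.Groups[1].Value);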
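
For a quick side-by-side check of the two patterns, a self-contained sketch could look like this. The input string (with a deliberate a|b outside the parentheses) and the variable names are assumptions for illustration only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical input: the "a|b" outside the parentheses should be ignored.
var text = "a|b I love (chocolate|fish|honey|more)";

// Lookbehind/lookahead pattern: the word is the whole match.
var viaLookarounds = Regex.Matches(text, @"(?<=\([^()]*)\w+(?=[^()]*\))")
    .Cast<Match>()
    .Select(x => x.Value)
    .ToList();

// \G pattern: each match starts at "(" or continues right after
// the previous match with "|"; the word is in Group 1.
var viaAnchor = Regex.Matches(text, @"(?:\G(?!^)\||\()(\w+)(?=[^()]*\))")
    .Cast<Match>()
    .Select(x => x.Groups[1].Value)
    .ToList();

Console.WriteLine(string.Join(", ", viaLookarounds));
Console.WriteLine(string.Join(", ", viaAnchor));
Console.WriteLine(viaLookarounds.SequenceEqual(viaAnchor));
```

Both lists should contain chocolate, fish, honey and more, so the last line should print True.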

Upvotes: 5
