Pialy Tapaswi

Reputation: 541

Regular Expression to split a string with parentheses

I need help with a regular expression in C#.

A group of words in parentheses (round, square, or curly) should be treated as one word. The part outside the parentheses should be split on white space ' '.

A) Test Case –

Input - Andrew. (The Great Musician) John Smith-Lt.Gen3rd

Result (Array of string) –
1. Andrew.
2. The Great Musician
3. John
4. Smith-Lt.Gen3rd

B) Test Case –

Input - Andrew. John

Result (Array of string) –
1. Andrew.
2. John

C) Test Case –

Input - Andrew {The Great} Pirate

Result (Array of string) –
1. Andrew
2. The Great
3. Pirate

The input is the name of a person or any other entity. The current system is very old and written in Access; it does this by scanning character by character. I am replacing it with C#.

I thought of doing it in two steps: first a split based on the parentheses, then a word split.

I want to reject these cases as bad input:

  1. Only an opening or only a closing parenthesis is present

  2. Nested parentheses

Overall, I want to split only well-formed input (if there is an opening parenthesis, there must be a matching closing one).
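
The well-formedness rule above could be checked with a single character scan before any splitting is attempted. The following is only a sketch: the class name BracketValidator and the method name IsWellFormed are made up for illustration, and it assumes that a closing bracket of a different kind than the opening one also counts as bad input.

```csharp
using System.Collections.Generic;

public static class BracketValidator
{
    // Maps each opening bracket to the closing bracket that must follow it.
    private static readonly Dictionary<char, char> Pairs = new Dictionary<char, char>
    {
        { '(', ')' }, { '[', ']' }, { '{', '}' }
    };

    // Returns true when every bracket group is closed by its own partner
    // and no group is opened inside another one.
    public static bool IsWellFormed(string input)
    {
        char expectedClose = '\0';   // '\0' means "currently outside any group"

        foreach (char c in input)
        {
            if (Pairs.ContainsKey(c))
            {
                if (expectedClose != '\0') return false;   // nested opening bracket
                expectedClose = Pairs[c];
            }
            else if (c == ')' || c == ']' || c == '}')
            {
                if (c != expectedClose) return false;      // stray or mismatched closer
                expectedClose = '\0';
            }
        }

        return expectedClose == '\0';                      // false if a group was never closed
    }
}
```

With this sketch, BracketValidator.IsWellFormed("Andrew {The Great} Pirate") returns true, while "Andrew (The Great" and "Andrew (The [Great]) Pirate" are both rejected.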

Upvotes: 1

Views: 4026

Answers (2)

Nick

Reputation: 4362

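Here is a regex, meant for splitting the string, that handles your examples that contain parentheses (the parentheses themselves stay in the resulting tokens, and an input with no parentheses at all, such as test case B, is not split by this pattern):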

\s(?=.*?(?:\(|\{|\[).*?(?:\]|\}|\)).*?)|(?<=(?:\(|\[|\{).*?(?:\}|\]|\)).*?)\s

This regex is in two parts, separated by the | (OR) operator:

  1. \s(?=.*?(?:\(|\{|\[).*?(?:\]|\}|\)).*?) - Looks for a white space before sets of (), [], or {}
  2. (?<=(?:\(|\[|\{).*?(?:\}|\]|\)).*?)\s - Looks for a white space after sets of (), [], or {}

Here is the breakdown of each part:

Part 1 (\s(?=.*?(?:\(|\{|\[).*?(?:\]|\}|\)).*?)):

1. \s             - matches white space
2. (?=            - Begins a lookahead assertion (what it contains must exist after the \s)
3. .*?            - Matches any character any number of times. The `?` makes it non-greedy, so it takes as few characters as it needs
4. (?:\(|\{|\[)   - A non-capturing group matching `(`, `{`, or `[`
5. .*?            - Same as #3
6. (?:\]|\}|\))   - The closing counterpart of #4: matches `]`, `}`, or `)`
7. .*?            - Same as #3
8. )              - Closes the lookahead.  #3 through #7 are in the lookahead.

Part 2 is the same thing, but instead of the lookahead ((?=)) it has a lookbehind ((?<=))
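
As a rough illustration of how the pattern might be applied, here is a sketch that passes it to Regex.Split and then strips the surrounding brackets from each token. The class name SplitAroundBrackets, the method name Tokenize, the bracket trimming, and the plain white-space fallback for input without any brackets are all additions made for this example and are not part of the pattern itself. It assumes the input holds at most one bracket group and has already been checked for stray brackets.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitAroundBrackets
{
    private static readonly char[] Brackets = { '(', ')', '[', ']', '{', '}' };

    // The two-part pattern from this answer, unchanged.
    private static readonly Regex Splitter = new Regex(
        @"\s(?=.*?(?:\(|\{|\[).*?(?:\]|\}|\)).*?)|(?<=(?:\(|\[|\{).*?(?:\}|\]|\)).*?)\s");

    public static string[] Tokenize(string input)
    {
        // Assumption: without any bracket the pattern never matches,
        // so fall back to an ordinary split on spaces.
        if (input.IndexOfAny(Brackets) < 0)
        {
            return input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        }

        return Splitter.Split(input)
            .Select(token => token.Trim(Brackets))   // drop the enclosing brackets
            .Where(token => token.Length > 0)        // skip empty pieces
            .ToArray();
    }
}
```

For test case A, Tokenize returns the four strings Andrew., The Great Musician, John and Smith-Lt.Gen3rd.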

After the question was edited by its author:

For a regex that will search for lines with only complete parentheses, you can use this:

.*\(.*(?=.*?\).*?)|(?<=.*?\(.*?).*\).*

To check curly or square brackets instead, replace \( and \) in the pattern with \{ and \} or \[ and \].

Upvotes: 5

Tim Pietzcker

Reputation: 336408

How about this:

Regex regexObj = new Regex(
    @"(?<=\()       # Assert that the previous character is a (
    [^(){}[\]]+     # Match one or more non-paren/brace/bracket characters
    (?=\))          # Assert that the next character is a )
    |               # or
    (?<=\{)[^(){}[\]]+(?=\}) # Match {...}
    |               # or 
    (?<=\[)[^(){}[\]]+(?=\]) # Match [...]
    |               # or
    [^(){}[\]\s]+   # Match anything except whitespace or parens/braces/brackets", 
    RegexOptions.IgnorePatternWhitespace);

This assumes no nested parentheses/braces/brackets.
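
A possible way to collect the tokens with this pattern is to enumerate the matches instead of splitting. This is a sketch only: the class name BracketTokenizer and the method name Tokenize are invented for the example, and the pattern is the one above with its comments shortened.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class BracketTokenizer
{
    private static readonly Regex TokenRegex = new Regex(
        @"(?<=\()[^(){}[\]]+(?=\))     # text between ( and )
        | (?<=\{)[^(){}[\]]+(?=\})     # text between { and }
        | (?<=\[)[^(){}[\]]+(?=\])     # text between [ and ]
        | [^(){}[\]\s]+                # any other run without spaces or brackets",
        RegexOptions.IgnorePatternWhitespace);

    // Each match is one token, so no separate split step is needed.
    public static string[] Tokenize(string input)
    {
        return TokenRegex.Matches(input)
            .Cast<Match>()
            .Select(match => match.Value)
            .ToArray();
    }
}
```

Because the lookarounds keep the brackets out of each match, the tokens come back without their enclosing brackets; for test case C the result is Andrew, The Great and Pirate.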

Upvotes: 1
