Reputation: 541
Need help with RegEx. Using C#.
A group of words in brackets (round, square, or curly) should be treated as one word. The part outside the brackets should be split on white space ' '.
A) Test Case –
Input - Andrew. (The Great Musician) John Smith-Lt.Gen3rd
Result (Array of string) –
1. Andrew.
2. The Great Musician
3. John
4. Smith-Lt.Gen3rd
B) Test Case –
Input - Andrew. John
Result (Array of string) –
1. Andrew.
2. John
C) Test Case –
Input - Andrew {The Great} Pirate
Result (Array of string) –
1. Andrew
2. The Great
3. Pirate
The input is the name of a person or any other entity. The current system is very old and written in Access; it does this by scanning character by character. I am replacing it with C#.
I thought of doing it in two steps: first a parentheses-based split, then a word split.
I want to reject these cases as bad input:
only an opening or only a closing parenthesis is present
nested parentheses
Overall, I want to split only well-formed inputs (if there is an opening parenthesis, there must be a closing one).
Upvotes: 1
Views: 4026
Reputation: 4362
Here is a regex to split on (for example with Regex.Split) that handles the bracketed groups in your examples:
\s(?=.*?(?:\(|\{|\[).*?(?:\]|\}|\)).*?)|(?<=(?:\(|\[|\{).*?(?:\}|\]|\)).*?)\s
This regex is in two parts, separated by an | (OR):
\s(?=.*?(?:\(|\{|\[).*?(?:\]|\}|\)).*?) - Looks for a white space before sets of (), [], or {}
(?<=(?:\(|\[|\{).*?(?:\}|\]|\)).*?)\s - Looks for a white space after sets of (), [], or {}
Here is the breakdown of each part:
Part 1 (\s(?=.*?(?:\(|\{|\[).*?(?:\]|\}|\)).*?)):
1. \s - Matches a white space character
2. (?= - Begins a lookahead assertion (what is included must exist after the \s)
3. .*? - Matches any character any number of times. The `?` makes it ungreedy (lazy), so it will grab the fewest characters it needs
4. (?:\(|\{|\[) - A non-capturing group matching `(`, `{`, or `[`
5. .*? - Same as #3
6. (?:\]|\}|\)) - The counterpart of #4, matching `]`, `}`, or `)`
7. .*? - Same as #3
8. ) - Closes the lookahead. #3 through #7 are inside the lookahead.
Part 2 is the same thing, but instead of the lookahead ((?=)) it has a lookbehind ((?<=)).
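To see how the split pattern above behaves, here is a minimal sketch that passes it to Regex.Split and then strips the surrounding brackets from each token. The class name BracketSplitDemo and the helper SplitName are made up for illustration, and the Trim step is my addition, not part of the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BracketSplitDemo
{
    // The split pattern from the answer above, unchanged.
    private static readonly Regex Splitter = new Regex(
        @"\s(?=.*?(?:\(|\{|\[).*?(?:\]|\}|\)).*?)|(?<=(?:\(|\[|\{).*?(?:\}|\]|\)).*?)\s");

    // Hypothetical helper: split on the matched white space, then remove
    // any bracket characters left at the ends of each token.
    public static string[] SplitName(string input)
    {
        return Splitter.Split(input)
            .Select(token => token.Trim('(', ')', '[', ']', '{', '}'))
            .ToArray();
    }
}
```

Note that both halves of the pattern require a bracket pair somewhere in the input, so a string with no brackets at all (test case B) comes back as a single unsplit element and would need an ordinary white-space split instead.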
After the question was edited by its author:
For a regex that will search for lines with only complete parentheses, you can use this:
.*\(.*(?=.*?\).*?)|(?<=.*?\(.*?).*\).*
You can replace ( and ) in it with { and } or [ and ] to get versions that check for complete curly and square brackets.
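For the bad-input cases listed in the question (an opening or closing bracket on its own, nested brackets), a plain character scan is another option. This is a minimal sketch and not the regex approach from this answer; BracketValidator and IsWellFormed are hypothetical names, and I am assuming that a closer of the wrong kind, as in "(abc]", should also count as bad input.

```csharp
using System;

public static class BracketValidator
{
    // Returns true when every bracket is closed by its matching partner
    // and no bracket opens while another one is still open.
    public static bool IsWellFormed(string input)
    {
        const string openers = "([{";
        const string closers = ")]}";
        char expectedCloser = '\0'; // '\0' means no bracket is currently open

        foreach (char c in input)
        {
            int open = openers.IndexOf(c);
            if (open >= 0)
            {
                if (expectedCloser != '\0') return false; // nested bracket
                expectedCloser = closers[open];
            }
            else if (closers.IndexOf(c) >= 0)
            {
                if (c != expectedCloser) return false; // stray or mismatched closer
                expectedCloser = '\0';
            }
        }

        return expectedCloser == '\0'; // false if a bracket was never closed
    }
}
```

Running a check like this first means the splitting step only ever sees well-formed input.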
Upvotes: 5
Reputation: 336408
How about this:
Regex regexObj = new Regex(
    @"(?<=\()                     # Assert that the previous character is a (
      [^(){}[\]]+                 # Match one or more non-paren/brace/bracket characters
      (?=\))                      # Assert that the next character is a )
      |                           # or
      (?<=\{)[^(){}[\]]+(?=\})    # Match {...}
      |                           # or
      (?<=\[)[^(){}[\]]+(?=\])    # Match [...]
      |                           # or
      [^(){}[\]\s]+               # Match anything except whitespace or parens/braces/brackets",
    RegexOptions.IgnorePatternWhitespace);
This assumes no nested parentheses/braces/brackets.
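A short sketch of how the matches could be collected into the string array the question asks for. The pattern is the same as above, written on one line without the comments; TokenMatchDemo and Tokenize are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenMatchDemo
{
    // Same alternatives as the commented pattern above, on a single line.
    private static readonly Regex Tokens = new Regex(
        @"(?<=\()[^(){}[\]]+(?=\))|(?<=\{)[^(){}[\]]+(?=\})|(?<=\[)[^(){}[\]]+(?=\])|[^(){}[\]\s]+");

    // Each match is one token: either the contents of a bracketed group
    // or a run of non-white-space characters outside any brackets.
    public static string[] Tokenize(string input)
    {
        return Tokens.Matches(input)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }
}
```

Because this matches the tokens rather than the separators, the brackets are already gone from the results. It does not reject malformed input on its own: an unclosed group such as "Andrew (The Great" is simply tokenized word by word.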
Upvotes: 1