Reputation: 971

Regex: Matching all words EXCEPT those inside of parenthesis (C#)

So given:

COLUMN_1, COLUMN_2, COLUMN_3, ((COLUMN_1) AS SOME TEXT) AS COLUMN_4, COLUMN_5

How would I go about getting my matches as:

COLUMN_1
COLUMN_2
COLUMN_3
COLUMN_4
COLUMN_5

I've tried:

(?<!(\(.*?\)))(\w+)(,\s*\w+)*?

But I feel like I'm way off base :( I'm using regexstorm.net for testing.

Appreciate any help :)

Upvotes: 3

Answers (4)

Luis Colorado

Reputation: 12708

Matching all words except some set of them is one of the most difficult exercises you can do with regular expressions. In principle the recipe is simple: construct the finite automaton that accepts your original, non-negated predicate about the strings, turn every accepting state into a non-accepting one (and the other way around), and finally construct a regular expression equivalent to the automaton you just built. In practice this is hard to do by hand, so the easiest way to deal with it is to write the regexp for the predicate you want to negate, pass your string through the regexp matcher, and reject the string if it matches.

The main problem is that, while all of this is easy for a computer, constructing a regular expression from an automaton description is tedious and usually gives you a huge result rather than the one you wanted. Let me illustrate with an example:

You have asked for matching words, but of these words you only want the ones that don't appear in a given set. Let's suppose we want the automaton that matches precisely that set of words, and suppose we have matched the first n-1 letters of one such word. This string should be matched, but only if the final letter does not come next. So the proper regexp should match all the letters of the word but the last, followed by something that is not the last letter. Now, we can skip this test if the string matches all the letters of the word but the last two, followed by a different letter, and so on, back to the first letter (obviously, if the string doesn't begin with the first letter of the word, it cannot be that word anyway). Let's suppose the first word is BEGIN. A good regexp matching things that are not equal to BEGIN is something like this:

[^B]|B[^E]|BE[^G]|BEG[^I]|BEGI[^N]
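A quick way to see what this alternation covers is to run it against a few strings. The following is only a rough sketch: the `^` anchor and the `(?:...)` wrapper are assumptions added here (the pattern above is shown unanchored), and the class name and sample strings are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class NotBeginDemo
{
    // The alternation from above, anchored at the start of the input.
    // The anchor is an assumption of this sketch, not part of the original pattern.
    static readonly Regex DiffersFromBegin =
        new Regex(@"^(?:[^B]|B[^E]|BE[^G]|BEG[^I]|BEGI[^N])");

    // True when the input deviates from BEGIN somewhere in its first five characters.
    public static bool Differs(string s)
    {
        return DiffersFromBegin.IsMatch(s);
    }

    static void Main()
    {
        foreach (string s in new[] { "BEGIN", "BEGAN", "END", "BEG" })
            Console.WriteLine(s + " -> " + Differs(s));
    }
}
```

With the anchor in place, BEGIN and BEG give False while BEGAN and END give True. The BEG case shows a gap: every alternative needs one deviating character, so proper prefixes of BEGIN are not recognised as different from it.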

A different scenario (one that complicates things further) is to find a regexp that matches a string only if the word BEGIN is not contained in it. Let's start from the opposite predicate, a string that does contain the word BEGIN:

^.*BEGIN.*$

and let's construct its finite automaton:

(0)---B--->(1)---E--->(2)---G--->(3)---I--->(4)---N--->((5))
 ^ \        |          |          |          |           ^ \
 | |        |          |          |          |           | |
 `-+<-------+<---------+<---------+<---------'           `-+

where the double parentheses indicate an accepting state. If you just swap the accepting and non-accepting states, you get an automaton that accepts exactly the strings the first one rejected, and vice versa.

((0))--B-->((1))--E-->((2))--G-->((3))--I-->((4))--N-->(5)
 ^ \         |          |          |          |         ^ \
 | |         |          |          |          |         | |
 `-+<--------+<---------+<---------+<---------'         `-+

But converting this into a simple regular expression is far from easy (you can try, if you don't believe me).

And this is with only one word, so think about how to match any of several words: construct the automaton, and then switch the accepting/non-accepting status of each state.

In your case there is something more to deal with, because your predicate is not equivalent to the one I have formulated. Mine decides whether a whole string contains one word (which is the task regexps were conceived for), but yours is about matching groups inside the string. If you try my example, you will find that a string as simple as "" (the empty string) is accepted by the second automaton, as the starting state ((0)) is an accepting state (well, the empty string doesn't contain the word BEGIN). But you want your regexp to match words (and "" isn't a word), so we first need to define what a word is for you and construct the regular expression that matches a word:
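The state swap itself is trivial once the automaton is simulated in code rather than written as a regexp. The sketch below is a hypothetical illustration, not a general construction: it hard-codes the word BEGIN, and on a mismatch it falls back to state 1 when the offending character is a B (a detail the simplified diagram leaves out), otherwise to state 0.

```csharp
using System;

static class BeginAutomaton
{
    const string Word = "BEGIN";

    // Runs the automaton for "contains BEGIN" and returns the final state.
    // States 0..4 count the letters of BEGIN just seen; state 5 means BEGIN has appeared.
    static int Run(string input)
    {
        int state = 0;
        foreach (char c in input)
        {
            if (state == Word.Length) break;      // state 5 loops on every character
            if (c == Word[state]) state++;        // expected letter: advance
            else state = (c == 'B') ? 1 : 0;      // mismatch: restart, a 'B' already counts
        }
        return state;
    }

    // Original automaton: only state 5 is accepting.
    public static bool ContainsBegin(string input)
    {
        return Run(input) == Word.Length;
    }

    // Complemented automaton: every state except 5 is accepting.
    public static bool LacksBegin(string input)
    {
        return Run(input) != Word.Length;
    }

    static void Main()
    {
        foreach (string s in new[] { "", "BEGIN", "xxBEGINyy", "BEBEGIN", "BEGI" })
            Console.WriteLine("\"" + s + "\": contains=" + ContainsBegin(s) + ", lacks=" + LacksBegin(s));
    }
}
```

Both predicates share the same transition function and differ only in which final states count as accepting, which is the whole point of the complement construction.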

[a-zA-Z][a-zA-Z]*

should be a good candidate. It corresponds to an automaton definition like this:

(0)---[a-zA-Z]--->((1))---[a-zA-Z]--.
 ^ \               |  ^             |
 |  *              *  |             |
 `--+<-------------'  `-------------'

You want an automaton that accepts only strings satisfying both conditions: (1) the string must be a word, and (2) it must not be in the set of words. Not being in the set of words is the same as not being the first word, and not being the second, and not being the third, and so on; you can build that by first constructing an automaton that matches the first word, or the second, or the third, ..., and then negating it. So construct the first automaton, then the second, and then an automaton that accepts only what both of them accept. Again, this is easy for a computer to do with automata, but not for a person.

As I said, constructing an automaton from a regexp is an easy and direct thing for a computer, but not for a person. Constructing a regexp from an automaton is mechanical too, but it results in huge regular expressions. Because of this problem, most implementations have added extended operators that match when some regexp doesn't, and the opposite.

CONCLUSION

Use the negation operators, which let you state the opposite predicate about the set of strings your regexp must accept, or simply write a regexp for the simple part and use boolean algebra to do the rest.
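Both options can be sketched in a few lines of C#. This is only an illustration under assumptions: the excluded set (BEGIN and END), the sample text, and the class and method names are all made up, and a negative lookahead is taken here as one example of a negation operator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class WordFilterDemo
{
    // Hypothetical set of words to exclude.
    static readonly HashSet<string> Excluded = new HashSet<string> { "BEGIN", "END" };

    // Variant 1: let the engine negate, using a negative lookahead.
    public static List<string> ViaLookahead(string text)
    {
        return Regex.Matches(text, @"\b(?!(?:BEGIN|END)\b)[a-zA-Z]+")
                    .Cast<Match>().Select(m => m.Value).ToList();
    }

    // Variant 2: match every word with the simple regexp, then negate in code.
    public static List<string> ViaFilter(string text)
    {
        return Regex.Matches(text, @"[a-zA-Z]+")
                    .Cast<Match>().Select(m => m.Value)
                    .Where(w => !Excluded.Contains(w)).ToList();
    }

    static void Main()
    {
        string text = "BEGIN x := BEGINNING END";
        Console.WriteLine(string.Join(", ", ViaLookahead(text)));
        Console.WriteLine(string.Join(", ", ViaFilter(text)));
    }
}
```

For the sample text both variants keep x and BEGINNING and drop BEGIN and END. The second variant keeps the regexp trivial and moves the negation into ordinary code, which tends to be easier to maintain as the excluded set grows.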

Upvotes: 1

Tim Pietzcker

Reputation: 336478

You need a regex that keeps track of opening and closing parentheses and makes sure that a word is only matched if what follows it contains balanced parentheses (or no parentheses at all):

Regex regexObj = new Regex(
    @"\w+                  # Match a word
    (?=                    # only if it's possible to match the following:
        (?>                # Atomic group (used to avoid catastrophic backtracking):
           [^()]+          # Match any characters except parens
        |                  # or
           \(  (?<DEPTH>)  # a (, increasing the depth counter
        |                  # or
           \)  (?<-DEPTH>) # a ), decreasing the depth counter
        )*                 # any number of times.
        (?(DEPTH)(?!))     # Then make sure the depth counter is zero again
        $                  # at the end of the string.
    )                      # (End of lookahead assertion)", 
    RegexOptions.IgnorePatternWhitespace);

I tried to provide a test link to regexstorm.net, but it was too long for StackOverflow. Apparently, SO also doesn't like URL shorteners, so I can't link this directly, but you should be able to recreate the link easily: http://bit[dot]ly/2cNZS0O
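In the absence of a working link, a minimal way to try the pattern locally against the input from the question might look like this. The pattern is the one above with comments and whitespace stripped; the class and method names are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class BalancedLookaheadDemo
{
    // Same pattern as above, written on one line (no IgnorePatternWhitespace needed).
    static readonly Regex WordOutsideParens = new Regex(
        @"\w+(?=(?>[^()]+|\((?<DEPTH>)|\)(?<-DEPTH>))*(?(DEPTH)(?!))$)");

    // Collects every word that is followed by balanced parentheses up to the end of the input.
    public static List<string> Words(string input)
    {
        var result = new List<string>();
        foreach (Match m in WordOutsideParens.Matches(input))
            result.Add(m.Value);
        return result;
    }

    static void Main()
    {
        string input = "COLUMN_1, COLUMN_2, COLUMN_3, ((COLUMN_1) AS SOME TEXT) AS COLUMN_4, COLUMN_5";
        foreach (string word in Words(input))
            Console.WriteLine(word);
    }
}
```

Because the check is purely about parenthesis depth, the AS that sits outside the parentheses should show up among the matches along with the five column names, so an extra filter (for example on the COLUMN_ prefix) may be needed to get exactly the list from the question.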

Upvotes: 2

revo

Reputation: 48751

Since you have nested parentheses, things get trickier. Although the .NET regex engine provides balancing group constructs, which use a stack, I'd go with a more general approach called recursive matching.

Regex:

\((?(?!\(|\)).|(?R))*\)|(\w+)

Live demo

All you need is in the first capturing group.

Explanation of left side of alternation:

\(           # Match an opening bracket
(?(?!\(|\))  # If next character is not `(` or `)`
    .             # Then match it
    |             # Otherwise
    (?R)          # Recurse into the whole pattern
)*           # As much as possible
\)           # Up to corresponding closing bracket
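One caveat for C#: `(?R)` is PCRE-style recursion, which the .NET engine does not support. A possible .NET rendering of the same idea (consume a whole parenthesized group, otherwise capture a word) can use balancing groups instead. This is a sketch rather than a drop-in translation, and the class and method names are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SkipGroupsDemo
{
    // Left side: a complete, balanced (...) group, consumed without capturing.
    // Right side: a word, captured in group 1.
    static readonly Regex GroupOrWord = new Regex(
        @"\((?>[^()]+|\((?<DEPTH>)|\)(?<-DEPTH>))*(?(DEPTH)(?!))\)|(\w+)");

    // Returns only the matches where group 1 (the word) took part.
    public static List<string> Words(string input)
    {
        var result = new List<string>();
        foreach (Match m in GroupOrWord.Matches(input))
            if (m.Groups[1].Success)
                result.Add(m.Groups[1].Value);
        return result;
    }

    static void Main()
    {
        string input = "COLUMN_1, COLUMN_2, COLUMN_3, ((COLUMN_1) AS SOME TEXT) AS COLUMN_4, COLUMN_5";
        Console.WriteLine(string.Join(", ", Words(input)));
    }
}
```

As with any approach that only looks at nesting, the AS outside the parentheses is captured as a word too.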

Upvotes: 0

user1134181

Reputation:

This should work:

(?<!\()COLUMN_[\d](?!\))

Try it: https://regex101.com/r/bC4D7n/1

Update:

Ok, then try to use this regular expression:

[\(]+[\w\s\W]+[\)]+

Demo here: https://regex101.com/r/bC4D7n/2

Upvotes: 1
