Regular expressions: distinguish strings including/excluding a given word

Question

I'm working in Python and trying to parse the output of a statsmodels GLM. I'm relatively new to regular expressions. I have strings such as

string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"

I wrote the following regex:

pattern = re.compile(r'C\((.+?)\)\[T\.(.+?)\]')
print(pattern.search(string_1).group(1))
#State
print(pattern.search(string_2).group(1))
#State, Treatment('Alaska')

So both of these strings match the pattern, but I want to get State in both cases. Basically, I want to get rid of everything from the comma onward (including the comma) inside the first pair of parentheses.

How can I distinguish the string_2 pattern from string_1's and extract only State, without the ", Treatment(...)" part?

Mad Physicist · Accepted Answer

You can add an optional non-capturing group instead of just allowing all characters:

pattern = re.compile(r'C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]')

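(?:...) groups its contents without capturing them, and the trailing ? makes the group optional. Because the first group is lazy, it stops at State, and the optional group consumes ", Treatment('Alaska')" when it is present, so group(1) is State for both strings.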
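
As a quick check, here is a minimal sketch that runs the optional-group pattern against the two strings from the question. The list name samples and the loop are only for illustration, and the pattern is written as a raw string so the backslashes reach the regex engine unchanged.

```python
import re

# Optional non-capturing group (?:, .+?)? absorbs ", Treatment(...)" when present.
pattern = re.compile(r"C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]")

# The two example strings from the question (the list name is illustrative).
samples = [
    "C(State)[T.Kansas]",
    "C(State, Treatment('Alaska'))[T.Kansas]",
]

for s in samples:
    m = pattern.search(s)
    # group(1) is the variable name, group(2) is the level after "T."
    print(m.group(1), m.group(2))
```

Both iterations print State Kansas. Note that the optional group expects a comma followed by a single space; a string formatted without that space would not be trimmed by this pattern.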
