Reputation: 277
I'm working in Python and trying to parse the output of a statsmodels GLM. I'm relatively new to regular expressions. I have strings such as
string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"
I wrote the following regex:
import re
pattern = re.compile(r'C\((.+?)\)\[T\.(.+?)\]')
print(pattern.search(string_1).group(1))
#State
print(pattern.search(string_2).group(1))
#State, Treatment('Alaska')
So both of these strings match the pattern, but I want to get State in both cases. Basically, I want to get rid of everything from the comma onward (including the comma) inside the first pair of parentheses.
How can I distinguish string_2's pattern from string_1's and extract only State, without the trailing , Treatment('Alaska')?
Upvotes: 3
Views: 63
Reputation: 785156
You can use this regex, which relies on negated character classes:
C\((\w+)[^[]*\[T\.([^]]+)\]
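As a quick check, here is a minimal sketch that applies this regex to the two strings from the question. The names `strings` and `results` are only illustrative. The idea is that `(\w+)` stops at the first non-word character (the comma or the closing parenthesis), and `[^[]*` then skips everything up to the opening square bracket.

```python
import re

# The regex from this answer, written as a raw string so the
# backslashes reach the regex engine unchanged.
pattern = re.compile(r"C\((\w+)[^[]*\[T\.([^]]+)\]")

# The two example strings from the question.
strings = [
    "C(State)[T.Kansas]",
    "C(State, Treatment('Alaska'))[T.Kansas]",
]

# Group 1 is the variable name, group 2 is the level after "T.".
results = [pattern.search(s).groups() for s in strings]

for s, (name, level) in zip(strings, results):
    print(f"{s} -> {name}, {level}")
```

Note that `\w+` assumes the variable name consists only of letters, digits, and underscores; a name containing other characters would need a different class.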
Upvotes: 3
Reputation: 114320
You can add an optional non-capturing group instead of just allowing all characters:
pattern = re.compile(r'C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]')
(?:...) groups its contents together without capturing them. The trailing ? makes the group optional.
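A minimal sketch of how this behaves on the two strings from the question (the names `examples` and `first_groups` are only illustrative). Because the first `.+?` is lazy, it stops at the earliest point where the rest of the pattern can match, which is right before the `, ` when the optional group is present.

```python
import re

# Pattern with the optional non-capturing group, as a raw string.
pattern = re.compile(r"C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]")

# The two example strings from the question.
examples = [
    "C(State)[T.Kansas]",
    "C(State, Treatment('Alaska'))[T.Kansas]",
]

# Only groups 1 and 2 exist; the (?:...) part is matched but not captured.
first_groups = [pattern.search(s).group(1) for s in examples]
second_groups = [pattern.search(s).group(2) for s in examples]

print(first_groups)
print(second_groups)
```

This assumes the separator is always a comma followed by a single space, as in the example; if the spacing can vary, something like `,\s*` in place of `, ` may be safer.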
Upvotes: 3