Dima

Reputation: 277

Regular expressions: distinguish strings including/excluding a given word

I'm working in Python and trying to parse the term names in statsmodels' GLM output. I'm relatively new to regular expressions. I have strings such as

string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"

I wrote the following regex:

import re

# Raw string so the backslashes reach the regex engine unchanged
pattern = re.compile(r'C\((.+?)\)\[T\.(.+?)\]')
print(pattern.search(string_1).group(1))
# State
print(pattern.search(string_2).group(1))
# State, Treatment('Alaska')

So both of these strings match the pattern, but I want to get State in both cases. Basically, I want to get rid of everything from the comma onward (including the comma itself) inside the first pair of parentheses.

How can I handle the string_2 form as well as the string_1 form and extract only State, without the ", Treatment('Alaska')" part?

Upvotes: 3

Views: 63

Answers (2)

anubhava

Reputation: 785156

You may use this regex, which relies on negated character classes:

C\((\w+)[^[]*\[T\.([^]]+)\]
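A minimal sketch of how this pattern might be applied in Python. The helper name extract_factor_and_level is made up for illustration, and the brackets inside the character classes are escaped here, which should match the same text as the unescaped form above. Note that \w+ assumes the factor name consists only of letters, digits, and underscores.

```python
import re

# Same idea as the pattern above; brackets inside the character
# classes are escaped (a style choice assumed here, not a requirement).
PATTERN = re.compile(r"C\((\w+)[^\[]*\[T\.([^\]]+)\]")

def extract_factor_and_level(term):
    """Hypothetical helper: return (factor, level), or None if there is no match."""
    match = PATTERN.search(term)
    return match.groups() if match else None

print(extract_factor_and_level("C(State)[T.Kansas]"))
# ('State', 'Kansas')
print(extract_factor_and_level("C(State, Treatment('Alaska'))[T.Kansas]"))
# ('State', 'Kansas')
```

Here (\w+) captures the leading name, [^\[]* skips anything up to the opening square bracket (including any ", Treatment(...)" part), and ([^\]]+) captures the level.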


Upvotes: 3

Mad Physicist

Reputation: 114320

You can add an optional non-capturing group instead of just allowing all characters:

pattern = re.compile(r'C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]')

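(?:...) groups its contents together without capturing them, and the trailing ? makes the group optional. Because the first group is lazy, it stops at the first ", " when one is present, so the optional group absorbs the rest up to the closing parenthesis.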
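A quick check of this pattern against the two strings from the question, written as a sketch with a raw string literal so the backslashes are passed to the regex engine as-is:

```python
import re

# Optional non-capturing group (?:, .+?)? swallows ", Treatment(...)" when present
pattern = re.compile(r"C\((.+?)(?:, .+?)?\)\[T\.(.+?)\]")

string_1 = "C(State)[T.Kansas]"
string_2 = "C(State, Treatment('Alaska'))[T.Kansas]"

for s in (string_1, string_2):
    m = pattern.search(s)
    # group(1) is the factor name, group(2) is the level
    print(m.group(1), m.group(2))
# State Kansas
# State Kansas
```

This version expects the separator to be a comma followed by a single space; if the spacing can vary, something like ,\s* in the optional group may be a safer choice.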

Upvotes: 3
