I suck at Python regex and would love to see some solved examples to help me gain understanding. I am experimenting using http://pyregex.com/ which is great but need some 'good' examples to get me started.
I try to create a set of rules like so:
rules = [('name', r'[a-z]+'),
         ('operator', r'[+\-*]')]
which I have found, but I am not confident enough to create my own regexes for cases like the ones listed below:
1. match only the = or += or *= characters
2. match the + character (i.e. the operator as seen above) separately from the ++ characters
3. match any one word after a specific keyword (e.g. int) and any number of space(s) and/or tabs. [edited - initially had followed which was wrong]
For 1. I have tried [\+=|=], for 2. I know the order in the rules is important, and for 3. I am completely lost with the [] and on how I can generalize that case to work not just for int, but for float as well.
Any code examples will be greatly appreciated since I am only just starting with Python and coding!
Upvotes: 1
Views: 448
Reputation: 627292
match only the = or += or *= characters
r'[+*]?='
The [+*]?= pattern consists of an optional atom, the character class [+*] that matches either a + or a *, quantified with ? (one or zero times), followed by a literal = symbol. Why not r'\+=|\*=|='? Not only is the optional character class solution shorter, it is also usually more efficient: alternation tends to involve more redundant backtracking. With alternation you also need to take care to place the alternatives in the correct order, so that the longest appears first. That does not always guarantee that the longest will match (it depends on the branch subpatterns), and the order does not matter if there are anchors on both sides of the alternation group.
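To make the difference concrete, here is a minimal sketch using Python's re module. The sample strings and variable names are made up for illustration; only the patterns come from the question and the explanation above.

```python
import re

# Pattern from the answer: an optional + or *, then a literal =
ASSIGN = re.compile(r'[+*]?=')

sample = "a = 1; b += 2; c *= 3"  # hypothetical input
assign_ops = ASSIGN.findall(sample)
print(assign_ops)  # ['=', '+=', '*=']

# The attempt from the question, [\+=|=], is a character class: it matches
# ONE character out of +, = and | (the | is a literal inside [...]).
char_class_hits = re.findall(r'[\+=|=]', "b += 2")
print(char_class_hits)  # ['+', '=']

# Alternation order matters when one branch is a prefix of another.
short_first = re.findall(r'\+|\+=', "b += 2")
long_first = re.findall(r'\+=|\+', "b += 2")
print(short_first)  # ['+']
print(long_first)   # ['+=']
```

Note that inside square brackets the | does not mean "or", which is why the attempt in the question returns the + and the = as two separate one-character matches.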
match the + character (i.e the operator as seen above) separately from the ++ characters
r'(?<!\+)\+(?!\+)'
This pattern matches a literal + (as it is escaped), and only if it is neither preceded by another plus (see the negative lookbehind (?<!\+)) nor followed by another plus (see the negative lookahead (?!\+)). The lookarounds are non-consuming, i.e. the regex index remains right before the plus when it checks for a plus in front of it, and right after the plus when it checks for a plus after it. The characters (or start/end of string positions) they inspect are not returned as part of the match (that is why they are called zero-width assertions).
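A short sketch of the lookaround pattern in action, on a made-up input string. The second half shows the other common approach for a rule-based tokenizer, relying on rule order instead of lookarounds; that variant is an illustration, not something the question prescribes.

```python
import re

# A + that is neither preceded nor followed by another +
LONE_PLUS = re.compile(r'(?<!\+)\+(?!\+)')

sample = "a + b++ + ++c"  # hypothetical input

# Report where the standalone + operators are found.
lone_positions = [m.start() for m in LONE_PLUS.finditer(sample)]
print(lone_positions)  # [2, 8]

# Alternative: try the longer operator first, so ++ is never split into
# two single + tokens.
ordered_tokens = re.findall(r'\+\+|\+', sample)
print(ordered_tokens)  # ['+', '++', '+', '++']
```

If the rules are tried in order, listing '++' before '+' is usually enough and keeps each individual pattern simple. The lookaround version is useful when the + rule has to stand on its own.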
match any one word after a specific keyword (e.g. int) and any number of space(s) and/or tabs.
r'\bint\b(?=\s+\w+\s+)'
If you read the explanation above, you will recognize another zero-width assertion here: (?=\s+\w+\s+) is a positive lookahead that checks whether the whole word int (\b matches word boundary positions) is followed by 1+ whitespace characters, then 1+ word characters, and then again 1+ whitespace characters. Note that the match itself is only the keyword int; the word after it is checked by the lookahead but not consumed.
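The sketch below shows what the lookahead pattern returns, and then one possible way to get at the word after the keyword and to generalize from int to float, as the question asks. The second pattern (DECL) and the sample text are assumptions made for illustration, not part of the original answer.

```python
import re

# Pattern from the answer: matches the keyword, only checks what follows.
KEYWORD = re.compile(r'\bint\b(?=\s+\w+\s+)')

m = KEYWORD.search("int count = 0")
print(m.group(), m.span())  # int (0, 3)

# The lookahead also requires whitespace AFTER the word, so this fails:
no_trailing_space = KEYWORD.search("int count")
print(no_trailing_space)  # None

# Assumed variant: capture the word after int or float, separated by
# one or more spaces and/or tabs. (?:...) groups without capturing.
DECL = re.compile(r'\b(?:int|float)\b[ \t]+(\w+)')

text = "int count = 0\nfloat\tratio = 1.5\nprint x"  # hypothetical input
declared_names = DECL.findall(text)
print(declared_names)  # ['count', 'ratio']
```

A capture group is used here rather than a lookbehind because Python's re module only accepts fixed-width lookbehinds, so "keyword plus any amount of whitespace" cannot be expressed as one. The \b before the keyword is also what stops print from being treated as containing int.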
Upvotes: 1
Reputation: 1341
The examples provided in the documentation and in the previous answers should get you started on the right path. An additional consideration, since you said you are new to programming and Python, is that regular expressions are an intermediate to advanced topic (depending on what you want to do with them) and are best tackled once you have a better grasp of good programming practices and Python's fundamentals. In any case, more information and examples can be found in the documentation of the Python Regular Expressions module.
Upvotes: 0