Reputation: 1321
I've following regexes defined for capturing the gem names in a Gemfile.
GEM_NAME = /[a-zA-Z0-9\-_\.]+/
QUOTED_GEM_NAME = /(?:(?<gq>["'])(?<name>#{GEM_NAME})\k<gq>|%q<(?<name>#{GEM_NAME})>)/
I want to convert these into a regex that can be used in python and other languages.
I tried (?:(["'])([a-zA-Z0-9\-_\.]+)\k["']|%q<([a-zA-Z0-9\-_\.]+)>)
based on substitution and several similar combinations but none of them worked. Here's the regexr link http://regexr.com/3g527
Can someone please explain what should be correct process for converting these ruby regular expression defintions into a form that can be used by python.
Upvotes: 1
Views: 1052
Reputation:
A simple way is to use a conditional and consolidate the name.
(?:(?:(["'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))
Expanded
(?:
(?: # Delimiters
( ["'] ) # (1), ' or "
| # or,
%q< # %q
)
(?P<name> [a-zA-Z0-9\-_\.]+ ) # (2), Name
(?(1) \1 | > ) # Did group 1 match ? match it here, else >
)
Python
import re
s = ' "asdf" %q<asdfasdf> '
print ( re.findall( r'(?:(?:(["\'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))', s ) )
Output
[('"', 'asdf'), ('', 'asdfasdf')]
Upvotes: 1
Reputation: 89584
You can rewrite your pattern like this:
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r'''["'%] # first possible character
(?:(?<=%)q<)? # if preceded by a % match "q<"
(?P<name> # the three possibilities excluding the delimiters
(?<=") {0} (?=") |
(?<=') {0} (?=') |
(?<=<) {0} (?=>)
)
["'>] #'"# closing delimiter
(?x) # switch the verbose mode on for all the pattern
'''.format(GEM_NAME)
Advantages:
re.findall
to get a list of gems without further manipulation.Note that you don't need to escape the dot inside a character class.
Upvotes: 1
Reputation: 627103
To define a named group, you need to use (?P<name>)
and then (?p=name)
named
If you can afford a 3rd party library, you may use PyPi regex module and use the approach you had in Ruby (as regex
supports multiple identically named capturing groups):
s = """%q<Some-name1> "some-name2" 'some-name3'"""
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r'(?:(?P<gq>["\'])(?<name>{0})(?P=gq)|%q<(?P<name>{0})>)'.format(GEM_NAME)
print(QUOTED_GEM_NAME)
# => # (?:(?P<gq>["\'])(?<name>[a-zA-Z0-9_.-]+)(?P=gq)|%q<(?P<name>[a-zA-Z0-9_.-]+)>)
import regex
res = [x.group("name") for x in regex.finditer(QUOTED_GEM_NAME, s)]
print(res)
# => ['Some-name1', 'some-name2', 'some-name3']
backreference in the replacement pattern.
See this Python demo.
If you decide to go with Python re
, it can't handle identically named groups in one regex pattern.
You can discard the named groups altogether and use numbered ones, and use re.finditer
to iterate over all the matches with comprehension to grab the right capture.
import re
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r"([\"'])({0})\1|%q<({0})>".format(GEM_NAME)
s = """%q<Some-name1> "some-name2" 'some-name3'"""
matches = [x.group(2) if x.group(1) else x.group(3) for x in re.finditer(QUOTED_GEM_NAME, s)]
print(matches)
# => ['Some-name1', 'some-name2', 'some-name3']
So, ([\"'])({0})\1|%q<({0})>
has got 3 capturing groups: if Group 1 matches, the first alternative got matched, thus, Group 2 is taken, else, the second alternative matched, and Group 3 value is grabbed in the comprehension.
Pattern details
([\"'])
- Group 1: a "
or '
({0})
- Group 2: GEM_NAME
pattern\1
- inline backreference to the Group 1 captured value (note that r'...'
raw string literal allows using a single backslash to define a backreference in the string literal)|
- or%q<
- a literal substring({0})
- Group 3: GEM_NAME
pattern>
- a literal >
.Upvotes: 1