Satwik

Reputation: 1321

Convert Ruby regular expression definition to Python regex

I have the following regexes defined for capturing the gem names in a Gemfile.

GEM_NAME = /[a-zA-Z0-9\-_\.]+/

QUOTED_GEM_NAME = /(?:(?<gq>["'])(?<name>#{GEM_NAME})\k<gq>|%q<(?<name>#{GEM_NAME})>)/

I want to convert these into a regex that can be used in Python and other languages.

I tried (?:(["'])([a-zA-Z0-9\-_\.]+)\k["']|%q<([a-zA-Z0-9\-_\.]+)>) based on substitution, plus several similar combinations, but none of them worked. Here's the regexr link: http://regexr.com/3g527

Can someone please explain the correct process for converting these Ruby regular expression definitions into a form that can be used by Python?

Upvotes: 1

Views: 1052

Answers (3)

user557597

Reputation:

A simple way is to use a conditional for the closing delimiter, so the name needs only one capture group.

(?:(?:(["'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))

Expanded

 (?:
      (?:                           # Delimiters
           ( ["'] )                      # (1), ' or "
        |                              # or,
           %q<                           # %q<
      )
      (?P<name> [a-zA-Z0-9\-_\.]+ ) # (2), Name
      (?(1) \1 | > )                # Did group 1 match ? match it here, else >
 )

Python

import re

s = ' "asdf"  %q<asdfasdf>  '

print ( re.findall( r'(?:(?:(["\'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))', s ) )

Output

[('"', 'asdf'), ('', 'asdfasdf')]
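Because the pattern has two capture groups, re.findall returns a tuple per match. If only the gem names are wanted, one option is to iterate with re.finditer and read the named group. This is a minimal sketch reusing the pattern above; the variable names pattern and names are only for illustration.

```python
import re

# Same pattern as above: group 1 is the optional quote, "name" is the gem name.
pattern = re.compile(r'(?:(?:(["\'])|%q<)(?P<name>[a-zA-Z0-9\-_\.]+)(?(1)\1|>))')

s = ' "asdf"  %q<asdfasdf>  '

# Read only the named group from each match object.
names = [m.group('name') for m in pattern.finditer(s)]
print(names)
# => ['asdf', 'asdfasdf']
```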

Upvotes: 1

Casimir et Hippolyte

Reputation: 89584

You can rewrite your pattern like this:

GEM_NAME = r'[a-zA-Z0-9_.-]+'

QUOTED_GEM_NAME = r'''(?x) # verbose mode for the whole pattern; a global inline flag must come first
    ["'%] # first possible character
    (?:(?<=%)q<)? # if preceded by a % match "q<"
    (?P<name> # the three possibilities excluding the delimiters
        (?<=") {0} (?=") |
        (?<=') {0} (?=') |
        (?<=<) {0} (?=>)
    )
    ["'>] #'"# closing delimiter
'''.format(GEM_NAME)


Advantages:

  • the pattern doesn't start with an alternation, which would make the search slow. (Here the alternation is only tested at interesting positions, after a quote or a %, whereas your version tests each branch of the alternation at every position in the string.) This optimisation technique is called "first character discrimination" and consists of quickly discarding useless positions in a string.
  • you need only one capture group (quotes and angle brackets are excluded from it and only tested with lookarounds). This way you can use re.findall to get a list of gems without further manipulation.
  • the gq group wasn't useful and was removed (shortening a pattern at the cost of creating a useless capture group isn't a good idea).

Note that you don't need to escape the dot inside a character class.
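As a rough usage sketch of the lookaround-based pattern, the snippet below runs it with re.findall. Two things are assumptions of this sketch, not part of the answer: verbose mode is requested with the re.VERBOSE flag instead of an inline flag, and the gemfile sample text is made up.

```python
import re

GEM_NAME = r'[a-zA-Z0-9_.-]+'

# Verbose mode comes from the re.VERBOSE flag here, not from an inline (?x).
QUOTED_GEM_NAME = re.compile(r'''
    ["'%]              # first possible character
    (?:(?<=%)q<)?      # after a %, also consume q<
    (?P<name>          # the name, delimiters excluded
        (?<=") {0} (?=") |
        (?<=') {0} (?=') |
        (?<=<) {0} (?=>)
    )
    ["'>]              # closing delimiter
'''.format(GEM_NAME), re.VERBOSE)

# Hypothetical input; the last line has mismatched quotes on purpose.
gemfile = 'gem "rails"\ngem \'pg\'\ngem %q<nokogiri>\ngem "broken\'\n'

# With a single capture group, findall returns the captured names directly.
names = QUOTED_GEM_NAME.findall(gemfile)
print(names)
# => ['rails', 'pg', 'nokogiri']
```

The mismatched "broken' entry is skipped because each branch checks the same delimiter on both sides of the name.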

Upvotes: 1

Wiktor Stribiżew

Reputation: 627103

To define a named group in Python, use (?P<name>...), and use (?P=name) for a named backreference inside the pattern. If you can afford a third-party library, you can use the PyPI regex module and keep the approach you had in Ruby (regex supports multiple identically named capturing groups):

import regex

s = """%q<Some-name1> "some-name2" 'some-name3'"""

GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r'(?:(?P<gq>["\'])(?P<name>{0})(?P=gq)|%q<(?P<name>{0})>)'.format(GEM_NAME)
print(QUOTED_GEM_NAME)
# => (?:(?P<gq>["\'])(?P<name>[a-zA-Z0-9_.-]+)(?P=gq)|%q<(?P<name>[a-zA-Z0-9_.-]+)>)

res = [x.group("name") for x in regex.finditer(QUOTED_GEM_NAME, s)]
print(res)
# => ['Some-name1', 'some-name2', 'some-name3']


If you decide to go with Python re, it can't handle identically named groups in one regex pattern.
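This limitation is easy to confirm. The sketch below translates the Ruby pattern one-to-one, with two groups both called name, and tries to compile it with re. The variable names pattern and compiled are only for illustration.

```python
import re

GEM_NAME = r'[a-zA-Z0-9_.-]+'
# Direct translation of the Ruby pattern: "name" is defined in both branches.
pattern = r'(?:(?P<gq>["\'])(?P<name>{0})(?P=gq)|%q<(?P<name>{0})>)'.format(GEM_NAME)

try:
    re.compile(pattern)
    compiled = True
except re.error as exc:
    # re rejects the second definition of the same group name.
    compiled = False
    print("re.error:", exc)
```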

You can discard the named groups altogether and use numbered ones, then use re.finditer to iterate over all the matches, with a list comprehension to grab the right capture.

Example Python code:

import re
GEM_NAME = r'[a-zA-Z0-9_.-]+'
QUOTED_GEM_NAME = r"([\"'])({0})\1|%q<({0})>".format(GEM_NAME)
s = """%q<Some-name1> "some-name2" 'some-name3'"""
matches = [x.group(2) if x.group(1) else x.group(3) for x in re.finditer(QUOTED_GEM_NAME, s)]
print(matches)
# => ['Some-name1', 'some-name2', 'some-name3']

So, ([\"'])({0})\1|%q<({0})> has 3 capturing groups: if Group 1 participated in the match, the first alternative matched, so Group 2 is taken; otherwise the second alternative matched, and the Group 3 value is taken in the comprehension.

Pattern details

  • ([\"']) - Group 1: a " or '
  • ({0}) - Group 2: GEM_NAME pattern
  • \1 - inline backreference to the Group 1 captured value (note that r'...' raw string literal allows using a single backslash to define a backreference in the string literal)
  • | - or
  • %q< - a literal substring
  • ({0}) - Group 3: GEM_NAME pattern
  • > - a literal >.
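If named groups are still preferred with re, one possible variant is to give each branch its own group name and take whichever one participated in the match. This is a sketch, not part of the approach above: the group names quoted and percent are hypothetical, chosen only for this example.

```python
import re

GEM_NAME = r'[a-zA-Z0-9_.-]+'
# "quoted" and "percent" are hypothetical group names chosen for this sketch.
QUOTED_GEM_NAME = re.compile(
    r'(?P<gq>["\'])(?P<quoted>{0})(?P=gq)|%q<(?P<percent>{0})>'.format(GEM_NAME)
)

s = """%q<Some-name1> "some-name2" 'some-name3'"""

# Only one of the two name groups participates in each match; the other is None.
matches = [m.group('quoted') or m.group('percent') for m in QUOTED_GEM_NAME.finditer(s)]
print(matches)
# => ['Some-name1', 'some-name2', 'some-name3']
```

The `or` fallback is safe here because GEM_NAME requires at least one character, so a group that participated is never an empty string.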

Upvotes: 1
