staticdev

Reputation: 3060

Problem omitting optional word in python3 regex

I need a regex that captures 2 groups: a movie and the year. Optionally, there could be a 'from ' string between them.

My expected results are:

first_query="matrix 2013" => ('matrix', '2013')
second_query="matrix from 2013" => ('matrix', '2013')
third_query="matrix" => ('matrix', None)

I've tried 2 patterns on https://regex101.com/ for python3:

I- r"(.+)(?:from ){0,1}([1-2]\d{3})" Doesn't match first_query and third_query, and it also doesn't omit 'from' from group one, which is what I want to avoid.

II- r"(.+)(?:from ){1}([1-2]\d{3})" Works with second_query, but does not match first_query and third_query.

Is it possible to match all three strings, omitting the 'from ' string from the first group?

Thanks in advance.

Upvotes: 1

Views: 95

Answers (3)

jez

Reputation: 15349

import re

pattern = re.compile( r"""
    ^\s*              # start of string (optional whitespace)
    (?P<title>\S+)    # one or more non-whitespace characters (title)
    (?:\s+from)?      # optionally, some space followed by the word 'from'
    \s*               # optional whitespace
    (?P<year>[0-9]+)? # optional digit string (year)
    \s*$              # end of string (optional whitespace)
""", re.VERBOSE )

for query in [ 'matrix 2013', 'matrix from 2013', 'matrix' ]:
    m = re.match( pattern, query )
    if m: print( m.groupdict() )

# Prints:
# {'title': 'matrix', 'year': '2013'}
# {'title': 'matrix', 'year': '2013'}
# {'title': 'matrix', 'year': None}

Disclaimer: this regex does not contain the logic necessary to reject the first two matches on the grounds that The Matrix actually came out in 1999.
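One thing to be aware of: \S+ only captures a single word, so a query such as 'the matrix from 1999' would not match the pattern above. A minimal sketch of a variant that allows spaces in the title, assuming a lazy .+? is acceptable for your inputs (multiword_pattern and parse_query are hypothetical names, not part of the code above):

```python
import re

# Hypothetical variant of the pattern above: a lazy `.+?` replaces `\S+`
# so that the title may contain spaces.
multiword_pattern = re.compile(r"""
    ^\s*                # start of string (optional whitespace)
    (?P<title>.+?)      # as few characters as possible (title)
    (?:\s+from)?        # optionally, some space followed by the word 'from'
    \s*                 # optional whitespace
    (?P<year>[0-9]+)?   # optional digit string (year)
    \s*$                # end of string (optional whitespace)
""", re.VERBOSE)


def parse_query(query):
    """Return (title, year), with year None when absent; None if no match."""
    m = multiword_pattern.match(query)
    return (m.group("title"), m.group("year")) if m else None


for query in ["the matrix from 1999", "the matrix", "matrix 2013"]:
    print(parse_query(query))
```

The trade-off is that any trailing run of digits is taken as the year, so a title that itself ends in a number would be split.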

Upvotes: 1

Patrick Artner

Reputation: 51643

This will output your patterns, but with one space too many in front of the number:

import re

pat = r"^(.+?)(?: from)? ?(\d+)?$"


text = """matrix 2013
matrix from 2013
matrix"""

for t in text.split("\n"):
    print(re.findall(pat,t))

Output:

[('matrix', '2013')]
[('matrix', '2013')]
[('matrix', '')]

Explanation:

^            start of string
(.+?)        lazy: anything, as few characters as possible
(?: from)?   optional non-capturing ` from`
 ?           optional space
(\d+)?$      optional digits till end of string

Demo: https://regex101.com/r/VD0SZb/1
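Note that re.findall reports a group that did not participate as an empty string, while the question asks for None. A small sketch using re.match and .groups() with the same pattern, which gives None for the missing year (parse is a hypothetical helper name):

```python
import re

pat = r"^(.+?)(?: from)? ?(\d+)?$"


def parse(query):
    """Return the two groups as a tuple, or None if the query does not match."""
    m = re.match(pat, query)
    # .groups() uses None for groups that did not take part in the match
    return m.groups() if m else None


for q in ["matrix 2013", "matrix from 2013", "matrix"]:
    print(parse(q))

# Prints:
# ('matrix', '2013')
# ('matrix', '2013')
# ('matrix', None)
```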

Upvotes: 2

Wiktor Stribiżew

Reputation: 626738

You may use

^(.+?)(?:\s+(?:from\s+)?([12]\d{3}))?$

See the regex demo

Details

  • ^ - start of a string
  • (.+?) - Group 1: any 1+ chars other than line break chars, as few as possible
  • (?:\s+(?:from\s+)?([12]\d{3}))? - an optional non-capturing group matching 1 or 0 occurrences of:
    • \s+ - 1+ whitespaces
    • (?:from\s+)? - an optional sequence: the substring from followed by 1+ whitespaces
    • ([12]\d{3}) - Group 2: 1 or 2 followed by 3 digits
  • $ - end of string.
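For completeness, a short sketch of how the pattern could be used from Python. YEAR_QUERY and split_query are hypothetical names chosen for this example:

```python
import re

# The pattern from above, compiled once for reuse.
YEAR_QUERY = re.compile(r"^(.+?)(?:\s+(?:from\s+)?([12]\d{3}))?$")


def split_query(query):
    """Return (title, year); year is None when no 4-digit year is present."""
    m = YEAR_QUERY.match(query)
    return m.groups() if m else None


for q in ["matrix 2013", "matrix from 2013", "matrix", "the matrix from 1999"]:
    print(split_query(q))

# Prints:
# ('matrix', '2013')
# ('matrix', '2013')
# ('matrix', None)
# ('the matrix', '1999')
```

Since the year must be exactly 4 digits starting with 1 or 2, a query such as 'matrix 13' ends up entirely in Group 1 with the year left as None.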

Upvotes: 3
