Robert

Reputation: 11

pattern matching in Python with regex problem

I am trying to learn pattern matching with regex. The course is on Coursera and hasn't been updated since Python 3 came out, so the instructor's code does not work correctly.

Here's what I have so far:

# example Wiki data
wiki= """There are several Buddhist universities in the United States. Some of these have existed for decades and are accredited. Others are relatively new and are either in the process of being accredited or else have no formal accreditation. The list includes: 
• Dhammakaya Open University – located in Azusa, California, 
• Dharmakirti College – located in Tucson, Arizona 
• Dharma Realm Buddhist University – located in Ukiah, California 
• Ewam Buddhist Institute – located in Arlee, Montana
• Naropa University - located in Boulder, Colorado 
• Institute of Buddhist Studies – located in Berkeley, California
• Maitripa College – located in Portland, Oregon
• Soka University of America – located in Aliso Viejo, California
• University of the West – located in Rosemead, California 
• Won Institute of Graduate Studies – located in Glenside, Pennsylvania"""




import re

pattern=re.compile(
    r'(?P<title>.*)' # the university title
    r'(-\ located\ in\ )' # an indicator of the location
    r'(?P<city>\w*)' # the city the university is in
    r'(,\ )' # separator for the state
    r'(?P<state>\w.*)') # the state the city is in


for item in re.finditer(pattern, wiki, re.VERBOSE):
    print(item.groupdict())

Output:

Traceback (most recent call last):
  File "/Users/r..., line 194, in <module>
    for item in re.finditer(pattern, wiki, re.VERBOSE):
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/re/__init__.py", line 223, in finditer
    return _compile(pattern, flags).finditer(string)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/re/__init__.py", line 282, in _compile
    raise ValueError(
ValueError: cannot process flags argument with a compiled pattern

I only want a dictionary with the university name, the city, and the state. If I run it without re.VERBOSE, only one school shows up and the rest are missing. I am somewhat new to Python and don't know what to do about these errors.

Upvotes: 0

Views: 86

Answers (3)

The fourth bird

Reputation: 163217

In your example data you are using 2 types of hyphens.

If you want to match both, you can use a character class: [–-]
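As a minimal sketch of that character class, the snippet below checks one line with the longer dash and one with the short hyphen. The two sample lines are copied from the question's data; the variable names are only illustrative.

```python
import re

# One line uses the longer dash, the other the plain hyphen-minus.
lines = [
    "Dharmakirti College – located in Tucson, Arizona",
    "Naropa University - located in Boulder, Colorado",
]

# [–-] accepts either dash. The hyphen comes last in the class,
# so it is read as a literal character and not as a range.
dash = re.compile(r" [–-] located in ")

matched = [bool(dash.search(line)) for line in lines]
print(matched)
```

Both entries come out as True, whereas a pattern containing only the short hyphen would miss the first line.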

Apart from that, .* repeats any character zero or more times (so it can match an empty string). It first matches to the end of the line and then backtracks until the rest of the pattern fits.
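The difference between the greedy .* and the lazy .*? can be seen in a small sketch. The input string here is a made-up example with two commas, chosen only so that the two quantifiers give different results.

```python
import re

# Hypothetical input with two ", " separators (not from the question's data).
text = "Aliso Viejo, California, USA"

# Greedy: .* runs to the end, then backtracks to the LAST ", " that works.
greedy = re.match(r"(?P<city>.*), ", text).group("city")

# Lazy: .*? expands one character at a time and stops at the FIRST ", ".
lazy = re.match(r"(?P<city>.*?), ", text).group("city")

# .* is also satisfied by nothing at all.
empty = re.match(r".*", "").group()

print(repr(greedy), repr(lazy), repr(empty))
```

This is why starting each group with \w and using .*? tends to give tighter captures than a bare .* does.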

What you could do is make the pattern a bit more precise by starting each group with at least one word character.

If you are only interested in the groups title, city and state you don't need the other 2 capture groups.

Note that you don't have to escape a space to match it, unless you use re.VERBOSE.
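A short sketch of when the backslash before a space matters: without flags a plain space matches itself, while under re.VERBOSE unescaped whitespace in the pattern is ignored. The sample line is taken from the question's data.

```python
import re

text = "Naropa University - located in Boulder, Colorado"

# No flags: a plain space in the pattern matches a space in the text.
plain = re.search(r"located in (?P<city>\w+)", text)

# re.VERBOSE drops unescaped whitespace, so this looks for "locatedin...".
verbose_unescaped = re.search(r"located in (?P<city>\w+)", text, re.VERBOSE)

# Escaping the spaces keeps them significant under re.VERBOSE.
verbose_escaped = re.search(r"located\ in\ (?P<city>\w+)", text, re.VERBOSE)

print(plain.group("city"), verbose_unescaped, verbose_escaped.group("city"))
```

So the escaped spaces in the question's pattern are harmless without re.VERBOSE, just unnecessary.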

^\W*(?P<title>\w.*?) [–-] located in (?P<city>\w.*?), (?P<state>\w.*)
  • ^ Start of the line (with re.M; otherwise start of the string)
  • \W* Match optional non word characters
  • (?P<title>\w.*?) Match a word character, followed by as few characters as possible
  • [–-] Match any of the dashes with a space to the left and right
  • located in Match literally
  • (?P<city>\w.*?) Match a word character, followed by as few characters as possible
  • , Match literally
  • (?P<state>\w.*) Match a word character followed by the rest of the line


Example

import re

pattern = r"^\W*(?P<title>\w.*?) [–-] located in (?P<city>\w.*?), (?P<state>\w.*)"

wiki = """There are several Buddhist universities in the United States. Some of these have existed for decades and are accredited. Others are relatively new and are either in the process of being accredited or else have no formal accreditation. The list includes:
• Dhammakaya Open University – located in Azusa, California,
• Dharmakirti College – located in Tucson, Arizona
• Dharma Realm Buddhist University – located in Ukiah, California
• Ewam Buddhist Institute – located in Arlee, Montana
• Naropa University - located in Boulder, Colorado
• Institute of Buddhist Studies – located in Berkeley, California
• Maitripa College – located in Portland, Oregon
• Soka University of America – located in Aliso Viejo, California
• University of the West – located in Rosemead, California
• Won Institute of Graduate Studies – located in Glenside, Pennsylvania"""

for item in re.finditer(pattern, wiki, re.M):
    print(item.groupdict())

Output

{'title': 'Dhammakaya Open University', 'city': 'Azusa', 'state': 'California,'}
{'title': 'Dharmakirti College', 'city': 'Tucson', 'state': 'Arizona'}
{'title': 'Dharma Realm Buddhist University', 'city': 'Ukiah', 'state': 'California'}
{'title': 'Ewam Buddhist Institute', 'city': 'Arlee', 'state': 'Montana'}
{'title': 'Naropa University', 'city': 'Boulder', 'state': 'Colorado'}
{'title': 'Institute of Buddhist Studies', 'city': 'Berkeley', 'state': 'California'}
{'title': 'Maitripa College', 'city': 'Portland', 'state': 'Oregon'}
{'title': 'Soka University of America', 'city': 'Aliso Viejo', 'state': 'California'}
{'title': 'University of the West', 'city': 'Rosemead', 'state': 'California'}
{'title': 'Won Institute of Graduate Studies', 'city': 'Glenside', 'state': 'Pennsylvania'}
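The first row keeps a trailing comma in 'state' because that source line ends with one. If that is unwanted, one possible cleanup (an assumption about what the asker wants, not part of the original answer) is to strip trailing commas and spaces from each captured value:

```python
import re

pattern = r"^\W*(?P<title>\w.*?) [–-] located in (?P<city>\w.*?), (?P<state>\w.*)"

# The one line from the data that ends with a comma.
sample = "• Dhammakaya Open University – located in Azusa, California,"

match = re.search(pattern, sample, re.M)

# Remove trailing commas and spaces from every captured group.
record = {key: value.rstrip(" ,") for key, value in match.groupdict().items()}
print(record)
```

Post-processing keeps the regex unchanged; tightening the state group itself would be another option.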

Upvotes: 0

Robert

Reputation: 11

Thanks to JustLearning, my problem is solved. Here is the code I ended up using. I can't believe it was a long hyphen instead of a short one. And now I know I don't need to use re.VERBOSE. Thank you again.

pattern = re.compile(r'(?P<title>.*)' r'(-\ located\ in\ )' r'(?P<city>.*)' r'(,\ )' r'(?P<state>.*)')

Upvotes: 1

CrisPlusPlus

Reputation: 2302

In fact, you do not need to add re.VERBOSE at all: the pattern is already compiled, and re.finditer cannot take a flags argument together with a compiled pattern. If you do

for item in re.finditer(pattern, wiki):
    print(item.groupdict())

the program will print

{'title': '• Naropa University ', 'city': 'Boulder', 'state': 'Colorado '}

using Python 3.10.
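The ValueError from the question can be reproduced in a few lines. The sketch below uses a shortened, hypothetical pattern and input; if flags are needed, they go to re.compile, and the compiled object's own finditer is then called without flags.

```python
import re

# Flags belong in re.compile, not in the later finditer call.
compiled = re.compile(r"(?P<city>\w+),\ (?P<state>\w+)", re.VERBOSE)
text = "located in Boulder, Colorado"

# Passing flags together with a compiled pattern raises ValueError.
try:
    list(re.finditer(compiled, text, re.VERBOSE))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Calling the method on the compiled pattern works as expected.
results = [m.groupdict() for m in compiled.finditer(text)]
print(error_message)
print(results)
```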

By the way, the program only outputs one school because the other schools use a long hyphen instead of a short one, -. Making all schools use the same hyphen, and changing your pattern accordingly, should give you the whole list.
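One way to make all schools use the same hyphen, sketched here on two lines from the question's data, is to replace the long dash with the short one before matching. The pattern is the question's own, and str.replace is just one possible way to normalize.

```python
import re

sample = """• Dharmakirti College – located in Tucson, Arizona
• Naropa University - located in Boulder, Colorado"""

# Replace every long dash with the short hyphen the pattern expects.
normalized = sample.replace("–", "-")

pattern = re.compile(
    r'(?P<title>.*)'       # the university title
    r'(-\ located\ in\ )'  # an indicator of the location
    r'(?P<city>\w*)'       # the city the university is in
    r'(,\ )'               # separator for the state
    r'(?P<state>\w.*)')    # the state the city is in

cities = [m.group("city") for m in pattern.finditer(normalized)]
print(cities)
```

Note that the title group still includes the leading bullet and a trailing space, as in the output shown above.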

Upvotes: 0
