gloriousCatnip

Reputation: 431

Python regex is having problems finding a special unicode character

I am currently parsing some old exams to determine how often each question appears (because many questions are likely to resurface in this year's exam). I am using pyperclip to get the input for re.findall.

This is the regex I am using:

pattern = re.compile(ur'\d.[a-zA-Z .,\']+\?', re.UNICODE)

And this is an example question from an older exam (the pattern I am trying to find):

9. In Wycherley’s The Country Wife, what does Mr. Pinchwife threaten to inscribe on Mrs. Pinchwife’s face with his penknife?

The apostrophe is not one I can find on my keyboard, and trying to execute the code results in this error:

 File "examAnalyzer.py", line 7
    pattern = re.compile(ur'\d.[a-zA-Z .,\Æ]+\?', re.UNICODE)
SyntaxError: (unicode error) 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte

I am using Python 2.7.11 and Anaconda 4.0, and I edit the Python file in Vim.

Upvotes: 1

Views: 261

Answers (2)

Wiktor Stribiżew

Reputation: 626758

You can use the \u representation of the apostrophe, which is \u2019.

Also, the dot should be escaped to match a literal dot symbol.

Use

ur'\d\.[a-zA-Z .,\'\u2019]+\?'
     ^^            ^^^^^^  

If you are unsure which hex code point a symbol has, you can look it up at r12a >> apps >> Unicode code converter.
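To check the suggested pattern against the example question from the post, here is a minimal sketch. It deliberately uses a plain (non-raw) unicode literal with doubled backslashes instead of the ur'' prefix, which is an assumption made so that the same snippet also runs on Python 3, where ur'' is not valid syntax. The names sample and matches are illustrative only.

```python
import re

# Non-raw unicode literal: \u2019 is resolved by Python itself,
# so the regex backslashes (\d, \., \?) have to be doubled.
pattern = re.compile(u"\\d\\.[a-zA-Z .,'\u2019]+\\?", re.UNICODE)

# The example question from the post, with the curly apostrophe
# written as an escape so the source file stays pure ASCII.
sample = (u"9. In Wycherley\u2019s The Country Wife, what does Mr. Pinchwife "
          u"threaten to inscribe on Mrs. Pinchwife\u2019s face with his penknife?")

matches = pattern.findall(sample)
print(len(matches))
```

Because the apostrophe is written as an escape sequence, the script contains no non-ASCII bytes, so the question of how the file was saved does not come up.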

Upvotes: 1

Daniel

Reputation: 42748

Your Python file declares a source encoding of UTF-8, but the file itself is saved in a different encoding.

You should give the correct encoding in the first line:

# -*- coding: <correct encoding> -*-
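As a rough way to see the mismatch, the sketch below decodes the byte from the error message (0x92) under two encodings. That the file was saved as Windows-1252 (cp1252) is only a guess based on that byte value; the post does not say which encoding Vim actually used. The names raw, utf8_ok and decoded are illustrative.

```python
# The byte reported in the SyntaxError message.
raw = b"\x92"

# 0x92 cannot start a UTF-8 sequence, so this decode fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the file was saved as Windows-1252, where 0x92
# is the right single quotation mark (U+2019).
decoded = raw.decode("cp1252")

print(utf8_ok)
print(decoded == u"\u2019")
```

If that guess holds, either declaring cp1252 in the coding line or re-saving the file as UTF-8 should make the declaration and the actual bytes agree.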

Upvotes: 0
