Reputation: 598

Handle Unicode characters with Python regexes

I'm writing a simple application that replaces certain words with other words. I'm running into problems with words that contain an apostrophe, such as aren't, ain't, and isn't.

I have a text file with the following contents:

aren’t=ain’t
hello=hey

I parse the text file and create a dictionary out of it

u'aren\u2019t' = u'ain\u2019t'
u'hello' = u'hey'

Then I try to replace every matching word in a given text:

text = u"aren't"

def replace_all(text, dict):
    for i, k in dict.iteritems():
        #replace all whole words of I with K in lower cased text, regex = \bSTRING\b
        text = re.sub(r"\b" + i + r"\b", k , text.lower())
    return text

The problem is that re.sub() doesn't match u'aren\u2019t' against u"aren't".

What can I do so that my replace_all() function will match both "hello" and "aren't" and replace them with the appropriate text? Can I do something in Python so that my dictionary doesn't contain Unicode? Could I convert my text to use the Unicode character, or could I modify the regex to match the Unicode character as well as all the other text?

Upvotes: 2

Answers (4)

intrepion

Reputation: 38877

u"aren\u2019t" == u"aren't"

False

u"aren\u2019t" == u"aren’t"

True
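
To see why the two strings compare unequal, it can help to print the code point and Unicode name of each apostrophe. This is only a small sketch; describe is a hypothetical helper name, and it is written so that it should also run on Python 3:

```python
import unicodedata

ascii_apostrophe = u"'"        # what you get when typing ' on most keyboards
curly_apostrophe = u"\u2019"   # what word processors often substitute


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


print(describe(ascii_apostrophe))   # U+0027 APOSTROPHE
print(describe(curly_apostrophe))   # U+2019 RIGHT SINGLE QUOTATION MARK
```

The two characters look alike on screen but are different code points, so neither == nor a regex treats them as the same.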

Upvotes: 0

Mikel

Reputation: 25656

I guess your problem is:

text = u"aren't"

instead of:

text = u"aren’t"

(Note the different apostrophes: the first is the ASCII ' (U+0027), the second is the typographic ’ (U+2019).)

Here's your code modified to make it work:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

d = {
    u'aren’t': u'ain’t',
    u'hello': u'hey'
    }
#text = u"aren't"
text = u"aren’t"


def replace_all(text, d):
    for i, k in d.iteritems():
        # Replace each whole-word occurrence of key i with value k in the
        # lower-cased text; \b anchors the match at word boundaries.
        text = re.sub(r"\b" + i + r"\b", k, text.lower())
    return text

if __name__ == '__main__':
    newtext = replace_all(text, d)
    print newtext

Output:

ain’t
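
If the input text may contain either kind of apostrophe, one option is to normalize both the text and the dictionary to the ASCII apostrophe before matching. The following is a rough sketch of that idea, not the original code; normalize_apostrophes is a hypothetical helper, the sample sentences are made up, and it is written so that it should also run on Python 3 (hence items() instead of iteritems()):

```python
import re


def normalize_apostrophes(s):
    """Map the typographic apostrophe U+2019 to the ASCII apostrophe U+0027."""
    return s.replace(u"\u2019", u"'")


def replace_all(text, mapping):
    # Normalize and lower-case the text once, before any replacement.
    text = normalize_apostrophes(text).lower()
    for old, new in mapping.items():
        # re.escape guards against keys that contain regex metacharacters.
        pattern = r"\b" + re.escape(normalize_apostrophes(old)) + r"\b"
        text = re.sub(pattern, normalize_apostrophes(new), text, flags=re.UNICODE)
    return text


mapping = {u"aren\u2019t": u"ain\u2019t", u"hello": u"hey"}
print(replace_all(u"Hello, they aren't here", mapping))       # hey, they ain't here
print(replace_all(u"they aren\u2019t here", mapping))         # they ain't here
```

The trade-off is that the output always uses the ASCII apostrophe, even if the input used the typographic one.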

Upvotes: 3

eos87

Reputation: 9363

Try saving your file with UTF-8 encoding.
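
Once the file is UTF-8, reading it with an explicit encoding should keep the apostrophe intact as U+2019. A minimal sketch, assuming the old=new line format from the question; load_replacements is a hypothetical helper name, and the temporary file only stands in for the real one:

```python
import io
import os
import tempfile


def load_replacements(path):
    """Parse lines of the form old=new into a dict of Unicode strings."""
    mapping = {}
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or u"=" not in line:
                continue  # skip blank or malformed lines
            old, new = line.split(u"=", 1)
            mapping[old] = new
    return mapping


# Demo: write a throwaway UTF-8 file, load it, then clean up.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"aren\u2019t=ain\u2019t\nhello=hey\n")
mapping = load_replacements(path)
os.remove(path)

print(len(mapping))                # 2
print(u"aren\u2019t" in mapping)   # True
```

Note that this only makes sure the dictionary is decoded correctly; it does not by itself make the typographic apostrophe match an ASCII one in the text.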

Upvotes: 0

Adam Rosenfield

Reputation: 400692

This works fine for me in Python 2.6.4:

>>> re.sub(ur'\baren\u2019t\b', 'rep', u'aren\u2019t')
u'rep'

Make sure that your pattern string is a Unicode string; otherwise it might not match.
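
If you would rather change the regex than the text, one possibility is to build a Unicode pattern in which every apostrophe becomes a character class that accepts both forms. This is only a sketch under that assumption; apostrophe_tolerant_pattern is a hypothetical name, and the code is written so that it should also run on Python 3, where the ur'' prefix is not valid syntax:

```python
import re

APOSTROPHES = u"'\u2019"  # ASCII apostrophe and right single quotation mark


def apostrophe_tolerant_pattern(word):
    """Compile a whole-word pattern that accepts either apostrophe form."""
    parts = []
    for ch in word:
        if ch in APOSTROPHES:
            parts.append(u"['\u2019]")   # match ' or U+2019
        else:
            parts.append(re.escape(ch))  # match the character literally
    return re.compile(r"\b" + u"".join(parts) + r"\b",
                      re.UNICODE | re.IGNORECASE)


pattern = apostrophe_tolerant_pattern(u"aren\u2019t")
first = pattern.sub(u"ain\u2019t", u"They aren't here")
second = pattern.sub(u"ain\u2019t", u"They AREN\u2019T here")
print(first == second)   # True
```

Both inputs end up as u"They ain\u2019t here", and the rest of the text keeps its original case because the text itself is never lower-cased.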

Upvotes: 0
