Reputation: 598
I'm writing a simple application where I want to replace certain words with other words. I'm running into problems with words that use single quotes, such as aren't, ain't, isn't.
I have a text file with the following contents:
aren’t=ain’t
hello=hey
I parse the text file and create a dictionary out of it:
u'aren\u2019t' = u'ain\u2019t'
u'hello' = u'hey'
Then I try to replace all the matching words in a given text:
text = u"aren't"
def replace_all(text, dict):
    for i, k in dict.iteritems():
        #replace all whole words of I with K in lower cased text, regex = \bSTRING\b
        text = re.sub(r"\b" + i + r"\b", k , text.lower())
    return text
The problem is that re.sub() doesn't match u'aren\u2019t' with u"aren't".
What can I do so that my replace_all() function will match both "hello" and "aren't" and replace them with the appropriate text? Can I do something in Python so that my dictionary doesn't contain Unicode? Could I convert my text to use a Unicode character, or could I modify the regex to match the Unicode character as well as all the other text?
Upvotes: 2
Views: 815
Reputation: 38877
>>> u"aren\u2019t" == u"aren't"
False
>>> u"aren\u2019t" == u"aren’t"
True
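The comparison fails because the two strings contain different characters at that position. A small sketch that makes this visible with the standard unicodedata module (the helper name describe is made up for illustration):

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of a single character."""
    return u"U+%04X %s" % (ord(ch), unicodedata.name(ch))

print(describe(u"'"))        # U+0027 APOSTROPHE
print(describe(u"\u2019"))   # U+2019 RIGHT SINGLE QUOTATION MARK
```

The text file uses the typographic quote U+2019, while the string literal typed in the source uses the plain ASCII apostrophe U+0027.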
Upvotes: 0
Reputation: 25656
I guess your problem is:
text = u"aren't"
instead of:
text = u"aren’t"
(note the different apostrophes?)
Here's your code modified to make it work:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
d = {
    u'aren’t': u'ain’t',
    u'hello': u'hey'
}
#text = u"aren't"
text = u"aren’t"
def replace_all(text, d):
    for i, k in d.iteritems():
        #replace all whole words of I with K in lower cased text, regex = \bSTRING\b
        text = re.sub(r"\b" + i + r"\b", k , text.lower())
    return text
if __name__ == '__main__':
    newtext = replace_all(text, d)
    print newtext
Output:
ain’t
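If the input text cannot be relied on to use the same apostrophe as the dictionary, one option is to normalize both sides before matching. The following is only a sketch under that assumption, not the code from the question: normalize_quotes and the name mapping are illustrative, re.escape is added so keys are treated literally, and .items() is used so it runs on Python 3 as well as 2.7.

```python
import re

def normalize_quotes(s):
    """Map the typographic apostrophe U+2019 to the ASCII apostrophe U+0027."""
    return s.replace(u"\u2019", u"'")

def replace_all(text, mapping):
    # Lower-case and normalize the input once, up front.
    text = normalize_quotes(text.lower())
    for old, new in mapping.items():
        # Normalize the key as well, and escape it so it is matched literally.
        pattern = r"\b" + re.escape(normalize_quotes(old)) + r"\b"
        text = re.sub(pattern, new, text, flags=re.UNICODE)
    return text

mapping = {u"aren\u2019t": u"ain\u2019t", u"hello": u"hey"}
print(replace_all(u"Hello, aren't you coming?", mapping))
```

Here the replacement values are left exactly as they appear in the mapping, so the output keeps the typographic apostrophe from the file; pass them through normalize_quotes too if plain ASCII output is preferred.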
Upvotes: 3
Reputation: 400692
This works fine for me in Python 2.6.4:
>>> re.sub(ur'\baren\u2019t\b', 'rep', u'aren\u2019t')
u'rep'
Make sure that your pattern string is a Unicode string; otherwise it might not work.
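To address the last part of the question, the pattern itself can be made to accept either apostrophe. This is a rough sketch, assuming only U+0027 and U+2019 need to be treated as equivalent; the helper word_pattern is hypothetical, not something from the original code.

```python
import re

# Character class that accepts the ASCII or the typographic apostrophe.
APOSTROPHES = u"['\u2019]"

def word_pattern(word):
    """Compile a whole-word pattern that matches `word` with either apostrophe."""
    # Split on any apostrophe, escape the literal pieces, and rejoin them
    # with the character class so both variants match.
    parts = re.split(APOSTROPHES, word)
    body = APOSTROPHES.join(re.escape(p) for p in parts)
    return re.compile(r"\b" + body + r"\b", re.UNICODE | re.IGNORECASE)

pattern = word_pattern(u"aren\u2019t")
print(pattern.sub(u"ain\u2019t", u"Aren't they here? They aren\u2019t."))
```

With re.IGNORECASE the input no longer has to be lower-cased first, so the rest of the text keeps its original capitalization.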
Upvotes: 0