Reputation: 1011
I have text scraped from the internet (I think it was Spanish text encoded in "latin-1" and decoded to unicode when scraped). The text looks something like this:
730\u20ac.\r\n\nropa nueva 2012 ... 5,10 muy buen estado..... 170 \u20ac\r\n\nPack 850\u20ac,
After that I do some replacements on the text to normalize some words, e.g. replacing the € symbol (\u20ac) with "euros" using the regex pair (r'\u20ac', r' euros').
Here my problem seems to start... If I do not encode each string to "UTF-8" before applying the regex, the regex won't find any occurrences (even though plenty of occurrences exist)...
Anyway, after encoding to UTF-8, the regex (r'\u20ac', r' euros') works.
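For illustration, a shortened Python 2 sketch of the behaviour I see (the sample is abridged; the byte-oriented pattern is the second one in my list below):
import re
text = u'730\u20ac\r\n\nropa nueva 2012 ... 170 \u20ac\r\n\nPack 850\u20ac,'
# on the unicode string the byte-oriented pattern finds nothing...
print re.subn(ur' \xe2\x82\xac', r' euros', text)[1]                   # 0
# ...but after encoding, the euro sign becomes exactly those three bytes
print re.subn(ur' \xe2\x82\xac', r' euros', text.encode('utf-8'))[1]   # 1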
After that I tokenize and tag all the strings. When I try to use the RegexpParser I then get the
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
My question is: if I have already encoded it to UTF-8, how come I have a problem now? And what would you suggest to avoid it?
Is there a way to do the encoding process once and for all, like below? If so, what should I do for the second step (encode/decode it anyway)?
Get text -> encode/decode it anyway... -> Work on the text without any issue
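Something like this is what I imagine (a sketch; scrape_page is a hypothetical stand-in for however the text actually arrives):
# Decode once at the boundary, work in unicode throughout, encode only on output.
raw = scrape_page()              # hypothetical: returns latin-1 encoded bytes
text = raw.decode('latin-1')     # bytes -> unicode, done exactly once
# ... all regex work, tokenizing and tagging happens on unicode strings ...
print text.encode('utf-8')       # unicode -> bytes, only when printing or writing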
Thanks in advance for any help!! I am new to programming and it is killing me...
Code detail:
Regex replacement function:
import re

replacement_patterns = [
    (ur' \\u20ac', ur' euros'),
    (ur' \xe2\x82\xac', r' euros'),
    (ur' \b[eE]?[uU]?[rR]\b', r' euros'),
    (ur' \b([0-9]+)[eE][uU]?[rR]?[oO]?[sS]?\b', ur' \1 euros')
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex, re.IGNORECASE), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            (s, count) = re.subn(pattern, repl, s)
        return s
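For reference, I apply it per string like this (the sample is from the text above):
replacer = RegexpReplacer()
print replacer.replace(u'ropa nueva 2012 ... 170 \u20ac')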
Upvotes: 1
Views: 2288
Reputation: 3225
Did you use the decode & encode functions correctly?
# -*- coding: utf-8 -*-
from nltk import ne_chunk, pos_tag
from nltk.tokenize.punkt import PunktSentenceTokenizer
from nltk.tokenize.treebank import TreebankWordTokenizer

sentence_tokenizer = PunktSentenceTokenizer()
word_tokenizer = TreebankWordTokenizer()

text = "€"
text = text.decode('utf-8')  # byte string -> unicode before any NLTK call
sentences = sentence_tokenizer.tokenize(text)
tokens = [word_tokenizer.tokenize(sentence) for sentence in sentences]
tagged = [pos_tag(token) for token in tokens]
When needed, try to use:
print your_string.encode("utf-8")
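A sketch of how I print the tagged output (word is unicode, so I encode only at print time; the "meaning" glosses below are readable descriptions of the raw Penn tags):
for sentence in tagged:
    for word, tag in sentence:
        # word is unicode; encode it only for terminal output
        print 'word:', word.encode('utf-8'), 'tag:', tag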
I have no problems currently. The only issue is that for $50 it says:
word: $    meaning: dollar
word: 50   meaning: numeral, cardinal
This is correct. For €50 it says:
word: €50  meaning: -NONE-
This is incorrect. With a space between the € sign and the number, it says:
word: €    meaning: noun, common, singular or mass
word: 50   meaning: numeral, cardinal
Which is more correct.
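So a workaround (a sketch, assuming Python 2 unicode strings as above) is to insert that space before tokenizing:
import re
text = u'Pack 850\u20ac'
text = re.sub(ur'(\d)\u20ac', ur'\1 \u20ac', text)
# text is now u'Pack 850 \u20ac', so the sign and the number tokenize separately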
Upvotes: 0
Reputation: 20300
You seem to be misunderstanding the meaning of r'\u20ac'.
The r prefix indicates a raw string, not a unicode string; it is still a standard byte string. So using a unicode escape in the pattern only gets you a literal backslash:
>>> p = re.compile(r'\u20ac')
>>> p.pattern
'\\u20ac'
>>> print p.pattern
\u20ac
If you want to use raw strings and unicode escapes, you'll have to use raw unicode strings, indicated by ur instead of just r:
>>> p = re.compile(ur'\u20ac')
>>> p.pattern
u'\u20ac'
>>> print p.pattern
€
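Applied to the strings from the question, the unicode pattern then finds every occurrence:
>>> re.subn(ur'\u20ac', u' euros', u'730\u20ac Pack 850\u20ac')
(u'730 euros Pack 850 euros', 2)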
Upvotes: 1