Reputation: 1011
I have text scraped from the internet (I think it was Spanish text encoded in "latin-1" and decoded to unicode when scraped). The text looks something like this:
730\u20ac.\r\n\nropa nueva 2012 ... 5,10 muy buen estado..... 170 \u20ac\r\n\nPack 850\u20ac,
After that I do some replacements on the text to normalize some words, e.g. replacing the € symbol (\u20ac) with "euros" using the regex pair (r'\u20ac', r' euros').
Here my problem seems to start... If I do not encode each string to "UTF-8" before applying the regex, the regex won't find any occurrences (even though plenty of occurrences exist)...
Anyway, after encoding to UTF-8, the regex (r'\u20ac', r' euros') works.
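For illustration, a shortened Python 2 sketch of the behaviour I see (the sample is abridged; the byte-oriented pattern is the second one in my list below):
import re
text = u'730\u20ac\r\n\nropa nueva 2012 ... 170 \u20ac\r\n\nPack 850\u20ac,'
# on the unicode string the byte-oriented pattern finds nothing...
print re.subn(ur' \xe2\x82\xac', r' euros', text)[1]                   # 0
# ...but after encoding, the euro sign becomes exactly those three bytes
print re.subn(ur' \xe2\x82\xac', r' euros', text.encode('utf-8'))[1]   # 1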
After that I tokenize and tag all the strings. When I try to use the RegexpParser I then get the
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
My question is: if I have already encoded it to UTF-8, how come I have a problem now? And what would you suggest to avoid it?
Is there a way to do the encoding process once and for all, like below? If so, what should I do for the second step (encode/decode it anyway)?
Get text -> encode/decode it anyway... -> Work on the text without any issue
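Something like this is what I imagine (a sketch; scrape_page is a hypothetical stand-in for however the text actually arrives):
# Decode once at the boundary, work in unicode throughout, encode only on output.
raw = scrape_page()              # hypothetical: returns latin-1 encoded bytes
text = raw.decode('latin-1')     # bytes -> unicode, done exactly once
# ... all regex work, tokenizing and tagging happens on unicode strings ...
print text.encode('utf-8')       # unicode -> bytes, only when printing or writing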
Thanks in advance for any help!! I am new to programming and it is killing me...
Code detail:
Regex replacement function:
import re

replacement_patterns = [
    (ur' \\u20ac', ur' euros'),
    (ur' \xe2\x82\xac', r' euros'),
    (ur' \b[eE]?[uU]?[rR]\b', r' euros'),
    (ur' \b([0-9]+)[eE][uU]?[rR]?[oO]?[sS]?\b', ur' \1 euros')
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex, re.IGNORECASE), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            (s, count) = re.subn(pattern, repl, s)
        return s
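For reference, I apply it per string like this (the sample is from the text above):
replacer = RegexpReplacer()
print replacer.replace(u'ropa nueva 2012 ... 170 \u20ac')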
Upvotes: 1
Views: 2288
Reputation: 3225
Did you use the decode & encode functions correctly?
# -*- coding: utf-8 -*-
from nltk import ne_chunk, pos_tag
from nltk.tokenize.punkt import PunktSentenceTokenizer
from nltk.tokenize.treebank import TreebankWordTokenizer

sentence_tokenizer = PunktSentenceTokenizer()
word_tokenizer = TreebankWordTokenizer()

text = "€"
text = text.decode('utf-8')  # byte string -> unicode before any NLTK call
sentences = sentence_tokenizer.tokenize(text)
tokens = [word_tokenizer.tokenize(sentence) for sentence in sentences]
tagged = [pos_tag(token) for token in tokens]
When needed, try to use:
print your_string.encode("utf-8")
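A sketch of how I print the tagged output (word is unicode, so I encode only at print time; the "meaning" glosses below are readable descriptions of the raw Penn tags):
for sentence in tagged:
    for word, tag in sentence:
        # word is unicode; encode it only for terminal output
        print 'word:', word.encode('utf-8'), 'tag:', tag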
I have no problems currently. The only issue is that for $50 it says:
word: $    meaning: dollar
word: 50   meaning: numeral, cardinal
This is correct. For €50 it says:
word: €50  meaning: -NONE-
This is incorrect. With a space between the € sign and the number, it says:
word: €    meaning: noun, common, singular or mass
word: 50   meaning: numeral, cardinal
Which is more correct.
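So a workaround (a sketch, assuming Python 2 unicode strings as above) is to insert that space before tokenizing:
import re
text = u'Pack 850\u20ac'
text = re.sub(ur'(\d)\u20ac', ur'\1 \u20ac', text)
# text is now u'Pack 850 \u20ac', so the sign and the number tokenize separately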
Upvotes: 0
Reputation: 20300
You seem to be misunderstanding the meaning of r'\u20ac'.
The r prefix indicates a raw string, not a unicode string; it is still a standard byte string. So using a unicode escape in the pattern only gets you a literal backslash:
>>> p = re.compile(r'\u20ac')
>>> p.pattern
'\\u20ac'
>>> print p.pattern
\u20ac
If you want to use raw strings and unicode escapes, you'll have to use raw unicode strings, indicated by ur instead of just r:
>>> p = re.compile(ur'\u20ac')
>>> p.pattern
u'\u20ac'
>>> print p.pattern
€
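Applied to the strings from the question, the unicode pattern then finds every occurrence:
>>> re.subn(ur'\u20ac', u' euros', u'730\u20ac Pack 850\u20ac')
(u'730 euros Pack 850 euros', 2)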
Upvotes: 1