ChamingaD

Reputation: 2928

UnicodeDecodeError: 'ascii' codec can't decode byte - Python

This is related to my earlier questions (referenced in step 3 below).

I have a Python app that performs the following tasks:

# -*- coding: utf-8 -*-

1. Read a Unicode (non-English) text file:

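import codecs

def readfile(file, access, encoding):
    # codecs.open decodes the file with the given encoding while reading
    with codecs.open(file, access, encoding) as f:
        return f.read()

# 'utf-8-sig' also strips a leading byte order mark, if present
text = readfile('teststory.txt','r','utf-8-sig')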

This returns the contents of the given text file as a unicode string.
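As a quick check of what that call returns, here is a minimal, self-contained sketch. It is not part of the app: the temporary file and the sample character (U+0D9A) are assumptions made only for illustration. It writes a UTF-8 file that starts with a byte order mark and reads it back with the same `readfile` helper.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

def readfile(path, access, encoding):
    # Same helper as above: decode the file while reading it
    with codecs.open(path, access, encoding) as f:
        return f.read()

# Hypothetical sample file: a UTF-8 BOM followed by non-ASCII text
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + u"\u0d9a b".encode("utf-8"))
    text = readfile(path, "r", "utf-8-sig")
finally:
    os.remove(path)

# text is decoded (unicode) text, and the BOM has been stripped
has_bom = text.startswith(u"\ufeff")
```

The point that matters later is that `text` is decoded text, not raw bytes, so anything it is concatenated with should be decoded text as well.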

2. Split the text into sentences.

3. Go through the words in each sentence and identify verbs, nouns, etc.

See: Searching for Unicode characters in Python, and Find word infront and behind of a Python list

4. Add them to separate variables, as below:

nouns = '"CAR" | "BUS"'

verbs = '"DRIVES" | "HITS"'

5. Now I'm trying to pass them into an NLTK context-free grammar, as below:

grammar = nltk.parse_cfg('''
    S -> NP VP
    NP -> N
    VP -> V | NP V

    N -> '''+nouns+'''
    V -> '''+verbs+'''
    ''')

It gives me the following error:

line 40, in V -> '''+verbs+''' UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 114: ordinal not in range(128)

How can I overcome this and pass the variables into the NLTK CFG?

Complete Code - https://dl.dropboxusercontent.com/u/4959382/new.zip

Upvotes: 0

Views: 3035

Answers (1)

Dima Tisnek

Reputation: 11781

Overall, you have these strategies:

  • treat the input as a sequence of bytes, so that both the input and the grammar are utf-8-encoded data (bytes)
  • treat the input as a sequence of unicode code points, so that both the input and the grammar are unicode
  • represent unicode code points in ascii, that is, use escape sequences
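The three strategies can be sketched with a single non-ASCII word. This is a minimal illustration rather than the asker's code: the sample character (U+0D9A) is an assumption, chosen because its UTF-8 encoding starts with the byte 0xe0 named in the error message. The last step performs explicitly the ASCII decode that Python 2 does implicitly when a byte string is combined with a unicode string, which is one likely source of the error.

```python
# -*- coding: utf-8 -*-
word = u"\u0d9a"  # hypothetical sample word (one non-ASCII character)

# Strategy 1: everything as utf-8-encoded bytes
as_bytes = word.encode("utf-8")

# Strategy 2: everything as unicode code points
as_unicode = as_bytes.decode("utf-8")

# Strategy 3: pure-ascii escape sequences
as_escaped = word.encode("unicode_escape")

# Mixing the representations is what fails: decoding the utf-8 bytes
# as ascii raises the same kind of error as in the question.
try:
    as_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    error_message = str(exc)
```

Whichever strategy you pick, the key is to apply it to every piece of the grammar string, including the `nouns` and `verbs` variables, before concatenating.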

The nltk installed with pip (2.0.4 in my case) doesn't accept unicode directly, but it does accept quoted unicode constants; all of the following appear to work:

In [26]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar')
Out[26]: <Grammar with 2 productions>

In [27]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("utf-8"))
Out[27]: <Grammar with 2 productions>

In [28]: nltk.parse_cfg(u'S -> "\N{EURO SIGN}" | bar'.encode("unicode_escape"))
Out[28]: <Grammar with 2 productions>

Note that I quoted the unicode text but not the normal text: "€" vs. bar.
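Applied to the grammar in the question, the quoting could look roughly like this. It is a sketch under assumptions: `quote_terminals` is a hypothetical helper, the word lists are made up, and the NLTK call itself is left out so the snippet runs without the library.

```python
# -*- coding: utf-8 -*-

def quote_terminals(words):
    """Hypothetical helper: join words into quoted CFG alternatives."""
    return u" | ".join(u'"{0}"'.format(w) for w in words)

# Made-up word lists; in the app these come from the tagged sentences
nouns = quote_terminals([u"CAR", u"BUS"])
verbs = quote_terminals([u"DRIVES", u"HITS", u"\N{EURO SIGN}"])

# Build the whole grammar as unicode, so no implicit ascii decode occurs
grammar_text = u"\n".join([
    u"S -> NP VP",
    u"NP -> N",
    u"VP -> V | NP V",
    u"N -> " + nouns,
    u"V -> " + verbs,
])
```

`grammar_text` can then be passed to `nltk.parse_cfg` as in the session above, or encoded first with `.encode("utf-8")` or `.encode("unicode_escape")` if that works better in your setup. In NLTK 3 the equivalent entry point is `nltk.CFG.fromstring`.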

Upvotes: 1
