Reputation: 3511
I'm trying to compute the frequencies of words in a UTF-8 encoded text file with the following code. The file content is tokenized successfully, but when I loop through the words, my program fails to handle the accented characters.
import csv
import operator  # needed for operator.itemgetter() below
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# lang, path, file_name and stem() are assumed to be defined earlier
print "computing word frequency..."

if lang == "fr":
    stop = stopwords.words("french")
    stop = [word.encode("utf-8") for word in stop]
    stop.append("les")
    stop.append("a")
elif lang == "en":
    stop = stopwords.words("english")

rb = csv.reader(open(path + file_name))
wb = csv.writer(open('results/all_words_' + file_name, 'wb'))
tokenizer = RegexpTokenizer(r'\w+')

word_dict = {}
i = 0
for row in rb:
    i += 1
    if i == 5:
        break
    text = tokenizer.tokenize(row[0].lower())
    text = [j for j in text if j not in stop]
    #print text
    for doc in text:
        try:
            try:
                word_dict[doc] += 1
            except KeyError:
                word_dict[doc] = 1
        except:
            print row[0]
            print " ".join(text)

word_dict2 = sorted(word_dict.iteritems(), key=operator.itemgetter(1), reverse=True)
if lang == "en":  # was "English", which never matches the "en" value set above
    for item in word_dict2:
        wb.writerow([item[0], stem(item[0]), item[1]])
else:
    for item in word_dict2:
        wb.writerow([item[0], item[1]])
print "Finished"
Input text file:
rt annesorose envie crêpes
envoyé jerrylee bonjour monde dimanche crepes dimanche
bonnes crepes tour nouveau vélo
aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent
The output written to a file is garbling certain words.
Results output:
crepes,2
dimanche,2
rt,1
nouveau,1
envie,1
v�,1
jerrylee,1
cleantext,1
lo,1
bonnes,1
tour,1
crêpes,1
monde,1
bonjour,1
annesorose,1
envoy�,1
envoy� is envoyé in the actual file.
How can I correct this problem with accented characters?
Upvotes: 1
Views: 1584
Reputation: 122042
If you're using Python 2.x, you can reset the default encoding to 'utf8':
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Alternatively, you can use the ucsv module; see General Unicode/UTF-8 support for csv files in Python 2.6, and the sketch below.
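For illustration, here is a minimal sketch of the classic Python 2 recipe such wrappers build on; unicode_csv_reader is a hypothetical helper name, and this assumes a UTF-8 file without embedded newlines, like the someutf8.txt created below:

import csv

def unicode_csv_reader(utf8_file, **kwargs):
    # let csv split the raw bytes, then decode each cell to unicode
    for row in csv.reader(utf8_file, **kwargs):
        yield [cell.decode('utf-8') for cell in row]

for row in unicode_csv_reader(open('someutf8.txt', 'rb')):
    print u' '.join(row)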
Or use io.open():
$ echo """rt annesorose envie crêpes
> envoyé jerrylee bonjour monde dimanche crepes dimanche
> The output written in a file is destroying certain words.
> bonnes crepes tour nouveau vélo
> aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent""" > someutf8.txt
$ python
>>> import io, csv
>>> text = io.open('someutf8.txt', 'r', encoding='utf8').read().split('\n')
>>> for row in text:
...     print row
...
rt annesorose envie crêpes
envoyé jerrylee bonjour monde dimanche crepes dimanche
The output written in a file is destroying certain words.
bonnes crepes tour nouveau vélo
aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent
Lastly, rather than hand-rolling such a complex reading and counting routine, simply use FreqDist from NLTK; see section 3.1 of http://www.nltk.org/book/ch01.html and the sketch below.
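A minimal FreqDist sketch, assuming NLTK 3.x (where FreqDist is Counter-like and supports most_common()) and the someutf8.txt file from above:

import io
from nltk import word_tokenize, FreqDist

text = io.open('someutf8.txt', 'r', encoding='utf8').read()
fdist = FreqDist(word_tokenize(text))    # token -> frequency
for word, freq in fdist.most_common(5):  # five most frequent tokens
    print word, freq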
Or, personally, I prefer collections.Counter:
$ python
>>> import io
>>> text = io.open('someutf8.txt', 'r', encoding='utf8').read()
>>> from collections import Counter
>>> from nltk import word_tokenize
>>> Counter(word_tokenize(text))
Counter({u'crepes': 2, u'dimanche': 2, u'fera': 1, u'certain': 1, u'is': 1, u'bonnes': 1, u'v\xe9lo': 1, u'batteries': 1, u'envoy\xe9': 1, u'vu': 1, u'file': 1, u'in': 1, u'The': 1, u'rt': 1, u'jerrylee': 1, u'destroying': 1, u'bien': 1, u'jours': 1, u'.': 1, u'written': 1, u'annesorose': 1, u'annoncent': 1, u'nouveau': 1, u'envie': 1, u'hard': 1, u'cr\xeapes': 1, u'\xe7a': 1, u'monde': 1, u'words': 1, u'bonjour': 1, u'a': 1, u'crepe': 1, u'soleil': 1, u'tour': 1, u'aime': 1, u'output': 1, u'recharger': 1})
>>> myFreqDist = Counter(word_tokenize(text))
>>> for word, freq in myFreqDist.items():
...     print word, freq
...
fera 1
crepes 2
certain 1
is 1
bonnes 1
vélo 1
batteries 1
envoyé 1
vu 1
file 1
in 1
The 1
rt 1
jerrylee 1
destroying 1
bien 1
jours 1
. 1
written 1
dimanche 2
annesorose 1
annoncent 1
nouveau 1
envie 1
hard 1
crêpes 1
ça 1
monde 1
words 1
bonjour 1
a 1
crepe 1
soleil 1
tour 1
aime 1
output 1
recharger 1
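And to get the counts back into a CSV file, as your original script intends, here's a sketch under Python 2 (py2's csv module works on byte strings, so each unicode field is encoded to UTF-8 first; word_counts.csv is just an example output path):

import csv
import io
from collections import Counter
from nltk import word_tokenize

text = io.open('someutf8.txt', 'r', encoding='utf8').read()
counts = Counter(word_tokenize(text))
out = open('word_counts.csv', 'wb')  # example output path
writer = csv.writer(out)
for word, freq in counts.most_common():
    # encode the unicode token to UTF-8 bytes for py2's csv
    writer.writerow([word.encode('utf-8'), freq])
out.close()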
Upvotes: 2