Reputation: 3511
I'm trying to compute the frequencies of words in a UTF-8 encoded text file with the following code. The file content is tokenized successfully, but when I loop through the words, my program fails to handle the accented characters.
import csv
import operator  # needed for operator.itemgetter() below
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

# lang, path, file_name and stem() are assumed to be defined earlier
print "computing word frequency..."

if lang == "fr":
    stop = stopwords.words("french")
    stop = [word.encode("utf-8") for word in stop]
    stop.append("les")
    stop.append("a")
elif lang == "en":
    stop = stopwords.words("english")

rb = csv.reader(open(path + file_name))
wb = csv.writer(open('results/all_words_' + file_name, 'wb'))
tokenizer = RegexpTokenizer(r'\w+')

word_dict = {}
i = 0
for row in rb:
    i += 1
    if i == 5:
        break
    text = tokenizer.tokenize(row[0].lower())
    text = [j for j in text if j not in stop]
    #print text
    for doc in text:
        try:
            try:
                word_dict[doc] += 1
            except KeyError:
                word_dict[doc] = 1
        except:
            print row[0]
            print " ".join(text)

word_dict2 = sorted(word_dict.iteritems(), key=operator.itemgetter(1), reverse=True)
if lang == "en":  # was "English", which never matches the "en" value set above
    for item in word_dict2:
        wb.writerow([item[0], stem(item[0]), item[1]])
else:
    for item in word_dict2:
        wb.writerow([item[0], item[1]])
print "Finished"
Input text file:
rt annesorose envie crêpes
envoyé jerrylee bonjour monde dimanche crepes dimanche
bonnes crepes tour nouveau vélo
aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent
The output written to a file is garbling certain words.
Results output:
crepes,2
dimanche,2
rt,1
nouveau,1
envie,1
v�,1
jerrylee,1
cleantext,1
lo,1
bonnes,1
tour,1
crêpes,1
monde,1
bonjour,1
annesorose,1
envoy�,1
envoy� is envoyé in the actual file.
How can I correct this problem with accented characters?
Upvotes: 1
Views: 1584
Reputation: 122042
If you're using Python 2.x, you can reset the default encoding to 'utf8':
import sys
reload(sys)
sys.setdefaultencoding('utf8')
Alternatively, you can use the ucsv module; see General Unicode/UTF-8 support for csv files in Python 2.6, and the sketch below.
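For illustration, here is a minimal sketch of the classic Python 2 recipe such wrappers build on; unicode_csv_reader is a hypothetical helper name, and this assumes a UTF-8 file without embedded newlines, like the someutf8.txt created below:

import csv

def unicode_csv_reader(utf8_file, **kwargs):
    # let csv split the raw bytes, then decode each cell to unicode
    for row in csv.reader(utf8_file, **kwargs):
        yield [cell.decode('utf-8') for cell in row]

for row in unicode_csv_reader(open('someutf8.txt', 'rb')):
    print u' '.join(row)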
Or use io.open():
$ echo """rt annesorose envie crêpes
> envoyé jerrylee bonjour monde dimanche crepes dimanche
> The output written in a file is destroying certain words.
> bonnes crepes tour nouveau vélo
> aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent""" > someutf8.txt
$ python
>>> import io, csv
>>> text = io.open('someutf8.txt', 'r', encoding='utf8').read().split('\n')
>>> for row in text:
...     print row
...
rt annesorose envie crêpes
envoyé jerrylee bonjour monde dimanche crepes dimanche
The output written in a file is destroying certain words.
bonnes crepes tour nouveau vélo
aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent
Lastly, rather than hand-rolling such a complex reading and counting routine, simply use FreqDist from NLTK; see section 3.1 of http://www.nltk.org/book/ch01.html and the sketch below.
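A minimal FreqDist sketch, assuming NLTK 3.x (where FreqDist is Counter-like and supports most_common()) and the someutf8.txt file from above:

import io
from nltk import word_tokenize, FreqDist

text = io.open('someutf8.txt', 'r', encoding='utf8').read()
fdist = FreqDist(word_tokenize(text))    # token -> frequency
for word, freq in fdist.most_common(5):  # five most frequent tokens
    print word, freq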
Or, personally, I prefer collections.Counter:
$ python
>>> import io
>>> text = io.open('someutf8.txt', 'r', encoding='utf8').read()
>>> from collections import Counter
>>> from nltk import word_tokenize
>>> Counter(word_tokenize(text))
Counter({u'crepes': 2, u'dimanche': 2, u'fera': 1, u'certain': 1, u'is': 1, u'bonnes': 1, u'v\xe9lo': 1, u'batteries': 1, u'envoy\xe9': 1, u'vu': 1, u'file': 1, u'in': 1, u'The': 1, u'rt': 1, u'jerrylee': 1, u'destroying': 1, u'bien': 1, u'jours': 1, u'.': 1, u'written': 1, u'annesorose': 1, u'annoncent': 1, u'nouveau': 1, u'envie': 1, u'hard': 1, u'cr\xeapes': 1, u'\xe7a': 1, u'monde': 1, u'words': 1, u'bonjour': 1, u'a': 1, u'crepe': 1, u'soleil': 1, u'tour': 1, u'aime': 1, u'output': 1, u'recharger': 1})
>>> myFreqDist = Counter(word_tokenize(text))
>>> for word, freq in myFreqDist.items():
...     print word, freq
...
fera 1
crepes 2
certain 1
is 1
bonnes 1
vélo 1
batteries 1
envoyé 1
vu 1
file 1
in 1
The 1
rt 1
jerrylee 1
destroying 1
bien 1
jours 1
. 1
written 1
dimanche 2
annesorose 1
annoncent 1
nouveau 1
envie 1
hard 1
crêpes 1
ça 1
monde 1
words 1
bonjour 1
a 1
crepe 1
soleil 1
tour 1
aime 1
output 1
recharger 1
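And to get the counts back into a CSV file, as your original script intends, here's a sketch under Python 2 (py2's csv module works on byte strings, so each unicode field is encoded to UTF-8 first; word_counts.csv is just an example output path):

import csv
import io
from collections import Counter
from nltk import word_tokenize

text = io.open('someutf8.txt', 'r', encoding='utf8').read()
counts = Counter(word_tokenize(text))
out = open('word_counts.csv', 'wb')  # example output path
writer = csv.writer(out)
for word, freq in counts.most_common():
    # encode the unicode token to UTF-8 bytes for py2's csv
    writer.writerow([word.encode('utf-8'), freq])
out.close()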
Upvotes: 2