Reputation: 2238
I have this little script, which is basically just a test for a larger program, and I have a problem with the encoding. When I try to write UTF-8 characters such as øæå to a file, they are not encoded properly. Why is that, and how can I solve it?
#!/usr/bin/python
# -*- coding: utf-8 -*-
import nltk
from nltk.collocations import *
collocations = open('bam.txt', 'w')
bigram_measures = nltk.collocations.BigramAssocMeasures()
tokens = nltk.wordpunct_tokenize("Hei på deg øyerusk, du er meg en gammel dust, neida neida, det er ikke helt sant da."
"Men du, hvorfor så brusk, ikke klut i din susk på en runkete lust")
finder = BigramCollocationFinder.from_words(tokens)
# finder.apply_freq_filter(3)
scored = finder.score_ngrams(bigram_measures.raw_freq)
for i in scored:
    print i[0][0] + ' ' + i[0][1] + ': ' + str(i[1]) + '\n'
    collocations.write(i[0][0] + ' ' + i[0][1] + ': ' + str(i[1]) + '\n')
collocations.close()
Upvotes: 1
Views: 470
Reputation: 167
There are any number of reasons why the encoding isn't working properly. Unicode is a vast and varied mess. The Python HOWTO on Unicode is somewhat helpful for background info: https://docs.python.org/3/howto/unicode.html
When I just need stuff to work, I've had success forcing text into Unicode with ftfy, available on PyPI: https://pypi.python.org/pypi/ftfy/3.3.0
Example usage:
>>> import ftfy
>>> print(ftfy.fix_text('ünicode'))
ünicode
>>> print(ftfy.fix_text_encoding('AHÅ™, the new sofa from IKEA®'))
AHÅ™, the new sofa from IKEA®
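For background, the kind of damage ftfy is meant to repair can be reproduced with the standard library alone. This is only a minimal sketch, assuming Python 3; the variable names are made up for illustration, and it is not how ftfy itself works internally:

```python
# Python 3 sketch: how UTF-8 text turns into mojibake, and how to undo it.
original = "øæå"

# Encode as UTF-8, then (wrongly) decode the same bytes as Latin-1.
# Each two-byte character becomes two unrelated Latin-1 characters.
garbled = original.encode("utf-8").decode("latin-1")

# Reverse the mistake: re-encode as Latin-1, then decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

If the file looks like the garbled form when you open it, the bytes are probably fine and the viewer is guessing the wrong encoding.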
Upvotes: 1
Reputation: 8709
The thing is that nltk.wordpunct_tokenize doesn't work with non-ASCII data. It is better to use PunktWordTokenizer from nltk.tokenize.punkt. So import it as:
from nltk.tokenize.punkt import PunktWordTokenizer as PT
and replace:
tokens = nltk.wordpunct_tokenize("Hei på deg øyerusk, du er meg en gammel dust, neida neida, det er ikke helt sant da."
"Men du, hvorfor så brusk, ikke klut i din susk på en runkete lust")
with:
tokens = PT().tokenize("Hei på deg øyerusk, du er meg en gammel dust, neida neida, det er ikke helt sant da."
"Men du, hvorfor så brusk, ikke klut i din susk på en runkete lust")
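Independently of which tokenizer you pick, it may help to keep the text as Unicode throughout and open the output file with an explicit encoding. Below is a minimal sketch under a few assumptions: it targets Python 3, it uses a plain regex as a stand-in for the NLTK tokenizer, it counts bigrams with collections.Counter instead of BigramCollocationFinder, and the file name bam_sketch.txt is hypothetical.

```python
# Sketch only: a regex stand-in for the tokenizer, not NLTK itself.
import io
import os
import re
import tempfile
from collections import Counter

text = ("Hei på deg øyerusk, du er meg en gammel dust. "
        "Men du, hvorfor så brusk")

# Runs of word characters, or runs of non-space punctuation.
# In Python 3, \w matches Unicode letters such as ø, æ and å by default.
tokens = re.findall(r"\w+|[^\w\s]+", text)

# Count adjacent token pairs (bigrams); total is used for relative frequency.
bigrams = Counter(zip(tokens, tokens[1:]))
total = sum(bigrams.values())

# Hypothetical output path in the system temp directory.
path = os.path.join(tempfile.gettempdir(), "bam_sketch.txt")

# The explicit encoding decides how the characters are written to disk.
with io.open(path, "w", encoding="utf-8") as out:
    for (first, second), count in bigrams.most_common():
        out.write("%s %s: %s\n" % (first, second, count / total))

# Read the file back with the same encoding to confirm the round trip.
with io.open(path, encoding="utf-8") as check:
    content = check.read()
print(content)
```

On Python 2 the same idea would need unicode literals (u"...") and io.open or codecs.open, since the built-in open there has no encoding argument.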
Upvotes: 3