Arash Saidi

Reputation: 2238

UTF-8 issues with Python and NLTK

I have this little script, which is basically just a test for a larger program. I have a problem with the encoding: when I write to a file, non-ASCII characters such as øæå are not encoded properly. Why is that, and how can I solve it?

#!/usr/bin/python
# -*- coding: utf-8 -*-

import nltk
from nltk.collocations import *

collocations = open('bam.txt', 'w')
bigram_measures = nltk.collocations.BigramAssocMeasures()
tokens = nltk.wordpunct_tokenize("Hei på deg øyerusk, du er meg en gammel dust, neida neida, det er ikke helt sant da."
                                 "Men du, hvorfor så brusk, ikke klut i din susk på en runkete lust")
finder = BigramCollocationFinder.from_words(tokens)
# finder.apply_freq_filter(3)
scored = finder.score_ngrams(bigram_measures.raw_freq)
for i in scored:
    print i[0][0] + ' ' + i[0][1] + ': ' + str(i[1]) + '\n'
    collocations.write(i[0][0] + ' ' + i[0][1] + ': ' + str(i[1]) + '\n')

collocations.close()

Upvotes: 1

Views: 470

Answers (2)

Merjit

Reputation: 167

There are any number of reasons why the encoding isn't working properly. Unicode is a vast and varied mess. The Python HOWTO on Unicode is somewhat helpful for background info: https://docs.python.org/3/howto/unicode.html

When I just need stuff to work, I've had success forcing text into proper Unicode with ftfy, available on PyPI: https://pypi.python.org/pypi/ftfy/3.3.0

Example usage:

>>> import ftfy
>>> print(ftfy.fix_text('ünicode'))
ünicode

>>> print(ftfy.fix_text_encoding('AHÅ™, the new sofa from IKEA®'))
AHÅ™, the new sofa from IKEA®
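
For background on the kind of damage such a tool undoes, here is a minimal sketch using only the standard library. It assumes the classic case where UTF-8 bytes are decoded as Latin-1; real-world breakage can involve other codecs, so treat this as an illustration rather than a general repair routine. The variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

original = "øæå"

# Encode correctly as UTF-8, then decode with the wrong codec (Latin-1).
# Each two-byte character turns into two unrelated characters.
garbled = original.encode("utf-8").decode("latin-1")

# Latin-1 maps every byte to exactly one code point, so reversing the two
# steps recovers the original text in this particular case.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

If the wrong codec is not known in advance, guessing it by hand gets tedious quickly, which is where a library like ftfy can help.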

Upvotes: 1

Irshad Bhat

Reputation: 8709

The thing is, nltk.wordpunct_tokenize doesn't handle non-ASCII data properly when it is given a plain byte string. It is better to use PunktWordTokenizer from nltk.tokenize.punkt. So import it as:

from nltk.tokenize.punkt import PunktWordTokenizer as PT

and replace:

tokens = nltk.wordpunct_tokenize("Hei på deg øyerusk, du er meg en gammel dust, neida neida, det er ikke helt sant da."
                                 "Men du, hvorfor så brusk, ikke klut i din susk på en runkete lust")

with:

tokens = PT().tokenize("Hei på deg øyerusk, du er meg en gammel dust, neida neida, det er ikke helt sant da."
                       "Men du, hvorfor så brusk, ikke klut i din susk på en runkete lust")

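To illustrate why byte strings trip up a word/punctuation tokenizer, here is a rough sketch that runs without NLTK. It assumes the regular expression below is a fair stand-in for the pattern wordpunct_tokenize uses, and it writes to a temporary directory rather than the question's working directory. It also shows one way to write the result out as UTF-8 with io.open; this is an alternative to switching tokenizers, not something taken from the question's code.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io
import os
import re
import tempfile

# Assumption: stand-in for nltk.wordpunct_tokenize, using a
# word-or-punctuation pattern so the sketch runs without NLTK installed.
PATTERN = r"\w+|[^\w\s]+"

text = "Hei på deg øyerusk, du er meg en gammel dust"  # unicode, not bytes

# Tokenizing unicode text keeps 'på' and 'øyerusk' whole.
unicode_tokens = re.findall(PATTERN, text, re.UNICODE)

# Tokenizing the UTF-8 bytes splits them, because the bytes that make up
# 'å' and 'ø' are not ASCII word characters.
byte_tokens = re.findall(PATTERN.encode("ascii"), text.encode("utf-8"))

# Hypothetical output location; the question writes to 'bam.txt'.
path = os.path.join(tempfile.mkdtemp(), "bam.txt")

# io.open encodes unicode to UTF-8 on write (works on Python 2 and 3).
with io.open(path, "w", encoding="utf-8") as out:
    for first, second in zip(unicode_tokens, unicode_tokens[1:]):
        out.write(first + " " + second + "\n")

# Read the first line back to confirm the characters survived.
with io.open(path, encoding="utf-8") as check:
    first_line = check.readline()

print(unicode_tokens[:4])
print(byte_tokens[:3])
```

With the real library, the equivalent idea would be to pass a unicode string (a u"..." literal on Python 2) to the tokenizer and open the output file with an explicit encoding.
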
Upvotes: 3
