Sahand

Reputation: 8370

UnicodeDecodeError when altering text file

I'm writing a program that takes in a text file and produces another text file where: 1. Swedish letters are formatted correctly. 2. All words that are not alphabetic are removed. 3. All capital letters have been converted to lowercase letters.

This is my code:

infile = open("unigram.wfreq","r")
outfile = open("bigram.txt","w")

line = infile.readline()
while line != "":
    wordandcount = line.split()
    if wordandcount:  # guard against blank lines
        word = wordandcount[0]
        ## Fix å ä ö, which show up garbled in the file:
        ## å == √• ä == √§ ö == √∂
        word = word.replace("√•","å")
        word = word.replace("√§","ä")
        word = word.replace("√∂","ö")
        if word.isalpha():
            word = word.lower()
            outfile.write(word+"\n")
    print(line)
    line = infile.readline()

infile.close()
outfile.close()

And here is a sample of my unigram.wfreq file:

gruppselektion 4
lating 1
Morsing 2
varuhusen 7
FULLT 8
latino 3
mammutsl√§tten 2
f√∂gl√∂mma 1
varuhuset 47
livsnjutningen 1
nedtoning 1

When I run the file, I get the following error:

Traceback (most recent call last):
  File "formater.py", line 13, in <module>
    line = infile.readline()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 2732-2733: invalid continuation byte

If I look at the end of the terminal output I see the following:

Omgångsstarten 1

nationssplittring 1

Handtvätten 1

Three 47

domherre 1

http://www.dryden.se 1

Getryggarna 1

mineraloljor 21

If I find this segment in the unigram.wfreq file, I expect to see the word that generated the error right after mineraloljor (right?), but I see this:

Getryggarna 1
mineraloljor 21
MAYHEM 1
avvänjer 1
tilltrasslad 1
EUROPEISKT 1

Right after mineraloljor, there is MAYHEM. I don't see why this word should cause an error; there is nothing different about it!

How can I solve this error and continue the formatting of the entire file?

Upvotes: 0

Views: 705

Answers (3)

Alastair McCormack

Reputation: 27734

If f√∂gl√∂mma is in your sample file and is supposed to read föglömma but your Python script doesn't think it's UTF-8, then you've mojibaked the wrong encoding into your unigram.wfreq file.

At some point UTF-8 data has been interpreted as mac-roman then saved as mac-roman.

By saving the file again to UTF-8, you've further baked-in your previous errors.
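If that is what happened, the damage can often be reversed by undoing the wrong step: encode the garbled text back to mac-roman bytes, then decode those bytes as UTF-8. A minimal sketch, assuming the garbling really came from a UTF-8 to mac-roman mix-up (`fix_mojibake` is a hypothetical helper name, not part of any library):

```python
def fix_mojibake(text):
    """Undo a UTF-8 -> mac_roman mix-up; return text unchanged if it does not apply."""
    try:
        return text.encode("mac_roman").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this particular mix-up, so leave it alone.
        return text


print(fix_mojibake("f√∂gl√∂mma"))       # föglömma
print(fix_mojibake("mammutsl√§tten"))   # mammutslätten
```

Words that were never garbled, such as varuhuset, pass through unchanged, so the helper can be applied to every word in the file.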

Upvotes: 0

Mark Tolonen

Reputation: 177991

It looks like the file is encoded as UTF-8, but you are displaying it using the mac_roman encoding. Here's a test:

# coding: utf8
data = u'mammutslätten föglömma'
# Encode as UTF-8, then misread the bytes as mac_roman to reproduce the garbling.
print(data.encode('utf8').decode('mac_roman'))

Output:

mammutsl√§tten f√∂gl√∂mma

To read the file properly in Python, use the following to read Unicode strings using the correct encoding:

import io
# io.open accepts an encoding on both Python 2 and 3 (on Python 3 it is the built-in open).
with io.open('unigram.wfreq', encoding='utf8') as f:
    for line in f:
        print(line.strip())

Output:

gruppselektion 4
lating 1
Morsing 2
varuhusen 7
FULLT 8
latino 3
mammutslätten 2
föglömma 1
varuhuset 47
livsnjutningen 1
nedtoning 1
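Applied to the original task (keep only alphabetic words, lower-cased), the same approach might look like the sketch below on Python 3, where the built-in open takes the encoding directly. The function name `clean_wordlist` and the choice of `errors="replace"` are assumptions for illustration; with that setting, undecodable bytes become U+FFFD, so any word containing them fails `isalpha()` and is dropped rather than stopping the run.

```python
def clean_wordlist(in_path, out_path, encoding="utf-8"):
    """Write the lower-cased alphabetic words of a 'word count' file, one per line.

    Returns the number of words written.
    """
    kept = 0
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
    # (an assumption about the desired behaviour, not a requirement).
    with open(in_path, encoding=encoding, errors="replace") as infile, \
         open(out_path, "w", encoding="utf-8") as outfile:
        for line in infile:
            fields = line.split()
            if not fields:  # skip blank lines
                continue
            word = fields[0]
            # str.isalpha() is Unicode-aware: true for "föglömma",
            # false for URLs and for words containing U+FFFD.
            if word.isalpha():
                outfile.write(word.lower() + "\n")
                kept += 1
    return kept
```

It could be called as `clean_wordlist("unigram.wfreq", "bigram.txt")`, using the file names from the question.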

Upvotes: 0

Sahand

Reputation: 8370

So I found a simple solution to this problem. I opened my wfreq file with Sublime Text 2, where I can save it with the encoding UTF-8. This fixed the Swedish letter problem all by itself. I also changed the extension to .txt. After that I ran the Python code again (with the file names changed and the å ä ö part removed) and it worked fine.
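Before re-saving, it can help to see which lines were actually not valid UTF-8. The position in the traceback (bytes 2732-2733) most likely refers to the chunk Python was decoding at the time, not to the line after the last one printed, which would explain why MAYHEM looked innocent. A rough sketch that reads the file in binary mode and tries to decode each line separately (`find_undecodable_lines` is a hypothetical helper name):

```python
def find_undecodable_lines(path, encoding="utf-8"):
    """Return a list of (line_number, raw_bytes) for lines that fail to decode."""
    bad = []
    with open(path, "rb") as f:  # binary mode: nothing is decoded while reading
        for number, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad.append((number, raw))
    return bad
```

Calling `find_undecodable_lines("unigram.wfreq")` and printing the result shows the raw bytes of each offending line, which makes it easier to guess what encoding they were originally written in.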

Upvotes: 2
