Reputation: 8370
I'm writing a program that takes in a text file and produces another text file where: 1. Swedish letters are formatted correctly. 2. All words that are not alphabetic are removed. 3. All capital letters are converted to lowercase letters.
This is my code:
import string
infile = open("unigram.wfreq","r")
outfile = open("bigram.txt","w")
line = "Start"
while line != "":
    line = infile.readline()
    wordandcount = line.split()
    word = wordandcount[0]
    ##Fix å ä ö.
    ## å == √• ä == √§ ö == √∂
    if "å" in word or "ä" in word or "ö" in word:
        word = word.replace("√•","å")
        word = word.replace("√§","ä")
        word = word.replace("√∂","ö")
    if word.isalpha():
        word = word.lower()
        outfile.write(word+"\n")
    print(line)
And here is a sample of my unigram.wfreq file:
gruppselektion 4
lating 1
Morsing 2
varuhusen 7
FULLT 8
latino 3
mammutsl√§tten 2
f√∂gl√∂mma 1
varuhuset 47
livsnjutningen 1
nedtoning 1
When I run the file, I get the following error:
Traceback (most recent call last):
  File "formater.py", line 13, in <module>
    line = infile.readline()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 2732-2733: invalid continuation byte
If I look at the end of the terminal output I see the following:
Omgångsstarten 1
nationssplittring 1
Handtvätten 1
Three 47
domherre 1
http://www.dryden.se 1
Getryggarna 1
mineraloljor 21
If I find this segment in the unigram.wfreq file, I expect to see the word that generated the error right after mineraloljor (right?), but I see this:
Getryggarna 1
mineraloljor 21
MAYHEM 1
avvänjer 1
tilltrasslad 1
EUROPEISKT 1
Right after mineraloljor, there is MAYHEM. I don't see why this word should cause an error; there is nothing different about it!
How can I solve this error and continue the formatting of the entire file?
Upvotes: 0
Views: 705
Reputation: 27734
If f√∂gl√∂mma is in your sample file and is supposed to read föglömma, but your Python script doesn't think the file is UTF-8, then the wrong encoding has been baked into your unigram.wfreq file (mojibake).
At some point UTF-8 data has been interpreted as mac-roman then saved as mac-roman.
By saving the file again to UTF-8, you've further baked-in your previous errors.
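If the damage really is a single round of UTF-8 bytes read as mac-roman, it can often be undone by going the other way: encode the garbled text back to mac-roman bytes and decode those bytes as UTF-8. The following is only a minimal sketch under that assumption, written for Python 3; the function name is made up for illustration, and text that has been mangled more than once will not be fully repaired by it.

```python
def undo_mac_roman_mojibake(text):
    """Reverse one round of 'UTF-8 bytes read as mac_roman'.

    Returns the input unchanged when the round trip is impossible,
    which is what happens for text that was never garbled this way.
    """
    try:
        return text.encode("mac_roman").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(undo_mac_roman_mojibake("f√∂gl√∂mma"))  # prints: föglömma
print(undo_mac_roman_mojibake("varuhuset"))    # prints: varuhuset
```

Falling back to the original string keeps already-correct words such as föglömma intact, because their mac-roman bytes are not valid UTF-8.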
Upvotes: 0
Reputation: 177991
It looks like the file is encoded as UTF-8, but you are displaying it using the mac_roman encoding. Here's a test:
# coding: utf8
data = u'mammutslätten föglömma'
# Encode to UTF-8 bytes, then misread those bytes as mac_roman to reproduce the garbling.
print(data.encode('utf8').decode('mac_roman'))
Output:
mammutsl√§tten f√∂gl√∂mma
To read the file properly in Python, use the following to read Unicode strings using the correct encoding:
import io
# io.open accepts an explicit encoding on both Python 2 and Python 3.
with io.open('unigram.wfreq',encoding='utf8') as f:
    for line in f:
        print(line.strip())
Output:
gruppselektion 4
lating 1
Morsing 2
varuhusen 7
FULLT 8
latino 3
mammutslätten 2
föglömma 1
varuhuset 47
livsnjutningen 1
nedtoning 1
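Putting this together with the goals from the question (keep only alphabetic words, lowercase them, and get through the whole file), one possible Python 3 sketch looks like the following. The function names are hypothetical, and `errors="replace"` is my assumption about how to handle the bytes that triggered the `UnicodeDecodeError`: undecodable bytes become U+FFFD, which is not alphabetic, so words containing them are dropped along with URLs and numbers.

```python
def clean_words(lines):
    """Yield lowercase alphabetic words from 'word count' lines."""
    for line in lines:
        fields = line.split()
        if not fields:
            # Blank line (or end of file): nothing to index into.
            continue
        word = fields[0]
        if word.isalpha():
            # isalpha() is False for URLs, digits and words containing U+FFFD.
            yield word.lower()


def format_file(in_path, out_path):
    """Write the cleaned words from in_path to out_path, one per line."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    with open(in_path, encoding="utf-8", errors="replace") as infile, \
         open(out_path, "w", encoding="utf-8") as outfile:
        for word in clean_words(infile):
            outfile.write(word + "\n")
```

With the file names from the question this would be called as `format_file("unigram.wfreq", "bigram.txt")`. Note that text-mode files are decoded in buffered chunks, so a decode error can be raised for bytes that lie well past the last line printed; that is consistent with the error seeming to point at an innocent word like MAYHEM.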
Upvotes: 0
Reputation: 8370
So I found a simple solution to this problem. I opened my wfreq file in Sublime Text 2, which let me save it with UTF-8 encoding. This fixed the Swedish letter problem all by itself. I also changed the extension to .txt. After that I ran the Python code again (with the file names changed and the å ä ö part removed) and it worked fine.
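The same re-save can be done without an editor. This is only a sketch: the post does not say which encoding the editor assumed when it opened the file, so the source encoding is left as a parameter that has to be supplied, and both function names are hypothetical.

```python
def to_utf8_bytes(raw, src_encoding):
    """Decode raw bytes with src_encoding and re-encode them as UTF-8."""
    return raw.decode(src_encoding).encode("utf-8")


def resave_as_utf8(src_path, dst_path, src_encoding):
    """Read src_path as bytes and write a UTF-8 copy to dst_path."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(to_utf8_bytes(raw, src_encoding))
```

For example, `resave_as_utf8("unigram.wfreq", "unigram.txt", "mac_roman")` would apply if the file were really stored as mac_roman; picking the wrong source encoding produces a valid UTF-8 file that still contains the wrong characters.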
Upvotes: 2