LaughingMan

Reputation: 650

Wrong UTF-8 characters when writing to a file (Python)

I am having an encoding problem that seems very similar to other questions here, but not quite the same, and I cannot figure it out.

I thought I had grasped the concept of encoding, but I have special characters (æ, ø, å, ö, etc.) that look fine when printed but are not written to a file correctly (e.g. æ becomes Ã¦ when I write it to the file).

My code is as follows:

import codecs
import re

def sortWords(subject, articles, stopWordsFile):
    stopWords = [] 
    f = open(stopWordsFile)
    for lines in f:
        stopWords.append(lines.split(None, 1)[0].lower())

    for x in range(0,len(articles)):
        f = open(articles[x], 'r')
        article = f.read().lower()
        article = re.sub("[^a-zA-Z\æøåÆØÅöÖüÜ\ ]+", " ", article)
        article = [word for word in article.split() if word not in stopWords]
        print ' '.join(article)
        w = codecs.open(subject+str(x)+'.txt', 'w+')
        w.write(' '.join(article))



sortWords("hpv", ["vaccine_texts/hpv1.txt"], "stopwords.txt")

I have tried various encodings and opening the files with codecs.open(file, 'r', 'utf-8'), but to no avail. What am I missing here?

I'm on Ubuntu (I switched from Windows because its terminal wouldn't output the characters correctly).

Upvotes: 0

Views: 4073

Answers (2)

Serge Ballesta

Reputation: 148870

When you see something like Ã¦ in the text file (or, more generally, two characters of which the first is Ã), the file was most likely written correctly in UTF-8 and the editor (or the screen) is not handling UTF-8 correctly.

Let's look at æ. It is the Unicode character U+00E6. Encoded as UTF-8 it becomes the two bytes b'\xc3\xa6', and when those bytes are decoded as Latin-1 they display as 'Ã¦'.
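The round trip described above can be reproduced in a few lines. This is only a minimal sketch, written so that it runs under Python 3 (the question's code is Python 2); the variable names are illustrative and not taken from the question.

```python
# U+00E6 (LATIN SMALL LETTER AE), written as an escape so this source stays ASCII.
char = u"\u00e6"

# Encoding to UTF-8 yields two bytes, 0xC3 and 0xA6.
encoded = char.encode("utf-8")

# Decoding those same bytes as Latin-1 maps each byte to one character,
# which gives the two-character sequence an editor shows when it guesses wrong.
misread = encoded.decode("latin-1")

# Decoding with the matching codec recovers the original character.
recovered = encoded.decode("utf-8")

print(repr(encoded))
print(len(misread), recovered == char)
```

The bytes on disk are the same in both cases; only the codec used to interpret them differs.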

How can you confirm this? Use the excellent vim editor, which understands multiple encodings, UTF-8 among them, at least when you use its graphical interface, gvim.
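If no encoding-aware editor is at hand, the same check can probably be made from Python by reading the file's raw bytes and trying to decode them as UTF-8. The sketch below assumes Python 3 syntax; `is_valid_utf8` and the throwaway file name are hypothetical helpers for illustration, not part of the question's code. A clean decode is strong evidence rather than proof, since pure ASCII text decodes cleanly under both UTF-8 and Latin-1.

```python
import io
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the file's raw bytes decode cleanly as UTF-8."""
    with io.open(path, "rb") as handle:
        raw = handle.read()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Demo on a throwaway file containing the three Danish/Norwegian letters.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "hpv0.txt")
with io.open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(u"\u00e6\u00f8\u00e5")

# Look at the bytes themselves instead of trusting an editor's rendering.
with io.open(demo_path, "rb") as handle:
    raw_bytes = handle.read()

utf8_ok = is_valid_utf8(demo_path)
print(repr(raw_bytes), utf8_ok)
```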

Another piece of general advice: never write non-ASCII characters in a Python source file unless you put a # -*- coding: ... -*- line as the first line (or the second, if the first is a shebang line such as #! /usr/bin/env python).

And if you want to use Unicode under Windows with Python, use IDLE, which handles it natively.

TL;DR: If you are using Linux, your system is most likely configured to use UTF-8 natively, and you are writing your text files correctly in UTF-8; your text editor just fails to display UTF-8 correctly.
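One way to apply this advice to the character class in the question is to spell the non-ASCII letters as Unicode escapes, so the source file itself stays pure ASCII and no coding line is needed. This is a sketch under assumptions: Python 3 syntax, text already decoded to a Unicode string, and `keep_letters` is a made-up name for illustration.

```python
import re

# The ten non-ASCII letters from the question's character class, written as
# \u escapes (ae, o-slash, a-ring in both cases, then o and u with umlaut).
EXTRA_LETTERS = u"\u00e6\u00f8\u00e5\u00c6\u00d8\u00c5\u00f6\u00d6\u00fc\u00dc"

# Anything that is not an ASCII letter, one of the extra letters, or a space.
NON_LETTERS = re.compile(u"[^a-zA-Z" + EXTRA_LETTERS + u" ]+")


def keep_letters(text):
    """Replace every run of disallowed characters with a single space."""
    return NON_LETTERS.sub(u" ", text)


cleaned = keep_letters(u"bl\u00e5b\u00e6r, 42 stk!")
print(cleaned.split())
```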

Upvotes: 2

YOBA

Reputation: 2807

Have you tried:

w.write( ' '.join(article).encode('utf8') )

And don't forget to close your files (better to use a with context manager when working with files).
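A minimal sketch of the context-manager approach, assuming Python 3 syntax and that the words are already Unicode strings. Here `io.open` is given an explicit encoding, so no manual `.encode()` call is needed; `write_words` and the output path are hypothetical names for illustration.

```python
import io
import os
import tempfile


def write_words(path, words):
    """Join words with single spaces and write them to path as UTF-8."""
    # The with block closes the file on exit, even if write() raises.
    with io.open(path, "w", encoding="utf-8") as out:
        out.write(u" ".join(words))


out_path = os.path.join(tempfile.mkdtemp(), "hpv0.txt")
write_words(out_path, [u"bl\u00e5b\u00e6r", u"r\u00f8d"])

# Read the file back with the same encoding to check the round trip.
with io.open(out_path, "r", encoding="utf-8") as src:
    round_trip = src.read()

print(round_trip == u"bl\u00e5b\u00e6r r\u00f8d")
```

Stating the encoding when the file is opened keeps the choice in one place instead of on every write call.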

Upvotes: 0
