Loewe8

Reputation: 51

Editing UTF-8 text file on Windows

I'm trying to manipulate a text file of song names. I want to clean up the data by replacing every space and tab with +.

This is the code:

input = open('music.txt', 'r')
out = open("out.txt", "w")
for line in input:
    new_line = line.replace(" ", "+")
    new_line2 = new_line.replace("\t", "+")
    out.write(new_line2)
    #print(new_line2)
input.close()
out.close()

It gives me an error:

Traceback (most recent call last):
  File "music.py", line 3, in <module>
    for line in input:
  File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2126: character maps to <undefined>

As music.txt is saved in UTF-8, I changed the first line to:

input = open('music.txt', 'r', encoding="utf8")

This gives another error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u039b' in position 21: character maps to <undefined>

I tried a few other things with out.write(), but none of them worked.

This is the raw data of music.txt. https://pastebin.com/FVsVinqW

I saved it in the Windows editor as a UTF-8 .txt file.

Upvotes: -1

Views: 202

Answers (1)

tripleee

Reputation: 189749

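Your first traceback shows Python decoding music.txt with cp1252, your system's default encoding, and the second error most likely comes from out.txt being encoded with that same default. If your system's default encoding is not UTF-8, you need to specify the encoding explicitly for both of the file handles you open, on legacy versions of Python 3 on Windows.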

with open('music.txt', 'r', encoding='utf-8') as infh,\
        open("out.txt", "w", encoding='utf-8') as outfh:
    for line in infh:
        line = line.replace(" ", "+").replace("\t", "+")
        outfh.write(line)

This demonstrates how you can use fewer temporary variables for the replacements; I also refactored to use a with context manager, and renamed the file handle variables to avoid shadowing the built-in input function.
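If you want to check the approach without touching your real files, here is a minimal sketch that runs the same replacement on a temporary file containing the Λ character from your second error. The names plus_separate and demo, and the sample song title, are made up for illustration; str.translate is just an alternative to chaining replace calls.

```python
import os
import tempfile


def plus_separate(src_path, dst_path):
    """Copy src_path to dst_path, replacing each space and tab with '+'."""
    # One translation table handles both characters in a single pass.
    table = str.maketrans({" ": "+", "\t": "+"})
    # Both handles get an explicit encoding, so the platform default
    # (cp1252 in the question) is never consulted.
    with open(src_path, "r", encoding="utf-8") as infh, \
            open(dst_path, "w", encoding="utf-8") as outfh:
        for line in infh:
            outfh.write(line.translate(table))


def demo():
    """Round-trip a made-up sample line through a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "music.txt")
        dst = os.path.join(tmp, "out.txt")
        with open(src, "w", encoding="utf-8") as fh:
            # U+039B is the character the second traceback choked on.
            fh.write("\u039b song\ttitle\n")
        plus_separate(src, dst)
        with open(dst, "r", encoding="utf-8") as fh:
            return fh.read()
```

Calling demo() should give back the sample line with the space and the tab both turned into +, and with the Λ intact.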

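Going forward, perhaps a better solution would be to upgrade your Python version. Python 3.7 and later offer an opt-in UTF-8 mode (set the PYTHONUTF8=1 environment variable or run with -X utf8), and my understanding is that PEP 686 makes UTF-8 mode the default from Python 3.15, on Windows too.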
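To see what a given installation does by default, something like the following should work. It is only a sketch: the two function names are made up, and sys.flags.utf8_mode is assumed to be available, which means Python 3.7 or later (so not the 3.6 install from the traceback).

```python
import locale
import sys


def default_text_encoding():
    """Return the locale's preferred encoding, which open() falls back to
    when no encoding argument is given (cp1252 on the asker's machine)."""
    return locale.getpreferredencoding(False)


def utf8_mode_enabled():
    """Report whether the interpreter is running in UTF-8 mode.

    Assumes Python 3.7+, where sys.flags has a utf8_mode field.
    """
    return bool(sys.flags.utf8_mode)
```

If default_text_encoding() reports anything other than UTF-8, keep passing encoding='utf-8' to open() explicitly.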

Upvotes: 1
