Jan Janiszewski

Reputation: 452

Null bytes after every letter: how do I get rid of them?

Why does Python show this odd pattern when I open my .txt file, and how can I load the text normally? I use Python 3.5. In other words, I want to get rid of the \x00 that follows every letter.

In:
f = open(file_path, encoding="utf-8", errors="ignore")
read_data = f.read()
read_data[0:100]

Out:
'H\x00i\x00e\x00r\x00b\x00i\x00j\x00 \x00w\x00i\x00l\x00 \x00i\x00k\x00 \x00u\x00 \x00m\x00e\x00d\x00e\x00d\x00e\x00l\x00e\x00n\x00,\x00 \x00d\x00a\x00t\x00 \x00i\x00k\x00 \x00m\x00i\x00j\x00n\x00 \x00s\x00p\x00a\x00a\x00r\x00r\x00e\x00k\x00e\x00n\x00'

An example of the file when I open it in my Notepad:

Hierbij wil ik u mededelen, dat ik mijn spaarrekening onder nummer __LARGENUMBER__ wil beëindigen.
Graag maak ik van de gelegenheid [... row continues]
Hierbij verzoek ik u de volgende rekening op te [... row continues]

Upvotes: 0

Views: 551

Answers (1)

Yann Vernier

Reputation: 15887

Your text most likely isn't encoded as UTF-8 but as UTF-16 (or possibly UCS-2). In that case the NULs are not stray bytes: each one is the high byte of the 16-bit code unit for the letter before it. To verify, look at the raw bytes for a BOM at the start of the file, or for a non-ASCII character such as that ë, which is stored differently in UTF-8 and UTF-16. Try using utf_16_le as the encoding when reading the file.
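
A minimal sketch of that suggestion, assuming the file really is UTF-16 little-endian without a BOM. The sample sentence and the temporary file are stand-ins for the real document, not the asker's data. If the BOM check finds one, passing encoding="utf-16" instead lets Python pick the byte order and drop the BOM from the result.

```python
import codecs
import os
import tempfile

# Hypothetical sample text standing in for the real file's contents.
text = "Hierbij wil ik u mededelen, dat ik mijn spaarrekening wil beëindigen."

# Write the sample as UTF-16-LE without a BOM (an assumption about the real file).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(text.encode("utf_16_le"))
    file_path = tmp.name

# Inspect the first two raw bytes for a UTF-16 byte order mark.
with open(file_path, "rb") as f:
    head = f.read(2)
has_bom = head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Decoding as UTF-8 reproduces the symptom: a NUL after every ASCII letter.
with open(file_path, encoding="utf-8", errors="ignore") as f:
    wrong = f.read()

# Decoding as UTF-16-LE turns each two-byte unit back into one character.
with open(file_path, encoding="utf_16_le") as f:
    right = f.read()

os.remove(file_path)

print(has_bom)
print(repr(wrong[:8]))
print(right)
```

Note that errors="ignore" in the UTF-8 attempt silently discards the byte for ë, so it hides the very evidence that the encoding is wrong; leaving it out while diagnosing makes the problem surface as a UnicodeDecodeError.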

Upvotes: 4
