Reputation: 452
Why does Python show such a weird pattern when I open my .txt file, and how can I load it normally (I use Python 3.5)? In other words, I want to get rid of the \x00 that appears after every letter.
In:
f = open(file_path, encoding="utf-8", errors="ignore")
read_data = f.read()
read_data[0:100]
Out:
'H\x00i\x00e\x00r\x00b\x00i\x00j\x00 \x00w\x00i\x00l\x00 \x00i\x00k\x00 \x00u\x00 \x00m\x00e\x00d\x00e\x00d\x00e\x00l\x00e\x00n\x00,\x00 \x00d\x00a\x00t\x00 \x00i\x00k\x00 \x00m\x00i\x00j\x00n\x00 \x00s\x00p\x00a\x00a\x00r\x00r\x00e\x00k\x00e\x00n\x00'
An example of the file when I open it in Notepad:
Hierbij wil ik u mededelen, dat ik mijn spaarrekening onder nummer __LARGENUMBER__ wil beëindigen.
Graag maak ik van de gelegenheid [... row continues]
Hierbij verzoek ik u de volgende rekening op te [... row continues]
Upvotes: 0
Views: 551
Reputation: 15887
Your text most likely isn't encoded as UTF-8, but as UTF-16 or maybe UCS-2. That means those NULs are not separate characters: each one is the second byte of the two-byte code for the letter before it. To verify, look at the raw bytes. A BOM (FF FE for little-endian) at the start of the file would confirm it, and so would that ë, which UTF-16-LE stores as EB 00 where UTF-8 would use C3 AB. Try using utf_16_le as the encoding when reading the file.
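A minimal sketch of that suggestion, assuming the file really is UTF-16 and starts with a BOM (which Notepad normally writes when saving as "Unicode"). The helper name read_text_sniffing_bom and the sample file are made up for illustration. The first part reproduces the symptom from the question; the second part picks the codec from the first two bytes.

```python
import codecs
import os
import tempfile

# Hypothetical sample text, written the way a UTF-16-LE file with a BOM would be.
sample = "Hierbij wil ik u mededelen, dat ik mijn spaarrekening wil beëindigen."
raw = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")

# Reproduce the symptom: decoding UTF-16 bytes as UTF-8 with errors="ignore"
# silently drops the BOM (and the ë byte) and keeps every NUL byte as "\x00".
wrong = raw.decode("utf-8", errors="ignore")
print(repr(wrong[:8]))


def read_text_sniffing_bom(path):
    """Read a text file as UTF-16 if it starts with a UTF-16 BOM, else as UTF-8."""
    with open(path, "rb") as f:
        head = f.read(2)
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        encoding = "utf-16"  # this codec reads the BOM and strips it
    else:
        encoding = "utf-8"   # assumption: anything without a BOM is UTF-8
    with open(path, encoding=encoding) as f:
        return f.read()


# Write the sample bytes to a temporary file and read them back correctly.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    demo_path = tmp.name

text = read_text_sniffing_bom(demo_path)
os.remove(demo_path)
print(text)
```

One difference worth knowing: the "utf-16" codec consumes the BOM, whereas "utf_16_le" keeps it, so with a BOM present the string would start with '\ufeff' unless you strip it yourself. If the file has no BOM at all, this sketch falls back to UTF-8 and you would need to pass utf_16_le explicitly.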
Upvotes: 4