Reputation: 159
I'm running a simple script on my command line: echo "Alex " > alex.txt
len(open("alex.txt").read()) returns 16 instead of 5
When I run open("alex.txt").read()
I get:
ÿþA\x00l\x00e\x00x\x00 \x00\n\x00\n\x00
What is the issue?
Upvotes: 0
Views: 210
Reputation: 9005
The number of bytes in a file and the number of characters in a string are commonly different things.
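If you stick to a limited character set such as ASCII, each character maps to exactly one byte, but modern programming languages are more sophisticated than that, and at least attempt to serve a wider range of written languages.
You generally need to know what the encoding is. You may not get any indication in the file itself. Here you do: the leading ÿþ is the UTF-16 little-endian byte order mark (bytes FF FE), and the \x00 after each letter is the second byte of a 16-bit code unit.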
After reading the bytes, you need to decode those bytes into a string:
open("alex.txt","rb").read().decode('utf-16')
You can also have open do this for you, which is likely more reliable:
open("alex.txt", encoding='utf-16').read()
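To see both approaches side by side, here is a minimal sketch. It assumes the file holds a UTF-16 LE BOM followed by "Alex " and a Windows line ending, which is consistent with the 16 characters reported in the question; the temporary path is only for illustration.

```python
import codecs
import os
import tempfile

# Assumed contents: UTF-16 LE BOM, then "Alex " and a Windows line ending.
raw = codecs.BOM_UTF16_LE + "Alex \r\n".encode("utf-16-le")

# Write the bytes to a throwaway file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "alex.txt")
with open(path, "wb") as f:
    f.write(raw)

# Option 1: read raw bytes, then decode explicitly.
# The "utf-16" codec reads the BOM to pick the byte order and drops it.
with open(path, "rb") as f:
    decoded = f.read().decode("utf-16")

# Option 2: let open() decode while reading.
# Text mode also translates "\r\n" to "\n".
with open(path, encoding="utf-16") as f:
    text = f.read()

print(len(raw))       # 16 bytes on disk
print(repr(decoded))  # 'Alex \r\n'
print(repr(text))     # 'Alex \n'
```

The remaining difference from the expected length of 5 is the trailing newline that echo adds; text.rstrip("\n") removes it.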
If you want to be fancy and detect the encoding from the BOM, see the answers here:
Reading Unicode file data with BOM chars in Python
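As a rough sketch of the BOM idea (not taken from the linked answers), you can compare the first bytes against the BOM constants in the standard codecs module. The helper name sniff_encoding and the UTF-8 fallback are assumptions made for illustration; a file without a BOM can still be in any encoding.

```python
import codecs

# (BOM, codec name) pairs. UTF-32 is checked before UTF-16 because the
# UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: pick a codec from a leading BOM, else return default."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


# Sample bytes shaped like the file in the question (without the newline).
raw = codecs.BOM_UTF16_LE + "Alex ".encode("utf-16-le")
encoding = sniff_encoding(raw)
text = raw.decode(encoding)  # these codecs consume the BOM while decoding
print(encoding, repr(text))
```

Returning the BOM-aware codec names ("utf-16", "utf-32", "utf-8-sig") means the decoded string does not start with a stray U+FEFF character.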
Upvotes: 2