Alex

Reputation: 159

len() returns wrong number for string

I'm running a simple script on my command line: echo "Alex " > alex.txt

len(open("alex.txt").read()) returns 16 instead of 5

When I run open("alex.txt").read() I get:

ÿþA\x00l\x00e\x00x\x00 \x00\n\x00\n\x00

What is the issue?

Upvotes: 0

Views: 210

Answers (1)

Geoduck

Reputation: 9005

The number of bytes in a file and the number of characters in a string are commonly different things.

If you stick to a limited set of characters, such as ASCII, you can get a one-to-one correspondence between bytes and characters, but modern programming languages are more sophisticated than that and at least attempt to serve a wider range of written languages.
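To make the bytes-versus-characters point concrete, here is a small sketch. The sample strings are just illustrations; the only thing it relies on is the standard str.encode method:

```python
text = "Alex"

# ASCII: one byte per character, so the counts match.
ascii_bytes = text.encode("ascii")

# UTF-16 (little-endian, no BOM): two bytes per character here.
utf16_bytes = text.encode("utf-16-le")

# UTF-8: a single accented character can take more than one byte.
accented_bytes = "é".encode("utf-8")

print(len(text), len(ascii_bytes), len(utf16_bytes), len(accented_bytes))
```

The same four characters occupy 4 bytes in ASCII and 8 bytes in UTF-16, which is why counting raw bytes gives a number that does not match the length of the text.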

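You generally need to know what the encoding is, and the file itself may not give any indication. In your case it does: the leading ÿþ is the pair of bytes FF FE, a UTF-16 little-endian byte order mark (BOM), and the \x00 after each letter is the second byte of each 2-byte character.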

After reading the bytes, you need to decode those bytes into a string:

open("alex.txt","rb").read().decode('utf-16')

Alternatively, you can have open() do the decoding for you, which is likely more reliable:

open("alex.txt",encoding='utf-16').read()
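Here is a minimal sketch comparing the two approaches. It recreates a file like yours in a temporary directory; the exact bytes (a FF FE BOM followed by "Alex " and a Windows line ending) are my assumption based on the output you posted:

```python
import os
import tempfile

# Assumed file contents: UTF-16 LE BOM, then "Alex \r\n" in UTF-16 LE.
path = os.path.join(tempfile.mkdtemp(), "alex.txt")
with open(path, "wb") as f:
    f.write(b"\xff\xfe" + "Alex \r\n".encode("utf-16-le"))

# Raw bytes: 2 for the BOM plus 2 per character.
with open(path, "rb") as f:
    raw = f.read()

# Decoding by hand consumes the BOM but keeps "\r\n" as two characters.
decoded = raw.decode("utf-16")

# Text mode decodes and also translates "\r\n" to "\n".
with open(path, encoding="utf-16") as f:
    text = f.read()

# Strip the line ending to get just the echoed characters.
stripped = text.rstrip("\r\n")

print(len(raw), len(decoded), len(text), len(stripped))
```

Note that even with the right encoding the length still includes the line ending that echo appended, so you need to strip it if you expect 5.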

Now, if you want to be fancy and get the encoding from the BOM, you can look at the answers here:

Reading Unicode file data with BOM chars in Python
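As a rough sketch of the BOM idea (not necessarily what the linked answers do), you can compare the start of the file against the BOM constants in the standard codecs module. The function names guess_encoding_from_bom and read_text_with_bom and the utf-8 fallback are my own choices for illustration:

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(raw, default="utf-8"):
    """Return a codec name based on a leading BOM, or the default if none."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


def read_text_with_bom(path, default="utf-8"):
    """Read a file as bytes and decode it using the BOM-derived codec."""
    with open(path, "rb") as f:
        raw = f.read()
    # The "utf-8-sig", "utf-16" and "utf-32" codecs consume the BOM.
    return raw.decode(guess_encoding_from_bom(raw, default))
```

This only helps when the file actually starts with a BOM; a file without one falls back to the default, which may or may not be right.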

Upvotes: 2
