Reputation: 1
This may be a newbie question, but here it goes. I have a large string (167572 bytes) containing both ASCII and non-ASCII characters. When I use len() on the string I get the wrong length. It seems that len() doesn't count 0x0A characters. The only way I can get the actual length of the string is with this code:
totalLen = 0
for x in test:
    totalLen += 1
for x in test:
    if x == '\x0a':
        totalLen += 1
print totalLen
What is wrong with len()? Or am I using it wrong?
Upvotes: 0
Views: 152
Reputation: 1123930
You are confusing encoded byte strings with Unicode text. In UTF-8, for example, up to 4 bytes are used to encode a single character; in UTF-16, each character is encoded using at least 2 bytes.
A Python 2 string is a sequence of bytes; to get Unicode text you have to decode the string with the appropriate codec. If your text is encoded as UTF-8, for example, you can decode it with:
test = test.decode('utf8')
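A short sketch of the difference (the byte values here are an illustrative UTF-8 encoding of "café", not your actual data):

```python
# A byte string holding the UTF-8 encoding of "café":
# the accented character takes two bytes (0xC3 0xA9).
data = b'caf\xc3\xa9'
print(len(data))    # 5: on a byte string, len() counts bytes

# Decoding produces Unicode text, where len() counts characters.
text = data.decode('utf-8')
print(len(text))    # 4
```

So the same text gives two different lengths depending on whether you measure bytes or characters.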
On the other hand, data written to a file is always encoded, so a unicode string of length 10 could take up 20 bytes in a file, if written using the UTF-16 codec.
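For instance (a sketch with a made-up ten-character string), encoding the same text with different codecs produces different byte counts:

```python
text = u'abcdefghij'              # 10 characters of text
print(len(text))                  # 10

utf8 = text.encode('utf-8')
print(len(utf8))                  # 10: ASCII characters are 1 byte each in UTF-8

utf16 = text.encode('utf-16-le')  # little-endian variant, no byte-order mark
print(len(utf16))                 # 20: 2 bytes per character
```

The plain 'utf-16' codec would add a 2-byte byte-order mark on top of that, giving 22 bytes in a file.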
Most likely you are getting confused by such 'wider' characters, not by whether or not your \n (ASCII 10) characters are counted correctly.
Please do yourself a favour and read up on Unicode and encodings.
Upvotes: 6
Reputation: 400019
Could it be that you're expecting it to contain \r\n, i.e. ASCII 13 (carriage return) followed by ASCII 10 (line feed), or that you are looking at the string once it has been written out to a text file, which adds these?
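To illustrate with made-up strings: len() does count \n characters, and Windows-style \r\n line endings simply add one extra character per line:

```python
unix_text = 'one\ntwo\n'     # \n line endings
dos_text = 'one\r\ntwo\r\n'  # \r\n line endings

print(len(unix_text))  # 8: each newline counts as one character
print(len(dos_text))   # 10: each \r\n pair counts as two
```

Comparing len() against a count like this would show whether the difference you see matches the number of line breaks in the string.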
It's hard to be specific since you don't give a lot of detail, i.e. where the string's data comes from.
Upvotes: 4