Johnny Karlsson

Reputation: 1

Large strings and len()

This may be a newbie question, but here goes. I have a large string (167572 bytes) containing both ASCII and non-ASCII characters. When I use len() on the string I get the wrong length. It seems that len() doesn't count 0x0A characters. The only way I can get the actual length of the string is with this code:

totalLen = 0
for x in test:
   totalLen += 1
for x in test:          # count the newlines a second time
   if x == '\x0a':
      totalLen += 1
print totalLen

What is wrong with len()? Or am I using it wrong?
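For what it's worth, a quick sanity check (with a made-up test string, Python 3 syntax) shows that len() does count newline characters:

```python
# len() counts every character, including '\x0a' ('\n')
s = 'abc\ndef\n'
assert len(s) == 8            # 6 letters + 2 newlines
assert s.count('\x0a') == 2   # both newlines are present and counted
```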

Upvotes: 0

Views: 152

Answers (2)

Martijn Pieters

Reputation: 1123930

You are confusing encoded byte strings with unicode text. In UTF-8, for example, up to 4 bytes are used to encode any given character; in UTF-16, each character is encoded using at least 2 bytes.

A Python string is a series of bytes; to get unicode you'd have to decode the string with an appropriate codec. If your text is encoded using UTF-8, for example, you can decode it with:

test = test.decode('utf8')

On the other hand, data written to a file is always encoded, so a unicode string of length 10 could take up 20 bytes in a file, if written using the UTF-16 codec.
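The same relationship in modern Python 3 terms, where str is unicode text and bytes are explicit ('héllo' is just a sample string with one non-ASCII character):

```python
text = 'héllo'                     # 5 characters of unicode text
utf8 = text.encode('utf-8')        # 'é' takes 2 bytes in UTF-8
utf16 = text.encode('utf-16-le')   # every character takes at least 2 bytes

assert len(text) == 5    # character count
assert len(utf8) == 6    # byte count: 4 ASCII chars + 2 bytes for 'é'
assert len(utf16) == 10  # byte count: 5 chars x 2 bytes each
```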

Most likely you are getting confused by such 'wider' characters, not by whether or not your \n (ASCII 10) characters are counted correctly.

Please do yourself a favour and read up on Unicode and encodings.

Upvotes: 6

unwind

Reputation: 400019

Could it be that you're expecting it to contain \r\n, i.e. ASCII 13 (carriage return) followed by ASCII 10 (line feed), or that you're looking at the string after it's been written out to a text file, which adds these?
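For instance (a hypothetical string, Python 3 syntax): the in-memory string can be shorter than what lands in a text-mode file on Windows, because each '\n' is expanded to '\r\n' on write:

```python
s = 'one\ntwo\n'
assert len(s) == 8                  # two '\n' characters, no '\r'

# simulate what text mode on Windows writes to disk
expanded = s.replace('\n', '\r\n')
assert len(expanded) == 10          # each '\n' became two bytes
```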

It's hard to be specific since you don't give a lot of detail, i.e. where the string's data comes from.

Upvotes: 4
