charpi

Reputation: 363

Iterating through a unicode string in Python

I'm having trouble iterating through a unicode string, character by character, in Python.

print "w: ",word
for c in word:
    print "word: ",c

This is my output:

w:  文本
word:  ? 
word:  ?
word:  ?
word:  ?
word:  ?
word:  ?

My desired output is:

文
本

When I use len(word) I get 6. Apparently each character is stored as 3 bytes.

So my unicode string is stored in the variable successfully, but I cannot get the individual characters out. I have tried encode('utf-8'), decode('utf-8') and codecs, but still cannot get good results. This seems like a simple problem, but it is frustratingly hard for me.

I hope someone can point me in the right direction.

Thanks!

Upvotes: 10

Views: 7478

Answers (4)

DevB2F

Reputation: 5075

For Python 3, this is what works:

import unicodedata

word = "文本"
word = unicodedata.normalize('NFC', word)
for char in word:
    print(char)
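
As a hedged illustration of when the normalize call makes a difference: in Python 3 a str is already a sequence of code points, so iterating over "文本" yields two characters with or without NFC. Normalization only changes the result when a character is written as a base letter plus combining marks. The sample string "e\u0301" below is an assumption chosen for the demonstration, not something from the question.

```python
import unicodedata

# "é" written as a base letter followed by a combining acute accent
# (two code points). This sample is illustrative only.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

decomposed_chars = list(decomposed)  # two items: 'e' and the combining accent
composed_chars = list(composed)      # one item: the precomposed character

# The word from the question is already made of single code points,
# so NFC leaves it unchanged.
cjk_chars = list(unicodedata.normalize("NFC", "文本"))

print(len(decomposed_chars), len(composed_chars), len(cjk_chars))
```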

Upvotes: 1

Tsing

Reputation: 46

You should convert the word from a byte string (str) to unicode:

print "w: ",word
for c in word.decode('utf-8'):
    print "word: ",c
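
For comparison, here is a rough Python 3 sketch of the same idea. It assumes the word arrives as UTF-8 encoded bytes, which would explain the length of 6 reported in the question. The variable names are illustrative.

```python
# The UTF-8 bytes of the two-character word from the question.
raw = "文本".encode("utf-8")

byte_count = len(raw)        # counts bytes, not characters
text = raw.decode("utf-8")   # bytes -> text (str in Python 3, unicode in Python 2)
chars = list(text)           # iterating over text yields whole characters

print(byte_count)
for c in chars:
    print("word:", c)
```

Iterating over the undecoded bytes is what produces one item per byte instead of one per character.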

Upvotes: 2

charpi

Reputation: 363

This is the code that worked for me:

import codecs

# codecs.open decodes the file as UTF-8, so the words read from it are unicode
fileContent = codecs.open('fileName.txt', 'r', encoding='utf-8')
#...split by whitespace to get words..
for c in word:
    print(c.encode('utf-8'))
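
A minimal Python 3 sketch of the same approach, under the assumption that the file is UTF-8 encoded: the built-in open accepts an encoding argument, so codecs is not needed there. The temporary file and its contents are hypothetical and exist only to keep the example self-contained.

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch can run on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("文本 编码\n")

chars_per_word = []
with open(path, "r", encoding="utf-8") as f:
    for line in f:
        for word in line.split():  # split on whitespace to get words
            chars_per_word.append(list(word))  # one item per character

print(chars_per_word)
```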

Upvotes: 2

Pruthvi Raj

Reputation: 3036

# -*- coding: utf-8 -*-
word = "文本"
print(word)
for each in unicode(word,"utf-8"):
    print(each)

Output:

文本
文
本

Upvotes: 17
