Reputation: 363
I'm having trouble iterating through Unicode strings, character by character, in Python.
print "w: ",word
for c in word:
    print "word: ",c
This is my output
w: 文本
word: ?
word: ?
word: ?
word: ?
word: ?
word: ?
My desired output is:
文
本
When I use len(word) I get 6. Apparently each character takes up 3 bytes.
So, my unicode string is successfully stored in the variable, but I cannot get the characters out. I have tried using encode('utf-8'), decode('utf-8') and codecs, but still cannot get good results. This seems like a simple problem, but it is frustratingly hard for me.
Hope someone can point me in the right direction.
Thanks!
Upvotes: 10
Views: 7478
Reputation: 5075
For Python 3, this is what works:
import unicodedata
word = "文本"
word = unicodedata.normalize('NFC', word)
for char in word:
    print(char)
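To illustrate what the normalize call contributes, here is a minimal Python 3 sketch. The sample strings and the names decomposed, composed and unchanged are made up for illustration. NFC merges a base letter and a combining mark into one code point where a precomposed form exists, so the loop yields one item per visible character. For "文本" itself it changes nothing, and plain iteration already works in Python 3.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT:
# two code points that render as one visible character.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))        # 2
print(len(composed))          # 1
print(composed == "\u00e9")   # True: NFC produced the precomposed form

# The word from the question has no combining marks,
# so NFC leaves it as it is.
word = "文本"
unchanged = unicodedata.normalize("NFC", word) == word
print(unchanged)              # True
```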
Upvotes: 1
Reputation: 46
You should convert the word from a byte string (str) to unicode:
print "w: ",word
for c in word.decode('utf-8'):
    print "word: ",c
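To see why the original loop printed six question marks, it may help to compare the byte view and the character view of the same word. This is only a sketch, and it uses Python 3 syntax, which is an assumption since the question is on Python 2. There, str behaves like bytes here and unicode behaves like str here. The names raw, text and chars are illustrative.

```python
# Python 3 sketch: the same word as UTF-8 bytes and as decoded text.
raw = "文本".encode("utf-8")   # roughly what the Python 2 byte string holds
text = raw.decode("utf-8")     # decoded text: one item per character

print(len(raw))    # 6, three UTF-8 bytes per character
print(len(text))   # 2

# Iterating the decoded text yields whole characters, not single bytes.
chars = [c for c in text]
for c in chars:
    print(c)
```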
Upvotes: 2
Reputation: 363
This is the code that ended up working for me:
import codecs
fileContent = codecs.open('fileName.txt','r',encoding='utf-8')
#...split by whitespace to get words..
for c in word:
    print(c.encode('utf-8'))
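For comparison, a rough Python 3 version of the same idea, written as a self-contained sketch. The temporary file and its contents are made up to stand in for fileName.txt, and io.open is used here instead of codecs.open. Because the file is opened with an encoding, each line is already decoded, so splitting and iterating give whole characters.

```python
import io
import os
import tempfile

# Hypothetical sample file standing in for fileName.txt.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write("文本 你好\n")

collected = []
with io.open(path, "r", encoding="utf-8") as f:
    for line in f:                   # each line is decoded text
        for word in line.split():    # split by whitespace to get words
            for c in word:           # one character per iteration
                collected.append(c)

print(len(collected))  # 4
```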
Upvotes: 2
Reputation: 3036
# -*- coding: utf-8 -*-
word = "文本"
print(word)
for each in unicode(word,"utf-8"):
    print(each)
Output:
文本
文
本
Upvotes: 17