Reputation: 11
I downloaded Spanish text from NLTK in Python using
spanish_sents=nltk.corpus.floresta.sents()
When I print the sentences in the terminal, the Spanish characters are not rendered. For example, printing spanish_sents[1]
produces characters like u'\xe9'
If I encode it using utf-8, as in
print [x.encode("utf-8") for x in spanish_sents[1]]
it produces '\xc3\xa9'
and encoding in latin3,
print [x.encode("latin3") for x in spanish_sents[1]]
produces '\xe9'
How can I configure my terminal to print the glyphs for these code points? Thanks
Upvotes: 1
Views: 5688
Reputation: 82028
My guess is that there are a few things going on. First, you're iterating through a str (is spanish_sents[1] one entire entry? What happens when you print that?). Second, you're not getting full characters because of that iteration: a non-ASCII character takes more than one byte in a UTF-8 encoded str, so addressing a single index will look weird. Third, you are trying to encode when you probably mean to decode.
Try this:
print spanish_sents[1].decode('utf-8')
I just ran the following in my terminal to help give context:
>>> a = '®†\¨ˆø' # Storing non-ASCII characters in a str is ill-advised;
# I do this as an example because it's what I think your question is
# really asking
>>> a # a now looks like a bunch of gibberish if I just output
'\xc2\xae\xe2\x80\xa0\\\xc2\xa8\xcb\x86\xc3\xb8'
>>> print a # Well, this looks normal.
®†\¨ˆø
>>> print repr(a) # Just demonstrating how the above works
'\xc2\xae\xe2\x80\xa0\\\xc2\xa8\xcb\x86\xc3\xb8'
>>> a[0] # Indexing a str returns a single byte, not a whole character.
'\xc2'
>>> print a[0] # One byte is not a complete UTF-8 sequence, so the terminal balks
?
>>> print a.decode('utf-8') # Look familiar?
®†\¨ˆø
>>> print a.decode('utf-8')[0] # Our first character!
®
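The byte-versus-character point from the session above can also be checked without relying on how the terminal renders anything. This is a minimal sketch, not part of the original answer; the variable names are made up for illustration, and the `__future__` import is only there so the same lines run on Python 2.7 as well as Python 3.

```python
from __future__ import print_function

text = u'\xe9'                      # one code point, U+00E9 (e with acute accent)
utf8_bytes = text.encode('utf-8')   # text -> bytes

print(len(text))        # number of code points
print(len(utf8_bytes))  # number of bytes in the UTF-8 form

# Slicing (rather than indexing) keeps a bytes object on both Python 2 and 3.
first_byte = utf8_bytes[0:1]
try:
    first_byte.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    # A lone lead byte is not a complete UTF-8 sequence.
    decoded_ok = False

# Decoding the complete byte string gives the original text back.
round_trip = utf8_bytes.decode('utf-8')
```

Encoding goes from text to bytes and decoding goes from bytes to text, which is why decoding only works on a complete byte sequence.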
Upvotes: 0
Reputation: 148965
Just an initial remark, Latin3 or ISO-8859-3 is indeed denoted as South European, but it was designed to cover Turkish, Maltese and Esperanto. Spanish is more commonly encoded in Latin1 (ISO-8859-1 or West European) or Latin9 (ISO-8859-15).
I can confirm that the letter é has the Unicode code point U+00E9 and is represented as '\xe9' in both Latin1 and Latin3. It is encoded as '\xc3\xa9' in UTF8, so all your conversions are correct.
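Those byte values are easy to check directly. The sketch below is only an illustration (the names `e_acute`, `encodings` and `encoded` are made up here); it encodes U+00E9 with the codecs mentioned in this thread using Python's standard codec aliases, and prints the repr of the bytes so nothing depends on the terminal.

```python
from __future__ import print_function

e_acute = u'\xe9'  # U+00E9

# Codec names are standard Python aliases for the encodings discussed above.
encodings = ['utf-8', 'latin1', 'latin3', 'iso-8859-15', 'cp1252']
encoded = {name: e_acute.encode(name) for name in encodings}

for name in encodings:
    # repr() of a byte string is plain ASCII, so this prints safely anywhere.
    print(name, repr(encoded[name]))
```

Only the UTF-8 form uses two bytes; the single-byte encodings listed all place é at 0xE9.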
But the real question, "How can I configure my terminal...?", is hard to answer without knowing what the terminal is. It depends on the platform:
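One way to find out what Python itself believes about the terminal is to inspect the output stream and the locale. This is a rough sketch under assumptions: `can_show` is a hypothetical helper written for this example, and the reported encodings depend entirely on the environment the script runs in.

```python
from __future__ import print_function

import locale
import sys

# May be None (for example when output is piped on Python 2).
stdout_encoding = getattr(sys.stdout, 'encoding', None)
preferred = locale.getpreferredencoding()

print('stdout encoding:', stdout_encoding)
print('locale preferred encoding:', preferred)


def can_show(text, encoding):
    """Hypothetical helper: True if `text` can be encoded with `encoding`."""
    if encoding is None:
        return False
    try:
        text.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        # Either the character is not in the charset or the codec is unknown.
        return False


print('can show e-acute:', can_show(u'\xe9', stdout_encoding))
```

If the last line reports False, the terminal or locale settings need to change before accented characters can be printed.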
On Windows, in a console window, run chcp 1252 and use Latin1 encoding (or even better 'cp1252').
On Linux or another Unix-like system, run export LANG=en_US.UTF8 (choose your own language if you do not like American English; the interesting part here is .UTF8) and use UTF8 encoding. Alternatively, declare an ISO-8859-1 charset (export LANG=en_US.ISO-8859-1) and use Latin1 encoding.
Upvotes: 2
Reputation: 42758
What you are looking at is the representation (repr) of the strings: printing a list shows the repr of each element, which is only meant for debugging.
To print the contents of a list, use .join:
print ', '.join(spanish_sents[1])
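To see why the list output looks different from the joined output, here is a small sketch. The `words` list is a made-up stand-in for one tokenized sentence, not actual corpus data, and the `__future__` import is only there so the print call also works on Python 2.7.

```python
from __future__ import print_function

# Hypothetical stand-in for one tokenized sentence from the corpus.
words = [u'caf\xe9', u'ol\xe9']

# str() of a list is assembled from repr() of each element,
# which is where escapes such as u'\xe9' come from on Python 2.
list_text = str(words)
rebuilt = '[' + ', '.join(repr(w) for w in words) + ']'

# Joining gives one plain text string instead of a list.
joined = u' '.join(words)

# Printing the text itself needs a terminal encoding that can represent
# the accented characters.
print(joined)
```

The join separator is a matter of taste: a space reads like a sentence, while ', ' keeps the tokens visibly separate.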
Upvotes: 1