Holly16

Reputation: 11

Spanish characters not being displayed on the terminal in python

I downloaded Spanish text from NLTK in Python using

import nltk
spanish_sents = nltk.corpus.floresta.sents()

When I print the sentences in the terminal, the Spanish characters are not rendered. For example, printing spanish_sents[1] produces escapes like u'\xe9', and if I encode each word as UTF-8, as in

print [x.encode("utf-8") for x in spanish_sents[1]]

it produces '\xc3\xa9', and encoding in latin3

print [x.encode("latin3") for x in spanish_sents[1]]

it produces '\xe9'

How can I configure my terminal to print the glyphs for these code points? Thanks

Upvotes: 1

Views: 5688

Answers (3)

cwallenpoole

Reputation: 82028

My guess is that a few things are going on. First, you're iterating through a str (is spanish_sents[1] one entire entry? What happens when you print that?). Second, you're not getting full characters because you're iterating through a str: a non-ASCII character takes more than one byte in UTF-8, so addressing a single index gives you only part of it and will look weird. Third, you are trying to encode when you probably mean to decode.

Try this:

 print spanish_sents[1].decode('utf-8')

I just ran the following in my terminal to help give context:

>>> a = '®†\¨ˆø' # Storing non-ASCII characters in a str is ill-advised;
                 # I do this as an example because it's what I think your question is
                 # really asking
>>> a # a now looks like a bunch of gibberish if I just output
'\xc2\xae\xe2\x80\xa0\\\xc2\xa8\xcb\x86\xc3\xb8'
>>> print a # Well, this looks normal.
®†\¨ˆø
>>> print repr(a) # Just demonstrating how the above works
'\xc2\xae\xe2\x80\xa0\\\xc2\xa8\xcb\x86\xc3\xb8'
>>> a[0] # Indexing a str returns a single byte, not a whole character.
'\xc2'
>>> print a[0] # On its own that byte is not a complete UTF-8 sequence, so the terminal balks
?
>>> print a.decode('utf-8') # Look familiar?
®†\¨ˆø
>>> print a.decode('utf-8')[0] # Our first character!
®
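For readers on Python 3, where bytes and text are separate types, the same point can be sketched roughly as follows. This assumes Python 3 rather than the Python 2 used in the session above, and the names `text` and `raw` are only illustrative.

```python
# Python 3 sketch: text (str) and its UTF-8 encoding (bytes) are distinct types.
text = '\u00ae\u2020\u00f8'    # three characters: registered sign, dagger, o-slash
raw = text.encode('utf-8')     # the same text as UTF-8 bytes

# Three characters, but seven bytes, because each one needs 2 or 3 bytes.
print(len(text), len(raw))

# Slicing the bytes gives one byte, which is an incomplete UTF-8 sequence.
print(raw[:1])

# Decoding first and indexing afterwards gives a whole character.
first_char = raw.decode('utf-8')[0]
print(first_char == text[0])
```

Indexing after decoding works on characters, while indexing before decoding works on bytes, which is the difference the session above demonstrates.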

Upvotes: 0

Serge Ballesta

Reputation: 148965

Just an initial remark: Latin3 (ISO-8859-3) is indeed denoted as South European, but it was designed to cover Turkish, Maltese and Esperanto. Spanish is more commonly encoded in Latin1 (ISO-8859-1 or West European) or Latin9 (ISO-8859-15).

I can confirm that the letter é has the Unicode code point U+00E9 and is represented as '\xe9' in both Latin1 and Latin3. It is encoded as '\xc3\xa9' in UTF-8, so all your conversions are correct.
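If you want to check these byte values yourself, a minimal sketch along these lines should do. It is written for Python 3 (an assumption, since the question uses Python 2), and the names `e_acute` and `encoded` are only illustrative.

```python
# Encode the single character U+00E9 with several codecs and compare the bytes.
e_acute = '\u00e9'  # the letter e with an acute accent

codecs_to_try = ('utf-8', 'latin-1', 'iso8859_3', 'iso8859_15')
encoded = {name: e_acute.encode(name) for name in codecs_to_try}

for name in codecs_to_try:
    # The single-byte Latin codecs agree; UTF-8 needs two bytes.
    print(name, encoded[name])
```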

But the real question, "How can I configure my terminal...?", is hard to answer without knowing what the terminal is:

  • if it is a true teletype or old vt100 without accented characters: you cannot (but I do not think you use that...)
  • if you use a Windows console, switch to code page 1252 (very close to Latin1) with chcp 1252 and use Latin1 encoding (or even better 'cp1252')
  • if you use xterm (or any derivative) on Linux or any other Unix or Unix-like system, declare a UTF-8 charset with export LANG=en_US.UTF8 (choose your own language if you do not like American English; the interesting part here is .UTF8) and use UTF-8 encoding. Alternatively, declare an ISO-8859-1 charset (export LANG=en_US.ISO-8859-1) and use Latin1 encoding
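Whichever terminal you have, it can help to see which encoding Python believes it is writing to and whether your text fits in it. Below is a rough sketch for Python 3 (an assumption, the question uses Python 2); `can_encode` is a hypothetical helper of mine, not a library function, and the printed encodings depend entirely on your setup.

```python
import locale
import sys


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# What Python reports for the output stream and the locale (system dependent).
print('stdout encoding:', sys.stdout.encoding)
print('locale encoding:', locale.getpreferredencoding(False))

# Check a sample word containing U+00E9 against a few candidate encodings.
sample = 'caf\u00e9'
for name in ('ascii', 'cp1252', 'latin-1', 'utf-8'):
    print(name, can_encode(sample, name))
```

If the reported encoding does not match what the terminal really uses, that mismatch is usually what turns accented letters into garbage.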

Upvotes: 2

Daniel

Reputation: 42758

What you are looking at is the representation of the strings: printing a list shows the repr of each element, which is only meant for debugging.

To print the contents of a list, use .join:

print ', '.join(spanish_sents[1])
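To make the difference concrete, here is a small sketch for Python 3 (an assumption on my part, the code above is Python 2). The word list is made up rather than taken from the corpus, and `ascii()` is used because it gives an escaped form similar to what Python 2 shows when printing a list.

```python
# Hypothetical tokens containing U+00E9; not taken from the NLTK corpus.
words = ['caf\u00e9', 'ol\u00e9']

# Escaped debugging form, similar to what Python 2 prints for a list.
as_list = ascii(words)

# Plain text built from the elements, which is what you want to display.
as_text = ', '.join(words)

print(as_list)
print(len(as_text))
```

The first form keeps the escapes because it describes the list, while the joined string contains the actual characters.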

Upvotes: 1
