crazyaboutliv

Reputation: 3199

Handling French text in Python

I am trying to read some French text and do a frequency analysis of the words. I want the accented characters and other diacritics to be preserved. So I tried this as a test:

>>> import codecs
>>> f = codecs.open('file','r','utf-8')
>>> for line in f:
...     print line
...

Faites savoir à votre famille que vous êtes en sécurité.

So far, so good. But I have a directory of French files that I iterate over in the following way:

import codecs,sys,os

path = sys.argv[1]
for f in os.listdir(path):
    french = codecs.open(os.path.join(path,f),'r','utf-8')
    for line in french:
        print line

Here, it gives the following error:

rdholaki74: python TestingCodecs.py ../frenchResources | more
Traceback (most recent call last):
  File "TestingCodecs.py", line 7, in <module>
    print line
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 14: ordinal not in range(128)

Why does the same file raise an error when its directory is passed as an argument, but not when the file name is given explicitly in the code?

Thanks.

Upvotes: 2

Views: 2065

Answers (2)

jfs

Reputation: 414745

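The error comes from print, not from reading the file: it happens because the output is redirected to a pipe. You could set the output encoding explicitly: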

PYTHONIOENCODING=utf-8 python ... | ...

Specify another encoding if your terminal doesn't use UTF-8.
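
To see what the variable changes, here is a small sketch that starts a child interpreter with PYTHONIOENCODING set and captures the bytes it writes to a pipe. The helper name run_child_with_io_encoding is hypothetical and the child program is only an illustration, not the script from the question.

```python
import os
import subprocess
import sys


def run_child_with_io_encoding(encoding):
    # Copy the current environment and set PYTHONIOENCODING for the child only.
    env = dict(os.environ)
    env['PYTHONIOENCODING'] = encoding
    # The child writes a single non-ASCII character (a with grave accent)
    # to its stdout, which is a pipe here rather than a terminal.
    child_code = "import sys; sys.stdout.write(u'\\xe0')"
    # check_output returns the raw bytes the child wrote to the pipe.
    return subprocess.check_output([sys.executable, '-c', child_code], env=env)
```

With 'utf-8' the child emits the two-byte UTF-8 sequence for the character; with 'latin-1' it emits a single byte. Pick whichever encoding the program at the other end of the pipe expects.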

Upvotes: 2

Ignacio Vazquez-Abrams

Reputation: 799230

You're misinterpreting the cause. The difference is not the command-line argument but the pipe to more: when the output is piped, Python can't detect which encoding to use and falls back to ASCII. If stdout is not a TTY, you need to encode the text as UTF-8 yourself before outputting it.
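
A minimal sketch of the manual encoding step, assuming UTF-8 is what the program reading the pipe expects. The helper names byte_stdout and utf8_writer are hypothetical, and the getattr fallback is there only so the same code runs on both Python 2 and Python 3.

```python
import codecs
import sys


def byte_stdout():
    # Python 3 keeps the underlying byte stream in sys.stdout.buffer;
    # in Python 2, sys.stdout itself accepts byte strings.
    return getattr(sys.stdout, 'buffer', sys.stdout)


def utf8_writer(stream):
    # Wrap a byte stream so that unicode text written to it is
    # encoded as UTF-8, whether or not the stream is a TTY.
    return codecs.getwriter('utf-8')(stream)
```

In the question's loop, this would mean creating out = utf8_writer(byte_stdout()) once and calling out.write(line) instead of print line. In Python 2 specifically, print line.encode('utf-8') has the same effect for a single line.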

Upvotes: 2
