Why is there a unicode error only when piping the output of a python script?

Question

I use a Python script written in 2.7 (seafile-cli, from Seafile, a file synchronization solution).

I know that unicode is problematic in Python 2 but the filenames with diacritic signs are thankfully handled correctly when starting the script:

$ seaf-cli status
# Name  Status  Progress
photos  downloading     0/0, 0.0KB/s
Ma bibliothèque downloading     566/1770, 1745.7KB/s
videos  downloading     28/1203, 5088.0KB/s
dev-perso       downloading     0/0, 0.0KB/s
dev-pro downloading     0/0, 0.0KB/s

To my surprise, when piping this output, the Python script crashes with UnicodeEncodeError:

$ seaf-cli status | cat -
# Name  Status  Progress
photos  downloading     0/0, 0.0KB/s
Traceback (most recent call last):
  File "/usr/bin/seaf-cli", line 845, in <module>
    main()
  File "/usr/bin/seaf-cli", line 841, in main
    args.func(args)
  File "/usr/bin/seaf-cli", line 649, in seaf_status
    tx_task.rate/1024.0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 11: ordinal not in range(128)

While I understand that it might have had issues with Ma bibliothèque in the first place (which it has not), why does piping the output trigger a traceback?

Shouldn't that be the shell's problem? The output has "left" the script at that point.

EDIT: the answer is in another question. Marking as duplicate.

Chen A. · Accepted Answer

When the output goes to a terminal, Python knows how to encode the text it prints, because it uses whatever encoding your terminal application is using. In Python 2, sys.stdout.encoding is set in that case, but it is None when the output is piped, so Python falls back to ASCII.

When you are sending (piping) your output out, it needs to be encoded. This is because a pipe carries a stream of bytes between the applications. Every pipe is a unidirectional channel, where one side writes data and the other side reads it.

With a pipe or a redirection, you are writing data to a file descriptor (fd), which another application then reads.

So you need to make sure Python correctly encodes the data before it sends it out, and then the receiving program needs to decode it before processing.
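As a rough illustration of the byte-stream point, here is a minimal sketch that writes the library name from the question through an OS pipe. The choice of UTF-8 and the variable names are assumptions made for the example; they are not taken from seaf-cli.

```python
import os

# The name from the question; \xe8 is the "è" the traceback complains about.
name = u"Ma biblioth\xe8que"

# A pipe only carries bytes, so the text is encoded before writing...
read_fd, write_fd = os.pipe()
os.write(write_fd, name.encode("utf-8"))
os.close(write_fd)

# ...and the reading side has to decode with the same encoding.
received = os.read(read_fd, 1024)
os.close(read_fd)
decoded = received.decode("utf-8")

# ASCII cannot represent \xe8, which is the failure seen in the question.
try:
    name.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc
```

The ASCII attempt fails at index 11, the same position reported in the traceback.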

You also might find this question useful

Update: I'll try to elaborate more about encoding. What I mean by the first line of my answer is that, because your Python interpreter uses a specific encoding, it knows how to transform the hex values (the actual bytes) into symbols.

My interpreter doesn't; if I try to create a string from your text - I get an error:

>>> s = 'bibliothèque'
Unsupported characters in input

This is because I use a different encoding in my interpreter.

When the output is piped, Python no longer has a terminal encoding to rely on. When it sends data out of your program, it uses the default encoding: ASCII. ASCII can't represent your special character (shown in the traceback as the hex value \xe8). So you have to specify which encoding Python should use to send it.
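One way to specify the encoding is to choose it explicitly at the point of writing. The following is only a sketch: pick_output_encoding, FakePipe and FakeTerminal are hypothetical names invented for the example, and the UTF-8 fallback is an assumption rather than anything seaf-cli does.

```python
import sys


def pick_output_encoding(stream, fallback="utf-8"):
    """Use the stream's own encoding when it has one, else the fallback.

    Hypothetical helper: in Python 2, sys.stdout.encoding is None when
    stdout is a pipe, which is why a fallback is needed at all.
    """
    return getattr(stream, "encoding", None) or fallback


class FakePipe(object):
    """Stand-in for a piped stdout that reports no encoding."""
    encoding = None


class FakeTerminal(object):
    """Stand-in for a terminal stdout that reports its encoding."""
    encoding = "UTF-8"


line = u"Ma biblioth\xe8que\tdownloading"

# Encode explicitly instead of relying on the implicit ASCII default.
piped_bytes = line.encode(pick_output_encoding(FakePipe()))
terminal_bytes = line.encode(pick_output_encoding(FakeTerminal()))
```

In a real script, the same helper would be called with sys.stdout instead of the stand-in classes.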

You might be able to overcome this if you change your shell encoding - check this question on SO.
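Another option that needs no change to the script is the PYTHONIOENCODING environment variable, which tells the interpreter which encoding to use for its standard streams. The sketch below starts a child interpreter with a piped stdout and sets that variable; run_child and CHILD are hypothetical names for the example, and the child is a one-line stand-in for the real script, not seaf-cli itself.

```python
import os
import subprocess
import sys

# Stand-in for the real script: it writes one non-ASCII name to stdout.
CHILD = r"import sys; sys.stdout.write(u'biblioth\xe8que\n')"


def run_child(io_encoding):
    """Run the stand-in with stdout piped and PYTHONIOENCODING set."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, _ = proc.communicate()
    return proc.returncode, out


# With UTF-8 the child can encode the name even though stdout is a pipe.
utf8_status, utf8_output = run_child("utf-8")

# With ASCII the child fails with UnicodeEncodeError, as in the question.
ascii_status, _ = run_child("ascii")
```

On the command line, the equivalent would be to set PYTHONIOENCODING=utf-8 in the environment of the seaf-cli status command before piping it.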

PS - There's a great video by Ned Batchelder about Unicode on YouTube; maybe it will shed some more light on the subject.
