Reputation: 97
While trying to write a string to a file, the following error occurred:
Code
logfile.write(cli_args.last_name)
Output
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)
But this works:
Code
print(cli_args.last_name)
Output
Pérez
Why?
I wrote a script that receives data from a Linux CLI, processes it, and finally creates a Zendesk ticket with the provided data. It is a kind of CLI API: in front of my script there is a bigger system with a web interface, where users fill in form fields whose values are then substituted into the script's command line. For example:
myscript.py --first_name '_first_name_' --last_name '_last_name_'
The script was working with no issues until yesterday, when the website was updated. I think they changed something related to charsets or encoding.
I do some simple logging with f-strings: I open a file and write informative messages so that, if anything fails, I can go back and check where it happened. The CLI arguments are read with the argparse module. Example:
logfile.write(f"\tChecking for opened tickets for user '{cli_args.first_name} {cli_args.last_name}'\n")
After the website update I am getting an error like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)
While troubleshooting, I found that it happens because some users enter names with accent marks, like Carlos Pérez.
I need the script to work again and to handle inputs like that, so I looked for answers by checking the HTTP headers of the input forms in the web console and found that they use Content-Type: text/html; charset=UTF-8. My first try was to encode the str passed in the CLI argument to utf-8 and decode it again using the same codec, but that did not succeed.
On my second try, I checked the Python docs for str.encode() and bytes.decode(), and tried this:
logfile.write(
    "\tChecking for opened tickets for user "
    f"'{cli_args.first_name.encode(encoding='utf-8', errors='ignore').decode('utf-8')} "
    f"{cli_args.last_name.encode(encoding='utf-8', errors='ignore').decode('utf-8')}'"
)
It worked, but it removed the accented letter, so Carlos Pérez became Carlos Prez, which is of no use to me in this case; I need the full input.
As a desperate move I tried printing the same f-string I was trying to write to the logfile, and to my surprise it worked. It printed Carlos Pérez to the console without any kind of encoding/decoding process.
How does print work, and why did writing to the file not work? Most importantly, how can I write to a file with the same output as print?
I also tried the following:
logfile = open("/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/755bug.txt", mode="a", encoding="utf8")
logfile.write(cli_args.body)
logfile.close()
Output:
Traceback (most recent call last):
  File "/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/ticket_query_app.py", line 414, in <module>
    main()
  File "/usr/share/pandora_server/util/plugin/plugin_mcm/sandbox/ticket_query_app.py", line 81, in main
    logfile.write(cli_args.body)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed
I managed to get the text that is causing the issue:
if __name__ == "__main__":
    string = (
        "Buenos d\udcc3\udcadas,\r\n\r\n"
        "Mediante monitoreo autom\udcc3\udca1tico se ha detectado un evento fuera de lo normal:\r\n\r\n"
        "Descripci\udcc3\udcb3n del evento: _snmp_f13_\r\n"
        "Causas sugeridas del evento: _snmp_f14_\r\n"
        "Posible afectaci\udcc3\udcb3n del evento: _snmp_f15_\r\n"
        "Validaciones de bajo impacto: _snmp_f16_\r\n"
        "Fecha y hora del evento: 2021-07-14 17:47:51\r\n\r\n"
        "Saludos."
    )
    # Output: the text, with the accented characters displayed correctly
    print(string)
    # Output: "UnicodeEncodeError: 'utf-8' codec can't encode characters in position 8-9: surrogates not allowed"
    with open(file="test.log", mode="w", encoding="utf8") as logfile:
        logfile.write(string)
Upvotes: 0
Views: 539
Reputation: 177554
It looks like something upstream is misconfigured. Your string appears to have been produced by a decode operation with the wrong encoding and errors='surrogateescape' error handling. From the data shown, it looks like the decoding operation tried to decode UTF-8-encoded text as ASCII.
errors='surrogateescape' is a way for an encoding to handle invalid bytes during a decode operation. The error handler replaces the invalid bytes with partial surrogates in the range U+DC80..U+DCFF when converting to a Unicode string, and the process can be reversed to get the original byte string back by performing an encode with errors='surrogateescape' and the same encoding.
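The round trip described above can be seen on a small example. This is only an illustrative sketch: the name "Pérez" is taken from the question, and the variable names are made up for the demonstration.

```python
# Round trip through errors='surrogateescape' (see PEP 383).
raw = "Pérez".encode("utf-8")  # the UTF-8 bytes: b'P\xc3\xa9rez'

# Decoding UTF-8 bytes as ASCII: each of the two non-ASCII bytes
# becomes one partial surrogate instead of raising an error.
escaped = raw.decode("ascii", errors="surrogateescape")
print(ascii(escaped))  # 'P\udcc3\udca9rez'

# Encoding with the same codec and the same error handler
# gives back exactly the original bytes.
restored = escaped.encode("ascii", errors="surrogateescape")
print(restored == raw)  # True
```

The surrogate U+DCC3 stands for the byte 0xC3 and U+DCA9 for the byte 0xA9, which is the same pattern visible in the string from the question.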
The partial surrogates in your string match the pattern of what a decode(encoding='ascii', errors='surrogateescape') call would produce when given data actually encoded in UTF-8: the surrogates are all in the range surrogateescape uses, and the bytes they correspond to form valid UTF-8. In the code below, I recover the original bytes, then decode them correctly as UTF-8. Once the Unicode string is valid, it can be written to the log file with encoding='utf8'.
string = (
    "Buenos d\udcc3\udcadas,\r\n\r\n"
    "Mediante monitoreo autom\udcc3\udca1tico se ha detectado un evento fuera de lo normal:\r\n\r\n"
    "Descripci\udcc3\udcb3n del evento: _snmp_f13_\r\n"
    "Causas sugeridas del evento: _snmp_f14_\r\n"
    "Posible afectaci\udcc3\udcb3n del evento: _snmp_f15_\r\n"
    "Validaciones de bajo impacto: _snmp_f16_\r\n"
    "Fecha y hora del evento: 2021-07-14 17:47:51\r\n\r\n"
    "Saludos."
)
# Recover the original bytes, then decode them with the correct codec.
fixed = string.encode('ascii', errors='surrogateescape').decode('utf8')
print(fixed)
with open(file="test.log", mode="w", encoding="utf8") as logfile:
    logfile.write(fixed)
You can read more about surrogate escapes in PEP 383.
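If the upstream configuration cannot be changed, the same idea could be applied to each CLI argument before logging. The sketch below is not part of the original script: repair_surrogates is a hypothetical helper name and the log path is a temporary file made up for the demonstration. It encodes with 'utf-8' rather than 'ascii' so that arguments that already contain properly decoded accented characters pass through unchanged. It also shows an alternative, which is to open the log file with errors='surrogateescape' so that the escaped bytes are written back out as they arrived.

```python
import os
import tempfile


def repair_surrogates(text: str) -> str:
    """Hypothetical helper: turn surrogate-escaped bytes back into text.

    Assumes the escaped bytes are valid UTF-8; otherwise the decode
    step raises UnicodeDecodeError.
    """
    return text.encode("utf-8", errors="surrogateescape").decode("utf-8")


damaged = "Carlos P\udcc3\udca9rez"  # what the script appears to receive
repaired = repair_surrogates(damaged)
print(repaired)  # Carlos Pérez

# Alternative: let the file object undo the escaping while writing.
log_path = os.path.join(tempfile.mkdtemp(), "demo.log")  # made-up path
with open(log_path, mode="a", encoding="utf-8", errors="surrogateescape") as logfile:
    logfile.write(damaged + "\n")  # written as the original UTF-8 bytes

with open(log_path, encoding="utf-8") as logfile:
    line = logfile.read()
print(line == "Carlos Pérez\n")  # True
```

Repairing the string once, right after argument parsing, keeps every later use of the value (logging, the Zendesk request) free of surrogates, whereas the errors='surrogateescape' file only fixes the log output.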
Upvotes: 1
Reputation: 54698
The answer is the encoding parameter to open. Observe:
Last login: Wed Jul 14 15:05:24 2021 from 50.126.68.34
[timrprobocom@jared-ingersoll ~]$ python3
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> f = open('x.txt','a')
>>> g = open('y.txt','a',encoding='utf-8')
>>> s = "spades \u2660 spades"
>>> f.write(s)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u2660' in position 7: ordinal not in range(128)
>>> g.write(s)
15
>>>
[timrprobocom@jared-ingersoll ~]$ hexdump -C y.txt
00000000 73 70 61 64 65 73 20 e2 99 a0 20 73 70 61 64 65 |spades ... spade|
*
00000011
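To see which defaults apply on a given machine, the values can be inspected directly. This is a small diagnostic sketch, not something taken from the session above. When no encoding argument is given, open() in text mode falls back to locale.getpreferredencoding(False), so an 'ascii' result (reported as 'ANSI_X3.4-1968' on some systems) would match the first traceback. As far as I know, Python 3.5 and later also give stdout the surrogateescape error handler under the C/POSIX locale, which, if that is the situation on the asker's machine, would explain why print succeeded where write failed.

```python
import locale
import sys

# Encoding that open() uses in text mode when `encoding` is omitted.
default_file_encoding = locale.getpreferredencoding(False)
print("open() default encoding:", default_file_encoding)

# Encoding and error handler used by print() through sys.stdout.
stdout_encoding = getattr(sys.stdout, "encoding", None)
stdout_errors = getattr(sys.stdout, "errors", None)
print("stdout encoding:", stdout_encoding)
print("stdout error handler:", stdout_errors)
```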
Upvotes: 2