Reputation: 9653
What is the encoding of Nginx's access.log? I'm trying to iterate through the file, but my script raises the exception "invalid byte sequence in UTF-8" when it reaches requests to my server that contain Chinese or Thai characters.
Upvotes: 3
Views: 3588
Reputation: 51
An HTTP request is mostly ASCII, but its data fields may contain arbitrary octets. See "Which encoding is used by the HTTP protocol?".
Nginx's error and access logs will reflect that.
In my experience with 450,000 records from a small North American site, all but one record decoded as ASCII without error. That record contained 4 consecutive bytes (b'\xb8E\x8c\xde') that were invalid UTF-8 but valid big5hkscs (one of Python's codecs for Traditional Chinese), under which they decoded to two glyphs.
See "How can I programmatically find the list of codecs known to Python?" to get a list of codec names for a brute-force attack on the non-ASCII bytes.
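A rough sketch of that brute-force approach, using the four bytes quoted above. The function name candidate_codecs is made up for illustration, and taking the codec names from encodings.aliases is just one way to enumerate them, not necessarily the one the linked answer uses:

```python
import encodings.aliases


def candidate_codecs(data: bytes) -> dict[str, str]:
    """Return {codec_name: decoded_text} for every codec that decodes data without error."""
    results = {}
    # The values of the alias table are the canonical names of the codecs Python ships with.
    for name in sorted(set(encodings.aliases.aliases.values())):
        try:
            results[name] = data.decode(name)
        except (ValueError, LookupError):
            # ValueError covers UnicodeDecodeError; LookupError covers codecs that are
            # not text encodings (e.g. hex_codec) or are missing on this platform (e.g. mbcs).
            continue
    return results


sample = b"\xb8E\x8c\xde"  # the four bytes from the one record that was not ASCII
matches = candidate_codecs(sample)
for name, text in matches.items():
    # !a prints an ASCII-safe repr, so this works on any terminal encoding
    print(f"{name}: {text!a}")
```

Keep in mind that a codec decoding without error proves little on its own: single-byte codecs such as latin_1 accept any byte sequence, so you still have to look at the output and judge which result is plausible.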
Decoding the binary log records to ASCII, with '?' replacements for errors, was good enough for my needs.
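Something along these lines should do it. This is a minimal sketch: the helper names and the log path are hypothetical, so adjust them to your setup. Note that Python's errors="replace" handler emits U+FFFD when decoding, so the sketch swaps that for '?' to keep the output pure ASCII:

```python
from typing import Iterator


def decode_log_line(raw: bytes) -> str:
    """Decode one raw log record as ASCII, turning every undecodable byte into '?'."""
    text = raw.decode("ascii", errors="replace")  # bad bytes become U+FFFD
    return text.replace("\ufffd", "?")


def iter_log_lines(path: str) -> Iterator[str]:
    """Yield decoded records; the file is opened in binary mode so reading never raises."""
    with open(path, "rb") as log:
        for raw in log:
            yield decode_log_line(raw.rstrip(b"\r\n"))


print(decode_log_line(b"GET /\xb8E\x8c\xde HTTP/1.1"))  # prints: GET /?E?? HTTP/1.1
```

Usage would be something like `for line in iter_log_lines("/var/log/nginx/access.log"): ...`, where that path is only an assumed default location.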
Upvotes: 1