Tom

Reputation: 9653

What is the encoding format of nginx's access log?

What is the encoding of Nginx's access.log? I'm trying to iterate through the file but my script raises an exception "invalid byte sequence in UTF-8" when it sees requests to my server that have Chinese/Thai characters in them.

Upvotes: 3

Views: 3588

Answers (1)

jkellydresser

Reputation: 51

An HTTP request is mostly ASCII, but its data fields are allowed to contain arbitrary octets. See "Which encoding is used by the HTTP protocol?".

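Nginx's error and access logs reflect that: they are mostly ASCII, but a record can contain bytes in whatever encoding the client sent, so the file as a whole has no single guaranteed encoding.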

In my experience with 450,000 records from a small North American site, all but one record decoded as ASCII without error. That one record contained four consecutive bytes (b'\xb8E\x8c\xde') that were invalid UTF-8 but valid big5hkscs (Python's codec for Traditional Chinese), where they decoded to two glyphs.

See "How can I programmatically find the list of codecs known to Python?" to get a list of codec names for a brute-force attack on the non-ASCII bytes.
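A minimal sketch of that brute-force approach, assuming the codec names are taken from the standard library's `encodings.aliases` table (one of several ways to enumerate them). The function name `candidate_decodings` is made up for illustration. Many single-byte codecs accept almost any input, so a successful decode only means the bytes are valid in that codec, not that the result is the intended text.

```python
import encodings.aliases


def candidate_decodings(raw: bytes) -> dict:
    """Try every codec named in encodings.aliases and keep the ones that succeed.

    Returns a dict mapping codec name -> decoded text.
    """
    results = {}
    for name in sorted(set(encodings.aliases.aliases.values())):
        try:
            results[name] = raw.decode(name)
        except (LookupError, ValueError):
            # LookupError: codec is not a text encoding or is unavailable
            # on this platform. ValueError covers UnicodeDecodeError:
            # the bytes are not valid in this codec.
            continue
    return results


if __name__ == "__main__":
    # The four bytes quoted in the answer above.
    for codec, text in candidate_decodings(b"\xb8E\x8c\xde").items():
        print(f"{codec}: {text!r}")
```

The output still has to be judged by eye or with some knowledge of which languages the site's visitors are likely to use.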

Decoding the binary log records to ASCII, with '?' replacements for errors, was good enough for my needs.
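A rough sketch of that approach, assuming the log is read in binary mode and each line is decoded separately. The names `decode_log_line` and `iter_log_lines` are made up for illustration. Note that Python's `errors="replace"` inserts U+FFFD when decoding, so the sketch swaps it for '?' afterwards.

```python
from typing import Iterator


def decode_log_line(raw: bytes) -> str:
    """Decode one log record as ASCII, turning each undecodable byte into '?'."""
    # errors="replace" substitutes U+FFFD for every byte that is not ASCII.
    text = raw.decode("ascii", errors="replace")
    return text.replace("\ufffd", "?")


def iter_log_lines(path: str) -> Iterator[str]:
    """Yield decoded records from a log file without raising on odd bytes."""
    # Binary mode avoids any implicit decoding by open().
    with open(path, "rb") as log:
        for raw in log:
            yield decode_log_line(raw.rstrip(b"\r\n"))
```

For example, `decode_log_line(b"GET /\xb8E\x8c\xde HTTP/1.1")` keeps the ASCII parts of the request line and replaces only the three non-ASCII bytes. This loses the original characters, which is acceptable when only the ASCII fields (IP, timestamp, status) matter.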

Upvotes: 1
