Kevin Burke

Reputation: 64774

Inconsistent file behavior

I'm trying to track down a Python UnicodeDecodeError in the following log line:

10.210.141.123 - - [09/Nov/2011:14:41:04 -0800] "gfR\x15¢\x09ì|Äbk\x0F[×ÐÖà\x11CEÐÌy\x5C¿DÌj\x08Ï ®At\x07å!;f>\x08éPW¤\x1C\x02ö*6+\x5C\x15{,ªIkCRA\x22 xþP9â\x13h\x01­¢è´\x1DzõWiË\x5C\x10sòʨR)¶²\x1F8äl¾¢{ÆNw\x08÷@ï" 400 166 0.000 "-" "-"

I opened the entire log file in Vim, and then yanked the line into a new file so I could test just the one line. However, my parsing script works OK with the new file - it doesn't throw a UnicodeDecodeError. I don't understand why the one file would generate an error and the other one would not, when they are (on the surface) identical.

Here's what I tried. I ran enca to determine the file encoding, but it complained that it "Cannot determine (or understand) your language preferences". file -i reports both files as regular files. I also deleted all of the other lines from the original log file and still got the error in one file and no error in the other. I also tried deleting

set encoding=utf-8 

from my .vimrc, writing the file again, and I still got the error in one file and not in the other.

The logs are nginx logs. Nginx has this note in their release notes:

*) Change: now the 0x00-0x1F, '"' and '\' characters are escaped as \xXX
   in an access_log.
   Thanks to Maxim Dounin.

My Python script has with open('log_file') as f and the error comes up when I try to call json.dumps on a dict.

How can I track this down?

Upvotes: 0

Views: 209

Answers (1)

John Machin

Reputation: 82924

Your question: How can I track this down?

Answer:

(1) Show us the full text of the error message that you got -- without knowing what encoding you were trying to use, we can't tell you anything. A traceback and a snippet of code that reads the file and reproduces the error would also be handy.

(2) Write a tiny Python script to find the line in the file and then do:

print repr(the_line)    # Python 2.x
print(ascii(the_line))  # Python 3.x

and copy/paste the result into an edit of your question, so that we can see unambiguously what is in the line.

(3) It does look like random gibberish except for the ­ but do tell us whether you expect that line to be text (if so, in what human language?).
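For step (2), here is a minimal sketch of such a script. It assumes Python 3 and assumes the codec that fails is UTF-8; the question doesn't say which encoding was in use, so treat that default as a guess and change it to match your traceback. The function name find_undecodable_lines is made up for illustration. It reads the file in binary mode so that nothing is decoded implicitly, and reports each line whose raw bytes fail to decode.

```python
import sys


def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, raw_bytes, error) for each line that fails to decode.

    The default encoding is an assumption; pass the codec named in your traceback.
    """
    bad = []
    # Binary mode: we get the exact bytes on disk, with no implicit decoding.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                bad.append((lineno, raw, exc))
    return bad


if __name__ == "__main__" and len(sys.argv) > 1:
    for lineno, raw, exc in find_undecodable_lines(sys.argv[1]):
        print(lineno, exc)
        # repr() of a bytes object shows every non-ASCII byte as \xNN.
        print(repr(raw))
```

Running it on both the original log and the yanked copy should show whether the two files really contain the same bytes. If the reprs differ, the bytes were changed when the copy was written.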

Upvotes: 1
