bash: sed and/or grep having problems with specific line

Question

For a course called 'Programming Techniques', I have to scan a file with lines having the following format:

[IP-Address] - - [[Date and time]] "GET [some URL]" [HTTP reply code] [some non-interesting number]

An example:

129.232.223.206 - - [30/Apr/1998:22:00:02 +0000] "GET /images/home_intro.anim.gif HTTP/1.0" 200 60349

My task is to scan every line and extract its HTTP reply code, but only if that code is not 200.

We have to use the command line. The following almost works:

The steps are: read in the file, remove everything up to and including the second " and the space after it, remove everything from the first space to the end of the line, remove lines with 200, sort the numbers, remove duplicates, and write the remaining numbers to a file.
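The command itself is not reproduced above, so the following is only a sketch of the steps as described, not the original command. The function name `extract_codes` and the file names `access.log` and `codes.txt` are placeholders, and matching `200` exactly with `^200$` is an assumption about how the filter was written.

```shell
# Sketch of the described pipeline; extract_codes is a hypothetical helper name.
# It reads log lines on stdin and prints the distinct non-200 reply codes.
extract_codes() {
  sed 's/^[^"]*"[^"]*" //' |   # drop everything up to the second " and the space after it
    sed 's/ .*//' |            # drop everything from the first space to the end
    grep -v '^200$' |          # discard lines that are exactly 200
    sort -n |                  # sort numerically
    uniq                       # remove adjacent duplicates
}

# Usage (placeholder file names):
# extract_codes < access.log > codes.txt
```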

This produces the following output:

-
206
26.146.85.150ÀüŒÛ/ HTTP/1.0" 404 305
302
304
400
404
500

As we can see, it almost works. There is one line causing trouble:

26.146.85.150 - - [01/May/1998:16:47:28 +0000] "GET /images/home_fr_phra>

`\C0\FC\BC\DB/ HTTP/1.0" 404 305`

This line causes the weird third line of output. What is wrong with it? The only thing I can think of is the part `\C0\FC\BC\DB`. Backslashes always seem to cause trouble. So, what part of my command conflicts with this line?
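A likely explanation, assuming GNU sed running in a UTF-8 locale: `\C0\FC\BC\DB` is how the pager displays four raw bytes (0xC0 0xFC 0xBC 0xDB), not literal backslashes, and those bytes do not form valid UTF-8. Patterns such as `.` and `[^"]` then refuse to match them, so the substitution stops short on that line. The sketch below forces byte-wise matching with `LC_ALL=C`. The function name is hypothetical and this is one possible workaround, not a confirmed diagnosis.

```shell
# Hypothetical variant of the pipeline that treats input as plain bytes.
# LC_ALL=C makes sed match every byte with . and [^"], valid UTF-8 or not.
extract_codes_bytewise() {
  LC_ALL=C sed 's/^[^"]*"[^"]*" //; s/ .*//' |  # keep only the reply code
    grep -v '^200$' |                           # discard lines that are exactly 200
    LC_ALL=C sort -n |                          # sort numerically, byte-wise
    uniq                                        # remove adjacent duplicates
}

# Usage (placeholder file names):
# extract_codes_bytewise < access.log > codes.txt
```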

Also, I noticed that if I switched `sort` and `uniq`, the file does get sorted, but duplicates do not get removed. Why?
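For reference, `uniq` only collapses duplicate lines that are adjacent, which would explain this: when it runs before `sort`, equal codes that are still scattered through the input all survive. A small illustration with made-up codes:

```shell
# Three sample codes with a non-adjacent duplicate (illustrative data only).
codes='404
302
404'

# uniq first: the two 404 lines are not adjacent, so both are kept.
uniq_first=$(printf '%s\n' "$codes" | uniq | sort -n)

# sort first: duplicates become adjacent, so uniq can remove them.
sort_first=$(printf '%s\n' "$codes" | sort -n | uniq)

printf 'uniq then sort:\n%s\n' "$uniq_first"   # 302, 404, 404
printf 'sort then uniq:\n%s\n' "$sort_first"   # 302, 404
```

`sort -nu` does the sorting and the de-duplication in one step.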

(By the way, I'm relatively new to using the command line for the purposes described above.)
