Reputation: 154
For a course called 'Programming Techniques', I have to scan a file with lines having the following format:
[IP-Address] - - [[Date and time]] "GET [some URL]" [HTTP reply code] [some non-interesting number]
An example:
129.232.223.206 - - [30/Apr/1998:22:00:02 +0000] "GET /images/home_intro.anim.gif HTTP/1.0" 200 60349
My task is to scan all lines and extract the HTTP reply code from each one, but only if that code is not equal to 200.
We have to use the command line. The following almost works:
cat file.out | sed 's/^.*\"[[:space:]]//' | sed 's/[[:space:]].*//' | grep -v '200' | sort | uniq 1> result1.txt
First, read in the file, remove everything up until the second `"` and the space after it, remove everything from the first space to the end, remove lines with 200, sort the numbers, remove duplicates, and send the remaining numbers to a file.
This produces the following output:
-
206
26.146.85.150ÀüŒÛ/ HTTP/1.0" 404 305
302
304
400
404
500
As we can see, it almost works. There is one line causing trouble:
26.146.85.150 - - [01/May/1998:16:47:28 +0000] "GET /images/home_fr_phra><HR><H3>\C0\FC\BC\DB/ HTTP/1.0" 404 305
This line causes the weird third output line. What is wrong with this line? The only thing I can think of is the part `\C0\FC\BC\DB`. Backslashes always seem to cause trouble. So, what part of my command conflicts with this line?
Also, I noticed that if I switch `sort` and `uniq`, the file does get sorted, but duplicates do not get removed. Why?
(By the way, I'm relatively new to using the command line for the purposes described above.)
Upvotes: 0
Views: 126
Reputation: 44063
So, this looks like an encoding SNAFU. If I'm not mistaken, what's happening is:
`sed` attempts to read the file as UTF-8 because of the UTF-8 locale in use, and `sed` breaks because of this (in particular, `.` does not match the offending bytes).
The stuff with the backslashes denotes a series of four bytes by their hex values, that is, C0 FC BC DB. This is not valid UTF-8-encoded data. [1]
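If you want to check the "not valid UTF-8" claim yourself, here is a minimal sketch. It assumes `iconv` is installed and that, as is usual, it exits with a non-zero status when the input cannot be decoded in the given source encoding. The variable name `utf8_check` is just for illustration, and the bytes are written as octal escapes because `printf` handles those portably.

```shell
# The four bytes C0 FC BC DB, written as octal escapes (300 374 274 333).
# Ask iconv to decode them as UTF-8; a non-zero exit status means it could not.
if printf '\300\374\274\333' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    utf8_check="valid"
else
    utf8_check="invalid"
fi
echo "$utf8_check"
```

On a typical system this should report the sequence as invalid.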
Given a UTF-8 locale, (GNU) `sed` interprets input as UTF-8, and `.` matches a valid UTF-8 character. It does not match invalid byte sequences. You can see this by running
echo -e '\xc0\xfc\xbc\xdb' | sed 's/.//g'
in a UTF-8 locale and noticing that the output is not empty. I am inclined to agree that this behavior is a bit of a nuisance, but here we are.
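For contrast, here is a sketch of the same experiment with the locale forced to `C` for `sed` only. The assumption is that in the `C` locale `sed` treats every byte as one character, so `.` should match all four bytes. The variable name `remaining` is made up for this example.

```shell
# Same four bytes (C0 FC BC DB as octal escapes), followed by a newline.
# With LC_ALL=C, sed works byte by byte, so 's/.//g' should delete all of them.
# wc -c then counts what is left; tr strips the padding some wc versions add.
remaining=$(printf '\300\374\274\333\n' | LC_ALL=C sed 's/.//g' | wc -c | tr -d ' ')
echo "$remaining"
```

If only the trailing newline survives, the count is 1, which means no byte escaped the substitution.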
Since you don't seem to rely on any Unicode features, the solution could be to run `sed` with a non-UTF-8 locale, such as `C`. In your case:
cat file.out | LC_ALL=C sed 's/^.*\"[[:space:]]//' \
| LC_ALL=C sed 's/[[:space:]].*//' \
| grep -v '200' \
| sort \
| uniq \
> result1.txt
(line breaks added for readability). By the way, you could combine the two `sed` commands into a single one as follows:
LC_ALL=C sed 's/^.*\"[[:space:]]//; s/[[:space:]].*//'
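As for switching `sort` and `uniq`: `uniq` only collapses duplicate lines that are adjacent, so it has to run after `sort`. A small sketch of the difference follows. The sample reply codes and the variable names are made up for illustration.

```shell
# Hypothetical sample of extracted reply codes; the duplicates are not adjacent.
codes=$(printf '404\n304\n404\n304\n')

# uniq first: no two equal lines are neighbors, so nothing is removed,
# and sort then merely orders the four lines.
uniq_then_sort=$(printf '%s\n' "$codes" | uniq | sort)

# sort first: equal lines become neighbors, so uniq can collapse them.
sort_then_uniq=$(printf '%s\n' "$codes" | sort | uniq)

printf 'uniq | sort:\n%s\n' "$uniq_then_sort"
printf 'sort | uniq:\n%s\n' "$sort_then_uniq"
```

The first variant keeps all four lines, the second keeps one line per code. `sort -u` does the sorting and deduplication in one step.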
[1] `c0` would indicate a two-byte UTF-8 code whose uppermost five bits are zero, which already makes no sense since it could be encoded as plain ASCII, and `fc` does not begin with the `10` bits in the uppermost half-nibble that the UTF-8 encoding would require there. So, although I am unsure what exactly their encoding is, it is definitely not UTF-8.
Upvotes: 1