limitIntegral314

Reputation: 154

bash: sed and/or grep having problems with specific line

For a course called 'Programming Techniques', I have to scan a file with lines having the following format:

[IP-Address] - - [[Date and time]] "GET [some URL]" [HTML reply code] [some non-interesting number]

An example:

129.232.223.206 - - [30/Apr/1998:22:00:02 +0000] "GET /images/home_intro.anim.gif HTTP/1.0" 200 60349

My task is to scan all lines and extract from it the HTTP reply code only if this code is not equal to 200.

We have to use the command line. The following almost works:

cat file.out | sed 's/^.*\"[[:space:]]//' | sed 's/[[:space:]].*//' | grep -v '200' | sort | uniq 1> result1.txt

First, read in the file, remove everything up until the second " and the space after it, remove everything from the first space to the end, remove lines with 200, sort the numbers, remove duplicates, and send the remaining numbers to a file.

This produces the following output:

-
206
26.146.85.150ÀüŒÛ/ HTTP/1.0" 404 305
302
304
400
404
500

As we can see, it almost works. There is one line causing trouble:

26.146.85.150 - - [01/May/1998:16:47:28 +0000] "GET /images/home_fr_phra><HR><H3>\C0\FC\BC\DB/ HTTP/1.0" 404 305

This line causes the weird third output-line. What is wrong with this line? The only thing I can think of is the part \C0\FC\BC\DB. Backslashes always seem to cause trouble. So, what part of my command conflicts with this line?

Also, I noticed that if I switched sort and uniq, the file does get sorted, but duplicates do not get removed. Why?

(By the way, I'm relatively new to using the command line for the purposes described above.)

Upvotes: 0

Views: 126

Answers (1)

Wintermute

Reputation: 44063

So, this looks like an encoding SNAFU. If I'm not mistaken, what's happening is:

  1. You're using a UTF-8 locale,
  2. The input file does not contain valid UTF-8,
  3. sed attempts to read the file as UTF-8 because of the aforementioned locale, and
  4. sed breaks because of this (in particular, . does not match the offending bytes).

The stuff with the backslashes denotes a series of four bytes by their hex values, that is C0 FC BC DB. This is not valid UTF-8-encoded data. [1]

Given a UTF-8 locale, (GNU) sed interprets input as UTF-8, and . matches a valid UTF-8 character. It does not match invalid byte sequences. You can see this by running

echo -e '\xc0\xfc\xbc\xdb' | sed 's/.//g'

in a UTF-8 locale and noticing that the output is not empty. I am inclined to agree that this behavior is a bit of a nuisance, but here we are.
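To make the comparison easier to repeat, here is a minimal sketch that counts how many bytes are left after the same substitution under a chosen locale. The function names sample_bytes and surviving_bytes are made up for this illustration, and the locale name en_US.UTF-8 in the comment is an assumption about what is installed on your system.

```shell
# Emit the four bytes C0 FC BC DB (written as octal escapes, which every
# POSIX printf understands) followed by a newline.
sample_bytes() {
  printf '\300\374\274\333\n'
}

# Delete every character that '.' matches under the locale given as $1,
# then count the bytes that are left over.
surviving_bytes() {
  sample_bytes | LC_ALL="$1" sed 's/.//g' | wc -c | tr -d ' '
}

# In the C locale every byte counts as a character, so only the newline remains.
surviving_bytes C

# With GNU sed and an installed UTF-8 locale the count should be larger,
# because the invalid bytes are not matched by '.':
#   surviving_bytes en_US.UTF-8
```

If the second call reports the same count as the first, the UTF-8 locale is probably not installed and sed silently fell back to the C locale.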

Since you don't seem to rely on any Unicode features, the solution could be to run sed with a non-UTF-8 locale, such as C. In your case:

cat file.out | LC_ALL=C sed 's/^.*\"[[:space:]]//' \
             | LC_ALL=C sed 's/[[:space:]].*//' \
             | grep -v '200' \
             | sort \
             | uniq \
             > result1.txt

(Line breaks added for readability.) By the way, you could combine the two sed commands into a single one as follows:

LC_ALL=C sed 's/^.*\"[[:space:]]//; s/[[:space:]].*//'
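For reference, the whole thing can be tried on a tiny stand-in for the log. This is only a sketch: make_sample_log and extract_codes are names invented here, the second sample line (the 304 one) is made up, and grep -x and sort -u are small substitutions of mine for grep -v '200' and sort | uniq, not something the original pipeline used.

```shell
# Hypothetical three-line log: the example line from the question, one
# invented 304 line, and the troublesome line with the raw bytes C0 FC BC DB.
make_sample_log() {
  printf '%s\n' '129.232.223.206 - - [30/Apr/1998:22:00:02 +0000] "GET /images/home_intro.anim.gif HTTP/1.0" 200 60349'
  printf '%s\n' '129.232.223.206 - - [30/Apr/1998:22:00:05 +0000] "GET /images/old.gif HTTP/1.0" 304 0'
  printf '26.146.85.150 - - [01/May/1998:16:47:28 +0000] "GET /images/home_fr_phra><HR><H3>\300\374\274\333/ HTTP/1.0" 404 305\n'
}

# Read log lines on stdin and print the distinct reply codes other than 200.
extract_codes() {
  # LC_ALL=C makes sed treat every byte as one character, so '.*' can
  # cross the invalid bytes.
  LC_ALL=C sed 's/^.*"[[:space:]]//; s/[[:space:]].*//' \
    | grep -v -x '200' \
    | sort -u
}

make_sample_log | extract_codes
```

On the order of sort and uniq: uniq only collapses duplicate lines that are adjacent, so it has to run after sort (or be replaced by sort -u, as here). Running uniq first leaves every duplicate that was not already next to its twin.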

[1] c0 would introduce a two-byte UTF-8 sequence whose five payload bits are all zero, which already makes no sense because such a code point could be encoded as plain ASCII, and fc does not begin with the bits 10 in its two uppermost positions, which UTF-8 requires of the byte that follows a lead byte. So, although I am unsure what the actual encoding is, it is definitely not UTF-8.
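The bit patterns in the footnote can be checked with a few lines of shell arithmetic. This is a rough sketch that only looks at the leading bits of each byte; it is not a full UTF-8 validator, and classify_byte is a name made up for this illustration.

```shell
# Classify one byte, given as two hex digits, by its leading bits:
#   0xxxxxxx ascii, 10xxxxxx continuation, 110xxxxx lead-2,
#   1110xxxx lead-3, 11110xxx lead-4, anything else invalid.
classify_byte() {
  value=$((0x$1))
  if   [ $((value & 0x80)) -eq 0 ];          then echo "ascii"
  elif [ $((value & 0xC0)) -eq $((0x80)) ];  then echo "continuation"
  elif [ $((value & 0xE0)) -eq $((0xC0)) ];  then echo "lead-2"
  elif [ $((value & 0xF0)) -eq $((0xE0)) ];  then echo "lead-3"
  elif [ $((value & 0xF8)) -eq $((0xF0)) ];  then echo "lead-4"
  else                                            echo "invalid"
  fi
}

# The four bytes from the problem line.
for byte in c0 fc bc db; do
  printf '%s %s\n' "$byte" "$(classify_byte "$byte")"
done
```

c0 comes out as a two-byte lead (and is overlong on top of that, as the footnote says), but the byte after it, fc, is not a continuation byte, so the sequence is already broken at the second byte.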

Upvotes: 1
