a283626086

Reputation: 31

How do I grep for special characters (control characters) using their hex representation?

While helping someone debug some code, I noticed some weird characters in their output, namely � and � (\xc0 and \xd0 in hex).

I wanted to find these characters in a large text output file.

I've managed to locate these characters in Sublime by enabling the regex option in Find and using \xc0 or \xd0 as the query. I have also managed to grep for them by running grep $'\xc0' filename in bash.

What bothers me is that grep refuses to find these characters when I use the -P option.

grep -P "\xc0" filename prints nothing for a file that contains that character (while the other two methods above find it), and I would really like to know why this doesn't work.

I have read a couple of other posts in which the -P option along with "[\x80-\xff]" are suggested but for some reason I just couldn't get them to work :\

grep -P has been a good friend for a long time until now :( Any help and tips are appreciated!

I'm using GNU grep.

EDIT:

I have actually tried this on two Linux distributions.

On the first one, printf "\xc0" prints nothing in the terminal; however, redirecting it to a file with > and then opening the file in Sublime shows the character.

printf "\xc0" > foo
grep $'\xc0' foo > out1
grep -P '\xc0' foo > out2
grep -P '\x{c0}' foo > out3

out{1,2,3} are all empty.

On the second one, printf "\xc0" prints � (it actually looks like this).

printf "\xc0" > foo
grep $'\xc0' foo > out1
grep -P '\xc0' foo > out2
grep -P '\x{c0}' foo > out3

Only out1 contains the character.

Upvotes: 1

Views: 4219

Answers (1)

user8017719

byte

The first thing to do is to store in a variable the exact byte that you want to search for.

Any of these would do:

a=$(echo -e '\xc0')
a=$'\xc0'
a=$(printf '\xc0')
a=$(echo -e '\300')     # 300 is 0xC0 in octal
a=$'\300'
a=$(printf '\300')
a=$(echo "c0" | xxd -r -p)

I could try to come up with some other ways, but I hope you get the idea.
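Before searching, it may be worth confirming that the variable really holds the single byte 0xC0. The following is only a minimal sketch: it assumes od and tr are available, and it uses the octal escape \300 because that form is understood by any POSIX printf. The variable name hex is introduced here purely for illustration.

```shell
# Build the byte with an octal escape (\300 is 0xC0).
a=$(printf '\300')

# Dump the variable's content as hex, stripping od's padding and newline.
hex=$(printf '%s' "$a" | od -An -tx1 | tr -d ' \n')

echo "$hex"    # c0
```

If this prints anything other than c0, the variable does not contain the byte you think it does, and no grep invocation will help.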

Then, you could try to search for the byte with grep:

echo $'Testing this: \xC0 byte' |  grep "$a"

If you use a UTF-8 locale (the most common case), that may fail, since a lone 0xC0 byte is not a valid UTF-8 sequence. If you switch to an ISO-8859-1 locale, it will work:

LC_ALL=en_US.iso88591 echo $'Testing this: \xC0 byte' |
LC_ALL=en_US.iso88591  grep -P "$a"

Or, if you don't mind starting a new bash instance:

$ bash
$ export LC_ALL=en_US.iso88591
$ echo $'Testing this: \xC0 byte' |  grep -P "$a"

Then return to the old bash environment by running exit.
This may or may not work, depending on which locales are installed on your system.
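If the ISO-8859-1 locale is not installed, the C locale is a possible alternative, since it is always present and makes grep work on single bytes. This is a sketch, assuming GNU grep built with -P (PCRE) support; the temporary file and the variable names are illustrative only.

```shell
# A throwaway file: one line with the raw byte 0xC0 (octal 300), one pure ASCII line.
tmp=$(mktemp)
printf 'Testing this: \300 byte\nplain ASCII line\n' > "$tmp"

# In the C locale, grep -P works byte by byte, so \xc0 means "the byte 0xC0".
# -c prints the number of matching lines.
count=$(LC_ALL=C grep -c -P '\xc0' "$tmp")

# The same approach finds lines holding any byte outside the ASCII range.
nonascii=$(LC_ALL=C grep -c -P '[\x80-\xff]' "$tmp")

echo "$count $nonascii"
rm -f "$tmp"
```

To print the matching lines instead of counting them, drop -c; some grep versions may then report "Binary file matches", in which case adding -a should help.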

Let's explore the other side: characters.

character

There is a very important twist that you should understand.
A byte is not a character. Well, sometimes, by sheer luck, it is.

But apart from the 128 ASCII characters, for which one byte is one character (in UTF-8, that is; not in UTF-16 or UTF-32, and let's forget about EBCDIC), every other one of the 1,114,112 (17 × 65,536) Unicode code points takes more than one byte to encode.

In that case, you should ask for the UNICODE code point of hex 0xC0.
In modern bash, like this:

$ printf '\U00C0'
À

Which is this character: LATIN CAPITAL LETTER A WITH GRAVE

That will be encoded as one byte if the locale is ISO-8859-1 (and ISO-8859-15, at least) and as two bytes if the locale is utf-8.
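The difference in encoded length can be checked directly. The sketch below assumes iconv and od are installed; it starts from the single ISO-8859-1 byte (written as octal \300) and converts it to UTF-8, so it does not depend on the current locale. The variable names are illustrative.

```shell
# The character as one ISO-8859-1 byte.
latin1_hex=$(printf '\300' | od -An -tx1 | tr -d ' \n')

# The same character converted to UTF-8.
utf8_hex=$(printf '\300' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "ISO-8859-1: $latin1_hex"   # c0
echo "UTF-8:      $utf8_hex"     # c380
```

So in a UTF-8 file the character À is the byte pair 0xC3 0x80, and a lone 0xC0 byte never appears in valid UTF-8 text.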

$ a=$(printf '\UC0')
$ printf 'Testing \U00C0 character' | grep -P "$a"
Testing À character

It will also work if you change the LC_ALL variable. That is, grep will detect the character, but the printed line may not render the character correctly because of the changed locale.

If the file contains this character and the file's encoding matches the locale, grep will work with the value of the character stored in a variable.
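To make the byte/character distinction concrete, here is a small sketch that searches a UTF-8 encoded file in the C locale, where grep sees only bytes. It assumes GNU grep with -P support; the file content and variable names are made up for the example.

```shell
tmp=$(mktemp)
# \303\200 is the UTF-8 encoding of U+00C0 (bytes 0xC3 0x80).
printf 'Testing \303\200 character\nno accent here\n' > "$tmp"

# Searching for the two-byte sequence finds the line with the character.
char_hits=$(LC_ALL=C grep -c -P '\xc3\x80' "$tmp")

# Searching for the lone byte 0xC0 finds nothing: that byte is not in the file.
# grep exits with status 1 when nothing matches, hence the "|| true".
byte_hits=$(LC_ALL=C grep -c -P '\xc0' "$tmp" || true)

echo "$char_hits $byte_hits"
rm -f "$tmp"
```

This is the reverse of the situation in the question: there the file held the raw byte 0xC0, so it was the byte search that had to be used.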

Upvotes: 1
