While helping someone debug some code, I noticed some weird characters in their output, namely � and � (\xc0 and \xd0 in hex).
I wanted to find these characters in a large text output file.
I've managed to locate these characters using sublime by enabling the regex option in find with \xc0
or \xd0
being the query. I have also managed to grep
them by doing grep $'\xc0' filename
in bash.
The thing that is bothering me right now is that, if I use the -P
option for grep
, it refuses to find these characters.
grep -P "\xc0" filename
prints nothing for a file that has that character in it (while the other two methods above find it successfully). This is bugging me so badly that I want to know why it doesn't work.
I have read a couple of other posts in which the -P
option along with "[\x80-\xff]"
are suggested but for some reason I just couldn't get them to work :\
grep -P
has been a good friend for a long time until now :( Any help and tips are appreciated!
I'm using GNU grep.
EDIT:
I have actually tried this on 2 Linux distributions. On the first one:
printf "\xc0"
prints out nothing in the terminal, however printing it to a file with >
and then opening in sublime would show the character.
printf "\xc0" > foo
grep $'\xc0' foo > out1
grep -P '\xc0' foo > out2
grep -P '\x{c0}' foo > out3
out{1,2,3}
are all empty.
On the second distribution, printf
prints something, the dark question-mark glyph:
printf "\xc0"
prints out � (it actually looks like this)
printf "\xc0" > foo
grep $'\xc0' foo > out1
grep -P '\xc0' foo > out2
grep -P '\x{c0}' foo > out3
Only out1
contains the character.
Upvotes: 1
The first thing to do is store the exact byte that you want to search for in a variable.
Any of these will do:
a=$(echo -e '\xc0')
a=$'\xc0'
a=$(printf '\xc0')
a=$(echo -e '\300') # 300 is 0xC0 in octal
a=$'\300'
a=$(printf '\300')
a=$(echo "c0" | xxd -r -p)
I could try to come up with some other ways, but I hope you get the idea.
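Whichever form you pick, it can be worth confirming that the variable really holds the single byte 0xC0 before searching with it. A minimal sketch, assuming a POSIX shell with od available (the name byte_hex is only illustrative):

```shell
# Build the byte with the octal escape, which printf supports portably.
a=$(printf '\300')

# Dump the variable's contents as hex; tr strips od's padding and newline.
byte_hex=$(printf '%s' "$a" | od -An -tx1 | tr -d ' \n')
echo "a holds: $byte_hex"
```

If this prints anything other than c0, the escape was not interpreted the way you expected.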
Then, you could try to search for the byte
with grep:
echo $'Testing this: \xC0 byte' | grep "$a"
If you use a UTF-8 locale (the most common case), that will fail. If you change to an ISO-8859-1 locale, it will work:
LC_ALL=en_US.iso88591 echo $'Testing this: \xC0 byte' |
LC_ALL=en_US.iso88591 grep -P "$a"
Or, if you don't mind starting a new bash instance:
$ bash
$ export LC_ALL=en_US.iso88591
$ echo $'Testing this: \xC0 byte' | grep -P "$a"
And just return to the old bash environment by executing exit
.
This might work or not depending on your system.
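If no ISO-8859-1 locale is installed, the C locale is another single-byte option that should be present on practically every system. A sketch, assuming GNU grep built with PCRE support for -P; the temporary file and the names sample and matches are illustrative:

```shell
# See which ISO-8859 locales exist (names vary by system; there may be none).
locale -a 2>/dev/null | grep -i '8859' || echo "no ISO-8859 locales listed"

# Sample line containing the raw byte 0xC0 (octal 300).
sample=$(mktemp)
printf 'Testing this: \300 byte\n' > "$sample"

# In the C locale grep -P works on bytes, so \xc0 means "the byte 0xC0".
matches=$(LC_ALL=C grep -c -P '\xc0' "$sample")
echo "matching lines in C locale: $matches"

rm -f "$sample"
```

The count is used here instead of printing the line, since the matched byte may not render in a UTF-8 terminal anyway.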
Let's explore the other side: characters.
There is a very very important twist that you should understand.
A byte is not a character. Well, sometimes, by sheer luck, it is.
But apart from the 128 ASCII characters, for which a byte is a character (not in UTF-16 or UTF-32, and let's also forget about EBCDIC), the rest of the 1,114,112 (17 × 65,536) Unicode code points take more than one byte in UTF-8.
In that case, you should ask for the UNICODE code point of hex 0xC0
.
In modern bash, like this:
$ printf '\U00C0'
À
Which is this character: LATIN CAPITAL LETTER A WITH GRAVE
That will be encoded as one byte if the locale is ISO-8859-1 (and ISO-8859-15, at least) and as two bytes if the locale is utf-8.
$ a=$(printf '\UC0')
$ printf 'Testing \U00C0 character' | grep -P "$a"
Testing À character
It will also work if you change the LC_ALL variable. That is, grep will detect the character, but the printed line may fail to render it correctly because of the changed locale.
If the file contains this character and the file's encoding matches the locale, grep will work with the value of the character stored in a variable.
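To see the one-byte versus two-byte encodings of À for yourself, you can dump the bytes directly. A small sketch, assuming od and iconv are installed (the variable names are illustrative):

```shell
# U+00C0 written out as its UTF-8 bytes (octal 303 200 = hex C3 80).
utf8_hex=$(printf '\303\200' | od -An -tx1 | tr -d ' \n')

# The same character converted to ISO-8859-1 becomes the single byte C0.
latin1_hex=$(printf '\303\200' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n')

echo "UTF-8: $utf8_hex  ISO-8859-1: $latin1_hex"
```

This is why a pattern for the character À only matches the lone byte \xc0 when both grep and the file are treated as ISO-8859-1.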
Upvotes: 1