user2056389

Reputation: 105

How to grep for exact hexadecimal value of characters

I am trying to grep for the hexadecimal values of a range of UTF-8 encoded characters, and I want only that specific range of characters to be returned. I currently have this:

grep -P -n "[\xB9-\xBF]" $str_st_location >> output_st.txt

But this returns every character that has any of those hex values in its hex representation, i.e. it returns anything from 00B9 to FFB9 as long as B9 is present.

Is there a way I can specify using grep that I only want the exact/specific hex value range I search for?

Sample Input:

STRING_OPEN
Open
æ–­å¼€
Ouvert
Abierto
Открыто
Abrir

With my grep command, it should return only the 3rd and 6th lines, but it also matches other text in my file that is Russian or Chinese, because the encodings of characters in those languages contain the hex values I'm searching for, like these:

断开
Открыто

I can't give out more sample input unfortunately as it's work related.

EDIT: Actually the below code snippet worked!

grep -P -n "[\x{00B9}-\x{00BF}]" $str_st_location > output_st.txt

It found all the corrupted characters and there were no false positives. The only issue now is that the lines with the corrupted characters automatically get "uncorrupted", i.e. when I open the output file, grep's output shows the corrected version of the corrupted characters. For example, it finds æ–­å¼€, but in the text file it is shown as 断开.

Upvotes: 4

Views: 6214

Answers (1)

mark4o

Reputation: 60843

Since you're using -P, you're probably using GNU grep, because that is a GNU grep extension. Your command works using GNU grep 2.21 with pcre 8.37 and a UTF-8 locale; however, there have been bugs in the past with multi-byte characters and character ranges. You're probably using an older version, or your locale may be set to one that uses single-byte characters, in which case the pattern is matched against individual bytes rather than whole characters.
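To see why a byte-wise match on [\xB9-\xBF] also picks up the Russian and Chinese lines, it can help to dump the bytes of a few characters. This is only a sketch: hex_bytes is a hypothetical helper name, not part of grep, and it assumes the script itself is saved as UTF-8.

```shell
# hex_bytes is a hypothetical helper for illustration only.
# It prints the raw bytes of its argument as space-separated hex.
# od works on bytes, so the result does not depend on the locale.
hex_bytes() {
    # Unquoted on purpose: word splitting collapses od's padding.
    set -- $(printf '%s' "$1" | od -An -tx1)
    echo "$*"
}

hex_bytes 'к'    # Cyrillic U+043A -> d0 ba
hex_bytes '开'   # CJK U+5F00      -> e5 bc 80
hex_bytes '¼'    # U+00BC          -> c2 bc
```

The trailing bytes ba and bc fall inside B9-BF, so a pattern applied to single bytes matches those lines even though the characters themselves are outside U+00B9 to U+00BF.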

If you don't want to upgrade, you can match this character range by matching individual bytes, which should work in older versions. You would need to convert the characters to bytes and search for the byte values. Assuming UTF-8, U+00B9 is C2 B9 and U+00BF is C2 BF. Setting LC_CTYPE to something that uses single-byte characters (like C) will ensure that it matches individual bytes even in versions that correctly support multi-byte characters. Note that if LC_ALL is set in your environment, it takes precedence over LC_CTYPE.

LC_CTYPE=C grep -P -n "\xC2[\xB9-\xBF]" "$str_st_location" >> output_st.txt
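The byte values used above follow from the two-byte UTF-8 layout (110xxxxx 10xxxxxx), which covers U+0080 to U+07FF. A minimal sketch of that conversion in shell arithmetic, in case you need other ranges; utf8_two_bytes is a hypothetical name, and it deliberately handles only the two-byte case:

```shell
# utf8_two_bytes is a hypothetical helper for illustration only.
# It encodes one code point in U+0080..U+07FF as its two UTF-8 bytes.
utf8_two_bytes() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ] || [ "$cp" -gt 2047 ]; then
        echo "code point outside the two-byte range" >&2
        return 1
    fi
    # Lead byte: 110xxxxx carries the top 5 bits.
    # Continuation byte: 10xxxxxx carries the low 6 bits.
    printf '%02X %02X\n' $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
}

utf8_two_bytes 0xB9   # C2 B9
utf8_two_bytes 0xBF   # C2 BF
```

Every code point from U+00B9 to U+00BF shares the lead byte C2, which is why a single \xC2 followed by the byte class [\xB9-\xBF] is enough here. A range that crosses a lead-byte boundary would need more than one alternative.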

Upvotes: 3
