Reputation: 105
I am trying to grep for the hexadecimal value of a range of UTF-8 encoded characters and I only want just that specific range of characters to be returned. I currently have this:
grep -P -n "[\xB9-\xBF]" $str_st_location >> output_st.txt
But this returns every character that has any of those hex values in it hex representation i.e it returns 00B9 - FFB9 as long as the B9 is present.
Is there a way I can specify using grep that I only want the exact/specific hex value range I search for?
Sample Input:
STRING_OPEN
Open
æ–å¼€
Ouvert
Abierto
Открыто
Abrir
Now using my grep statement, it should return the 3rd line and 6th line, but it also includes some text in my file that are Russian and Chinese because the range for languages include the hex values I'm searching for like these:
断开
Открыто
I can't give out more sample input unfortunately as it's work related.
EDIT: Actually the below code snippet worked!
grep -P -n "[\x{00B9}-\x{00BF}]" $str_st_location > output_st.txt
It found all the corrupted characters and there were no false positives. The only issue now is that the lines with the corrupted characters automatically gets "uncorrupted" i.e when I open the file, grep's output is the corrected version of the corrupted characters. For example, it finds æ–å¼€ and in the text file, it's show as 断开.
Upvotes: 4
Views: 6214
Reputation: 60843
Since you're using -P
, you're probably using GNU grep, because that is a GNU grep extension. Your command works using GNU grep 2.21 with pcre 8.37 and a UTF-8 locale, however there have been bugs in the past with multi-byte characters and character ranges. You're probably using an older version, or it is possible that your locale is set to one that uses single-byte characters.
If you don't want to upgrade, it is possible to match this character range by matching individual bytes, which should work in older versions. You would need to convert the characters to bytes and search for the byte values. Assuming UTF-8, U+00B9 is C2 B9 and U+00BF is C2 BF. Setting LC_CTYPE
to something that uses single-byte characters (like C
) will ensure that it will match individual bytes even in versions that correctly support multi-byte characters.
LC_CTYPE=C grep -P -n "\xC2[\xB9-\xBF]" $str_st_location >> output_st.txt
Upvotes: 3