Reputation: 134
I need to check a string variable for extended ASCII characters (single bytes with decimal codes 128-255). If any are present, each one should be replaced with its multi-character hex escape, ready for use in a later grep command, etc.
Example: the string "Ørsted\ Salg" needs to be converted to "\xD8rsted\ Salg".
I know how to do it with a hash table (associative array) in Bash 4:
declare -A symbolHashTable=(
["Ø"]="D8"
);
currSearchTerm="Ørsted\ Salg"
for curRow in "${!symbolHashTable[@]}"; do
currSearchTerm=$(printf '%s\n' "$currSearchTerm" | sed "s/$curRow/\\\\x${symbolHashTable[$curRow]}/g")
done
However, that seems too tedious for all 128 cases. There should be a shorter and probably faster way to do it that doesn't require listing every symbol.
I can detect whether the string has any of the characters in it with:
echo $currSearchTerm | grep -P "[\x80-\xFF]"
I am almost sure there is a way to make sed do it, but I get lost somewhere in the "replace with" part.
Upvotes: 0
Views: 463
Reputation: 123670
You can easily do this with Perl:
#!/bin/bash
original='Ørsted'
replaced=$(perl -pe 's/([\x80-\xFF])/"\\x".unpack "H*", $1/eg' <<< "$original")
echo "The original variable's hex encoding is:"
od -t x1 <<< "$original"
echo "Therefore I converted $original into $replaced"
Here's the output when the file and terminal are both ISO-8859-1:
The original variable's hex encoding is:
0000000 d8 72 73 74 65 64 0a
0000007
Therefore I converted Ørsted into \xd8rsted
Here's the output when the file and terminal are both UTF-8:
The original variable's hex encoding is:
0000000 c3 98 72 73 74 65 64 0a
0000010
Therefore I converted Ørsted into \xc3\x98rsted
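In both cases it works as expected: every byte in the range 0x80-0xFF is escaped, so the two-byte UTF-8 encoding of Ø produces two escapes.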
Upvotes: 2