uldics

Reputation: 134

In Bash, how to convert only extended ASCII chars to their hex codes?

I need to check a string variable for extended ASCII characters (single bytes with decimal codes 128-255). If any are present, each one should be replaced with its multi-character hex escape, ready for use in a later grep command and so on.

Example string: "Ørsted\ Salg", which I need converted to "\xD8rsted\ Salg".

I know how to do it with a hash table (associative array) in Bash 4:

declare -A symbolHashTable=(
    ["Ø"]="D8"
)
currSearchTerm="Ørsted\ Salg"
# Replace every occurrence of each listed character with its \xHH escape.
for curRow in "${!symbolHashTable[@]}"; do
    currSearchTerm=$(printf '%s\n' "$currSearchTerm" | sed "s/$curRow/\\\\x${symbolHashTable[$curRow]}/g")
done

That works, but it seems too tedious for all 128 cases. There should be a shorter and probably faster way that does not require listing every symbol.

I can detect whether the string contains any such characters with:

echo "$currSearchTerm" | grep -P '[\x80-\xFF]'

I am almost sure there is a way to make sed do it, but I get lost somewhere in the "replace with" part.

Upvotes: 0

Views: 463

Answers (1)

that other guy

Reputation: 123670

You can easily do this with Perl:

#!/bin/bash
original='Ørsted'
replaced=$(perl -pe 's/([\x80-\xFF])/"\\x".unpack "H*", $1/eg' <<< "$original")

echo "The original variable's hex encoding is:"
od -t x1 <<< "$original"

echo "Therefore I converted $original into $replaced"

Here's the output when the file and terminal are ISO-8859-1:

The original variable's hex encoding is:
0000000 d8 72 73 74 65 64 0a
0000007
Therefore I converted Ørsted into \xd8rsted

Here's the output when the file and terminal are UTF-8:

The original variable's hex encoding is:
0000000 c3 98 72 73 74 65 64 0a
0000010
Therefore I converted Ørsted into \xc3\x98rsted

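In both cases it works as expected: every byte in the range 0x80-0xFF is replaced by its \xHH escape, so the two-byte UTF-8 encoding of Ø yields two escapes.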

Upvotes: 2
