user1717828

Reputation: 7223

Convert text to bytes from Bash shell?

How can a text string be turned into UTF-8 encoded bytes using Bash and/or common Linux command line utilities? For example, in Python one would do:

"Six of one, ½ dozen of the other".encode('utf-8')
b'Six of one, \xc2\xbd dozen of the other'

Is there a way to do this in pure Bash:

STR="Six of one, ½ dozen of the other"
<utility_or_bash_command_here> --encoding='utf-8' "$STR"
'Six of one, \xc2\xbd dozen of the other'

Upvotes: 3

Views: 6839

Answers (3)

Louis Maddox

Reputation: 5576

I adapted Machinexa's nice answer a little for my needs:

  • encoding="utf-8" is the default for str.encode(), so there is no need to pass it
  • it is more concise to import sys and use sys.stdin directly
  • I want a set of unique values here, not a list or a concatenated bytestring
alias encode='python3 -c "import sys; enc = sys.stdin.read().encode(); print(set(enc.splitlines()))"'

So then I can get a set without repetition:

printf "hell0\x0\nworld\n:-)\x0:-(\n" | \
  grep -a "[[:cntrl:]]" -o | \
  perl -pe 's/([^\x00-\x7f])/"\\x" . sprintf "%x", ord $1/ge' | \
  encode

{b'\x00'}

If you also want to drop the Python bytes repr b'' and the backslash:

alias encode='python3 -c "from sys import stdin; encoded = stdin.read().encode(\"utf-8\"); s = set(encoded.splitlines()); print({repr(char)[3:-1] for char in s})"'

For the previous command this gives {'x00'} instead.
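
For comparison, the same "unique bytes" idea can be sketched without Python. This is only a rough sketch, assuming od, tr, grep and sort are available; unique_bytes is a made-up helper name, and the output is plain two-digit hex rather than Python's repr.

```shell
# unique_bytes is a hypothetical helper name, not part of the aliases above.
unique_bytes() {
  # od -An -v -tx1 prints every input byte as two hex digits
  od -An -v -tx1 |
    tr -s ' ' '\n' |  # put one hex byte on each line
    grep . |          # drop the empty line left by od's leading space
    sort -u           # keep each distinct byte once
}

# Two NUL bytes and two newlines go in; each distinct byte comes out once
printf '\0\n\0\n' | unique_bytes
# prints:
# 00
# 0a
```

Because od works on raw bytes, this does not depend on the locale.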

Upvotes: 0

Machinexa

Reputation: 599

Python to the rescue!

alias encode='python3 -c "from sys import stdin; print(stdin.read().encode(\"utf-8\"))"'
root@kali-linux:~# echo "½ " | encode
b'\xc2\xbd \n'

You can also strip the b'' wrapper with sed or awk if you want.
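
One possible way to strip the wrapper, as a minimal sketch: strip_bytes_repr is a hypothetical helper name, and it assumes the repr is on a single line. Python prints b"..." instead of b'...' when the bytes contain a single quote, so the pattern accepts either quote character.

```shell
# strip_bytes_repr is a hypothetical helper name, not part of the alias above.
strip_bytes_repr() {
  # Drop a leading b' or b" and the matching closing quote at the end of the line
  sed 's/^b\(["'\'']\)\(.*\)\1$/\2/'
}

# Feed it the literal text that the encode alias printed above
printf '%s\n' "b'\\xc2\\xbd \\n'" | strip_bytes_repr
# prints: \xc2\xbd \n
```

With the alias defined, it would be used as: echo "½ " | encode | strip_bytes_repr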

Upvotes: 4

choroba

Reputation: 242373

Perl to the rescue!

echo "$STR" | perl -pe 's/([^\x00-\x7f])/"\\x" . sprintf "%x", ord $1/ge'

The /e modifier lets you put code in the replacement part of the s/// substitution; here that code takes the ord of each matched non-ASCII byte and formats it as hex via sprintf.
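
If Perl is not available, roughly the same transformation can be sketched with Bash plus od. This is a sketch under assumptions: utf8_escape is a made-up function name, the string is assumed to already be UTF-8 in the shell, and unlike Python's repr it writes every non-printable byte (including newline) as \xNN and leaves backslashes alone.

```shell
# utf8_escape is a hypothetical function name used only for illustration.
utf8_escape() {
  local byte out=
  # od prints each byte of the string as two hex digits, regardless of locale
  for byte in $(printf '%s' "$1" | od -An -v -tx1); do
    if (( 0x$byte >= 0x20 && 0x$byte < 0x7f )); then
      # printable ASCII: append the character itself
      out+=$(printf "\\x$byte")
    else
      # anything else: append a literal \xNN escape
      out+="\\x$byte"
    fi
  done
  printf '%s\n' "$out"
}

# Assumes this script or terminal is UTF-8, so ½ is stored as bytes c2 bd
STR="Six of one, ½ dozen of the other"
utf8_escape "$STR"
# prints: Six of one, \xc2\xbd dozen of the other
```

The loop calls printf once per printable byte, so it is fine for short strings but slow for large inputs; the Perl one-liner is the better choice there.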

Upvotes: 3
