How to split a UTF-8 string by an escape sequence provided as command line argument in Python 3?

Question

I'm trying to separate UTF-8 strings by a delimiter provided as a command line argument in Python 3. The TAB character "\t" should be a valid option. Unfortunately, I haven't found any way to have an escape sequence interpreted as such. I wrote a little test script called "test.py":

# coding: utf8
import sys

print(sys.argv[1])

l1 = u"12345\tktktktk".split(sys.argv[1])
print(l1)

l2 = u"633\tbgt".split(sys.argv[1])
print(l2)

I tried to run that script as follows (inside a guake shell on a kubuntu linux host):

python3 test.py
python3 test.py
python3 test.py '\t'
python3 test.py "\t"

None of these attempts worked. I also tried this with a larger file containing "real" (and unfortunately confidential) data, where for some strange reason many (but by far not all) lines were split correctly when using the first call.

What is the correct way to make Python 3 interpret a command line argument as an escape sequence rather than as a literal string?

Padraic Cunningham · Accepted Answer

You can use $'...' quoting in the shell:

python3 test.py $'\t'

ANSI-C Quoting (Bash Reference Manual):

Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard. Backslash escape sequences, if present, are decoded as follows:

\a
alert (bell)

\b
backspace

\e
\E
an escape character (not ANSI C)

\f
form feed

\n
newline

\r
carriage return

\t
horizontal tab <-
............

Output:

$ python3 test.py $'\t'
    
['12345', 'ktktktk']
['633', 'bgt']
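
Plain '\t' or "\t" fails because the shell hands Python the two characters backslash and t, and Python does not reinterpret them. Here is a minimal sketch to check this, assuming Python 3.7+ (for capture_output). The helper name argv_seen_by_child is made up for illustration. It bypasses the shell and passes each string straight to a child interpreter:

```python
import subprocess
import sys


def argv_seen_by_child(arg):
    """Return repr(sys.argv[1]) as reported by a child Python process."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(repr(sys.argv[1]))", arg],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# What the shell passes for '\t' or "\t": two characters, backslash and t.
print(argv_seen_by_child("\\t"))
# What the shell passes for $'\t': a single TAB character.
print(argv_seen_by_child("\t"))
```

Printing repr() instead of the raw argument makes the difference visible, since a real TAB and a blank look the same on the terminal.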

From wiki.bash-hackers:

This is especially useful when you want to give special characters as arguments to some programs, like giving a newline to sed.

The resulting text is treated as if it was single-quoted. No further expansions happen.

The $'...' syntax comes from ksh93, but is portable to most modern shells including pdksh. A specification for it was accepted for SUS issue 7. There are still some stragglers, such as most ash variants including dash, (except busybox built with "bash compatibility" features).

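Or, using Python, decode the escape sequence inside the script:

import sys

# Interpret backslash escapes in the argument, so the two characters \t become a TAB.
arg = bytes(sys.argv[1], "utf-8").decode("unicode_escape")

print(arg)

l1 = u"12345\tktktktk".split(arg)
print(l1)

l2 = u"633\tbgt".split(arg)
print(l2)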

Output:

$ python3 test.py '\t'
    
['12345', 'ktktktk']
['633', 'bgt']
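
One caveat with the Python approach: the unicode_escape codec reads bytes as Latin-1, so a delimiter containing non-ASCII characters may come out garbled after a UTF-8 round trip. A possible workaround, sketched below, is to encode with latin-1 and the backslashreplace error handler before decoding. The function name decode_delimiter is hypothetical, and the fixed sample strings stand in for sys.argv[1]:

```python
def decode_delimiter(raw):
    """Turn backslash escapes typed on the command line (for example the
    two characters backslash + t) into the characters they name.

    Encoding with latin-1 + backslashreplace keeps non-ASCII characters
    intact: Latin-1 characters become single bytes, and anything else is
    rewritten as a \\uXXXX escape that unicode_escape decodes again.
    """
    return raw.encode("latin-1", "backslashreplace").decode("unicode_escape")


# ASCII-only escape: backslash + t becomes a TAB.
print("12345\tktktktk".split(decode_delimiter(r"\t")))

# Delimiter mixing a non-ASCII character with an escape sequence.
print("a\u00e9\tb".split(decode_delimiter("\u00e9\\t")))
```

Malformed input such as a lone trailing backslash still raises UnicodeDecodeError, so a real script would probably want to catch that and report a usage error.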
