norritt

Reputation: 355

How to split a UTF-8 string by an escape sequence provided as a command line argument in Python 3?

I'm trying to split UTF-8 strings by a delimiter provided as a command line argument in Python 3. The TAB character "\t" should be a valid option. Unfortunately, I haven't found a way to have an escape sequence interpreted as such. I wrote a little test script called "test.py":

  # coding: utf8
  import sys

  print(sys.argv[1])

  l1 = u"12345\tktktktk".split(sys.argv[1])
  print(l1)

  l2 = u"633\tbgt".split(sys.argv[1])
  print(l2)

I tried to run that script as follows (inside a guake shell on a kubuntu linux host):

  1. python3 test.py \t
  2. python3 test.py \t
  3. python3 test.py '\t'
  4. python3 test.py "\t"

None of these invocations worked. I also tried this with a larger file containing "real" (and unfortunately confidential) data, where for some strange reason many (but by far not all) lines were split correctly when using the 1st call.

What is the correct way to make Python 3 interpret a command line argument as an escape sequence rather than as a literal string?

Upvotes: 3

Views: 383

Answers (2)

Padraic Cunningham

Reputation: 180481

You can use Bash's $'...' quoting:

python3 test.py $'\t'

ANSI-C Quoting

Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard. Backslash escape sequences, if present, are decoded as follows:

\a
alert (bell)

\b
backspace

\e
\E
an escape character (not ANSI C)

\f
form feed

\n
newline

\r
carriage return

\t
horizontal tab <-
............

Output:

$ python3 test.py $'\t'
    
['12345', 'ktktktk']
['633', 'bgt']

wiki.bash-hackers

This is especially useful when you want to give special characters as arguments to some programs, like giving a newline to sed.

The resulting text is treated as if it was single-quoted. No further expansions happen.

The $'...' syntax comes from ksh93, but is portable to most modern shells including pdksh. A specification for it was accepted for SUS issue 7. There are still some stragglers, such as most ash variants including dash, (except busybox built with "bash compatibility" features).

Or decode the escape sequence in Python:

import sys

arg = bytes(sys.argv[1], "utf-8").decode("unicode_escape")

print(arg)

l1 = u"12345\tktktktk".split(arg)
print(l1)

l2 = u"633\tbgt".split(arg)
print(l2)

Output:

$ python3 test.py '\t'
    
['12345', 'ktktktk']
['633', 'bgt']
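
One caveat with the unicode_escape round trip above: the codec reads the bytes as Latin-1, so a delimiter that contains non-ASCII characters can come out garbled. The following is a minimal sketch of a variant that keeps such characters intact. The helper name decode_escapes is hypothetical, and the fallback to a literal backslash-t when no argument is given is only there so the script runs on its own.

```python
import sys


def decode_escapes(text):
    """Turn backslash escapes such as \\t into the characters they stand for.

    Encoding to Latin-1 with "backslashreplace" rewrites every character
    outside Latin-1 as a \\uXXXX escape, so the unicode_escape codec can
    restore it instead of misreading UTF-8 bytes as Latin-1.
    """
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


if __name__ == "__main__":
    # Assumption: fall back to the two characters backslash + t if no argument is given.
    raw = sys.argv[1] if len(sys.argv) > 1 else r"\t"
    delimiter = decode_escapes(raw)
    print("12345\tktktktk".split(delimiter))
    print("633\tbgt".split(delimiter))
```

This only matters when the delimiter itself is non-ASCII; for a plain '\t' both versions behave the same.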

Upvotes: 4

James Mills

Reputation: 19050

At least in Bash on Linux, you need to press Ctrl+V followed by Tab to type a literal tab character:

Example:

python utfsplit.py '<Ctrl+V><Tab>'

Your code otherwise works:

$ python3.4 utfsplit.py '       '

['12345', 'ktktktk']
['633', 'bgt']

NB: The tab character can't really be displayed here :)
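
Since a tab is invisible on the terminal, it can help to print the repr() of the argument to check what the shell really passed. This is a small sketch; the function name describe is made up for illustration, and the fallback to a real tab when no argument is given is an assumption so the script runs on its own.

```python
import sys


def describe(delimiter):
    """Return the delimiter in repr() form together with its length."""
    return "{!r} ({} character(s))".format(delimiter, len(delimiter))


if __name__ == "__main__":
    # Assumption: use a real tab if the script is started without arguments.
    raw = sys.argv[1] if len(sys.argv) > 1 else "\t"
    print(describe(raw))
```

A real tab is reported as one character, while a literal backslash followed by t is reported as two.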

Upvotes: 0
