Reputation: 355
I'm trying to separate UTF-8 strings by a delimiter provided as a command line argument in Python 3. The TAB character "\t" should be a valid option. Unfortunately I didn't find any way to have an escape sequence interpreted as such. I wrote a little test script called "test.py":
# coding: utf8
import sys

print(sys.argv[1])

l1 = u"12345\tktktktk".split(sys.argv[1])
print(l1)

l2 = u"633\tbgt".split(sys.argv[1])
print(l2)
I tried to run that script as follows (inside a Guake terminal on a Kubuntu Linux host):
Neither of these solutions worked. I also tried this with a larger file containing "real" (and unfortunately confidential) data, where for some strange reason the lines were split correctly in many cases, but far from all, when using the first call.
What is the correct way to make Python 3 interpret a command line argument as an escape sequence rather than as a literal string?
Upvotes: 3
Views: 383
Reputation: 180481
You can use the shell's $'...' quoting:
python3 test.py $'\t'
Words of the form $'string' are treated specially. The word expands to string, with backslash-escaped characters replaced as specified by the ANSI C standard. Backslash escape sequences, if present, are decoded as follows:
\a      alert (bell)
\b      backspace
\e, \E  an escape character (not ANSI C)
\f      form feed
\n      newline
\r      carriage return
\t      horizontal tab <-
............
Output:
$ python3 test.py $'\t'
['12345', 'ktktktk']
['633', 'bgt']
This is especially useful when you want to give special characters as arguments to some programs, like giving a newline to sed.
The resulting text is treated as if it was single-quoted. No further expansions happen.
The $'...' syntax comes from ksh93, but is portable to most modern shells including pdksh. A specification for it was accepted for SUS issue 7. There are still some stragglers, such as most ash variants including dash (except busybox built with "bash compatibility" features).
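To check what the shell actually handed to the script, it can help to print the repr() of the argument rather than the argument itself, since a real tab and the two characters backslash plus "t" look very different that way. A minimal sketch (the helper name describe is just made up for illustration):

```python
import sys


def describe(arg):
    """Return a string that makes invisible characters in arg visible."""
    return "{!r} (length {})".format(arg, len(arg))


if __name__ == "__main__":
    # Show every command line argument with its escapes and its length.
    for a in sys.argv[1:]:
        print(describe(a))
```

A real tab shows up as '\t' with length 1, while an unexpanded backslash-t shows up as '\\t' with length 2. Only the first one matches the tab inside the strings being split.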
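Or decode the escape sequence in Python:
import sys

# Turn a literal backslash sequence such as \t into the character it names.
arg = bytes(sys.argv[1], "utf-8").decode("unicode_escape")
print(arg)
l1 = u"12345\tktktktk".split(arg)
print(l1)
l2 = u"633\tbgt".split(arg)
print(l2)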
Output:
$ python3 test.py '\t'
['12345', 'ktktktk']
['633', 'bgt']
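One caveat with the bytes(...).decode("unicode_escape") approach: the unicode_escape codec reads the bytes as Latin-1, so a delimiter that contains non-ASCII characters (for example "é") comes out garbled. A possible workaround, sketched here under the assumption that only Python-style escapes need to be handled, is to encode to Latin-1 with backslashreplace first. The function name decode_escapes is made up for this example:

```python
def decode_escapes(text):
    """Interpret Python-style backslash escapes in text.

    Characters outside Latin-1 are first turned into \\uXXXX escapes by
    the backslashreplace error handler, so unicode_escape can restore
    them instead of mangling them.
    """
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


if __name__ == "__main__":
    import sys

    # Use the first argument as delimiter if one was given, else a literal \t.
    raw = sys.argv[1] if len(sys.argv) > 1 else "\\t"
    delimiter = decode_escapes(raw)
    print(u"12345\tktktktk".split(delimiter))
```

This keeps the call python3 test.py '\t' working and also leaves delimiters such as "é" or "€" intact.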
Upvotes: 4
Reputation: 19050
At least in Bash on Linux, you need to press CTRL+V followed by TAB to type a literal tab character.
Example:
python utfsplit.py '<CTRL+V><TAB>'
Your code otherwise works:
$ python3.4 utfsplit.py ' '
['12345', 'ktktktk']
['633', 'bgt']
NB: Tab characters can't really be displayed here :)
Upvotes: 0