Reputation: 630
I was trying to remove "\u3000", "\t", "\n" and "\ax03" in the string below in my Python Jupyter Notebook (Python 3.6).
string = "\u3000\u3000(三)履行服务\n贸易领域\t开放承诺 \ax03"
re.sub("\\[a-z0-9]+", "", string)
However, this does not return what I wanted, although this pattern worked perfectly in Notepad++.
Upvotes: 0
Views: 1365
Reputation: 25023
To enter literal Unicode characters in a program, there are two options: enter the character directly, e.g. "a", or use an escape sequence, e.g. "\u3000". There is extensive information in the Unicode HOWTO in the Python 3 documentation.
What happened when you tried it in Notepad++ is that it used the actual characters you entered without interpreting them further, so when you saw "\u3000" it really was a backslash, a "u", a "3", a "0", a "0", and a final "0".
However, in the Python code it saw the "\u" and thought aha! this is a Unicode character, let me find out what the codepoint is from the next four hexadecimal characters. (3000 hexadecimal = 12288 decimal.)
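As a small illustration of that difference (a sketch, with variable names chosen here for the example), compare an ordinary string literal with a raw string literal, which keeps the backslash and the following characters as typed:

```python
# Ordinary literal: Python interprets \u3000 as a single code point.
escaped = "\u3000"
# Raw literal: the six characters backslash, u, 3, 0, 0, 0 are kept as typed.
raw = r"\u3000"

print(len(escaped), ord(escaped))  # one character, code point 12288
print(len(raw), list(raw))         # six separate characters
```

The raw string is roughly what Notepad++ was working with; the ordinary literal is what Python actually stored in the variable.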
If you echo the string in the interactive interpreter, it is displayed with repr(), which shows non-printable characters such as U+3000 in the \u format. But we can make it show the actual codepoints of the characters by iterating over the string and outputting the ord() value of each character:
>>> string = "\u3000\u3000(三)履行服务\n贸易领域\t开放承诺 \ax03"
>>> string
'\u3000\u3000(三)履行服务\n贸易领域\t开放承诺 \x07x03'
>>> for c in string:
...     print(ord(c))
...
12288
12288
65288
19977
65289
23653
34892
26381
21153
10
36152
26131
39046
22495
9
24320
25918
25215
35834
32
7
120
48
51
(I am not sure what is intended with the "\ax03" part - could it be a typo for "\x03"?)
When you tried
re.sub("\\[a-z0-9]+", "", string)
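it never matched anything. Python turns the string literal "\\[a-z0-9]+" into the regex \[a-z0-9]+, which looks for a literal "[" followed by the text "a-z0-9" and one or more "]". Even read the Notepad++ way, as an actual backslash followed by letters or digits, it would find nothing, because the Python string contains no backslash characters.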
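A quick way to check this, as a sketch using the string from the question (the sample text "x[a-z0-9]]y" is made up here purely to show what the pattern does match):

```python
import re

string = "\u3000\u3000(三)履行服务\n贸易领域\t开放承诺 \ax03"
pattern = "\\[a-z0-9]+"

# What the regex engine actually receives after Python handles the escapes.
print(pattern)

# Nothing in the string matches, so it comes back unchanged.
result = re.sub(pattern, "", string)
print(result == string)

# The pattern matches a literal "[a-z0-9" followed by one or more "]".
demo = re.sub(pattern, "", "x[a-z0-9]]y")
print(demo)
```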
What you need to do is supply the characters you want to remove in the escaped format:
re.sub("[\u3000\t\n\ax03]", "", string)
which returns:
'(三)履行服务贸易领域开放承诺 '
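Note that, because "\ax03" is read as the bell character followed by "x", "0" and "3", the character class above would also strip any ordinary "x", "0" or "3" elsewhere in the text. If "\x03" was what was meant (an assumption, since the question does not say), a sketch of the same idea looks like this, with str.translate shown as an alternative that avoids regular expressions:

```python
import re

# Assumes the last escape in the question was meant to be "\x03".
string = "\u3000\u3000(三)履行服务\n贸易领域\t开放承诺 \x03"

# Character class containing exactly the four characters to remove.
cleaned = re.sub("[\u3000\t\n\x03]", "", string)
print(repr(cleaned))

# Same result without re: map each unwanted character to None.
table = str.maketrans("", "", "\u3000\t\n\x03")
translated = string.translate(table)
print(translated == cleaned)
```

Either way the ordinary space before the final character is kept, since it is not in the set of characters being removed.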
Upvotes: 1