Jеdd

Reputation: 71

How do I get rid of U+200B (Unicode zero width space) in my code?

I have this piece of Python code:

# Subroutine to calculate VAT​
def VAT(Total):​
    return Total * 0.05 ​

# Main program​
Total = 100.12​
ValueAddedTax = VAT(Total)​
ToPay = Total + ValueAddedTax​
print("Total £{:.2f} VAT £{:.2f} To pay £{:.2f}".format(Total, ValueAddedTax, ToPay))

When running this, I get:

    def VAT(Total):​
                   ^
SyntaxError: invalid character in identifier

The reason is that the code contains U+200B ZERO WIDTH SPACE (UTF-8 encoding: E2 80 8B), as seen in the output of hexdump -C:

00000000  23 20 53 75 62 72 6f 75  74 69 6e 65 20 74 6f 20  |# Subroutine to |
00000010  63 61 6c 63 75 6c 61 74  65 20 56 41 54 e2 80 8b  |calculate VAT...|
00000020  0a 64 65 66 20 56 41 54  28 54 6f 74 61 6c 29 3a  |.def VAT(Total):|
00000030  e2 80 8b 0a 20 20 20 20  72 65 74 75 72 6e 20 54  |....    return T|
00000040  6f 74 61 6c 20 2a 20 30  2e 30 35 20 e2 80 8b 0a  |otal * 0.05 ....|
00000050  0a 23 20 4d 61 69 6e 20  70 72 6f 67 72 61 6d e2  |.# Main program.|
00000060  80 8b 0a 54 6f 74 61 6c  20 3d 20 31 30 30 2e 31  |...Total = 100.1|
00000070  32 e2 80 8b 0a 56 61 6c  75 65 41 64 64 65 64 54  |2....ValueAddedT|
00000080  61 78 20 3d 20 56 41 54  28 54 6f 74 61 6c 29 e2  |ax = VAT(Total).|
00000090  80 8b 0a 54 6f 50 61 79  20 3d 20 54 6f 74 61 6c  |...ToPay = Total|
000000a0  20 2b 20 56 61 6c 75 65  41 64 64 65 64 54 61 78  | + ValueAddedTax|
000000b0  e2 80 8b 0a 70 72 69 6e  74 28 22 54 6f 74 61 6c  |....print("Total|
000000c0  20 c2 a3 7b 3a 2e 32 66  7d 20 56 41 54 20 c2 a3  | ..{:.2f} VAT ..|
000000d0  7b 3a 2e 32 66 7d 20 54  6f 20 70 61 79 20 c2 a3  |{:.2f} To pay ..|
000000e0  7b 3a 2e 32 66 7d 22 2e  66 6f 72 6d 61 74 28 54  |{:.2f}".format(T|
000000f0  6f 74 61 6c 2c 20 56 61  6c 75 65 41 64 64 65 64  |otal, ValueAdded|
00000100  54 61 78 2c 20 54 6f 50  61 79 29 29 0a           |Tax, ToPay)).|
0000010d

How can I remove all of the zero width spaces from the file?

Upvotes: 7

Views: 22328

Answers (1)

mkrieger1

Reputation: 23235

You can get rid of those characters by replacing them with an empty string using sed (the \xHH byte escapes are a GNU sed extension):

$ sed 's/\xe2\x80\x8b//g' INPUTFILE >OUTPUTFILE

or, modifying the file in-place (with GNU sed; BSD/macOS sed needs -i '' instead of a bare -i):

$ sed -i 's/\xe2\x80\x8b//g' INPUTFILE
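
If GNU sed is not at hand, the same replacement can be sketched in Python. This is only a minimal sketch, assuming the file is UTF-8 encoded; the names strip_zero_width_spaces and clean_file are made up for illustration and are not part of any library.

```python
from pathlib import Path

# U+200B ZERO WIDTH SPACE (UTF-8 bytes: E2 80 8B), written as an escape
# so that this script does not contain the invisible character itself.
ZWSP = "\u200b"


def strip_zero_width_spaces(text: str) -> str:
    """Return text with every U+200B character removed."""
    return text.replace(ZWSP, "")


def clean_file(path: str) -> int:
    """Rewrite the file in place without U+200B; return how many were removed.

    Assumption: the file is UTF-8 encoded (hypothetical helper, not a library API).
    """
    source = Path(path)
    original = source.read_text(encoding="utf-8")
    cleaned = strip_zero_width_spaces(original)
    if cleaned != original:
        # Only touch the file when something actually changed.
        source.write_text(cleaned, encoding="utf-8")
    # Each removed ZWSP is exactly one code point, so the length
    # difference equals the number of characters removed.
    return len(original) - len(cleaned)
```

Calling clean_file("INPUTFILE") would then play the role of the in-place sed command above and report how many zero width spaces were dropped.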

Upvotes: 9
