Reputation: 340
I'm new to Python. I have a string like:
s= 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
I want to remove all the Unicode literals from the string, such as:
'\xc3\x82\xc2\xae'
I need output like:
'HDCF FTAE Greater China'
Can anyone help me with this?
Thank you
Upvotes: 3
Views: 8121
Reputation: 177901
On Python 2 (default string type is bytes):
>>> s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
>>> s.decode('ascii',errors='ignore').encode('ascii')
'HDCF FTAE Greater China'
On Python 3 (default string type is Unicode):
>>> s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
>>> s.encode('ascii',errors='ignore').decode('ascii')
'HDCF FTAE Greater China'
Note that the original string is mojibake. Ideally, fix how the string was read in the first place, but you can undo the damage with (Python 3):
>>> s.encode('latin1').decode('utf8').encode('latin1').decode('utf8')
'HDCF® FTAE® Greater China'
The original string was double-encoded as UTF-8. This works by converting the string directly 1:1 back to bytes [1], decoding as UTF-8, then converting directly back to bytes again and decoding as UTF-8 again.
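To make the two passes easier to follow, here is a minimal Python 3 sketch that undoes one layer at a time and shows the intermediate value. The helper name undo_one_layer is made up for illustration and is not part of any library.

```python
# Python 3 sketch: peel off one layer of "UTF-8 bytes read as latin1" per call.
s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'

def undo_one_layer(text):
    # Hypothetical helper. latin1 turns code points 0-255 back into the
    # same byte values, recovering the bytes that were wrongly decoded;
    # those bytes are then decoded as the UTF-8 they really were.
    return text.encode('latin1').decode('utf8')

once = undo_one_layer(s)      # first layer removed
twice = undo_one_layer(once)  # second layer removed

print(repr(once))   # 'HDCFÂ® FTAEÂ® Greater China'
print(repr(twice))  # 'HDCF® FTAE® Greater China'
```

A third call would raise UnicodeDecodeError, because the byte 0xAE on its own is not valid UTF-8. That is a reasonable sign that the text is fully repaired.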
Here's the Python 2 version:
>>> s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
>>> print s.decode('utf8').encode('latin1').decode('utf8')
HDCF® FTAE® Greater China
[1] This works because the latin1 codec is a single-byte encoding whose 256 byte values map directly to the first 256 Unicode code points.
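If you want to check the latin1 claim yourself, this short Python 3 sketch decodes every possible byte value and compares it with the resulting code point. The variable names are only for illustration.

```python
# Python 3 sketch: latin1 maps byte value N to Unicode code point N.
all_bytes = bytes(range(256))
text = all_bytes.decode('latin1')

# Each decoded character has the same number as the byte it came from.
same_numbers = all(ord(ch) == b for ch, b in zip(text, all_bytes))

# Encoding back gives exactly the original bytes, so nothing is lost.
round_trip = text.encode('latin1') == all_bytes

print(same_numbers, round_trip)  # True True
```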
Upvotes: 4
Reputation: 8224
If your goal is to limit the string to ASCII-compatible characters, you can encode it into ASCII and ignore unencodable characters, and then decode it again:
x = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
print(x.encode("ascii", "ignore").decode("utf-8"))
produces HDCF FTAE Greater China.
Check out str.encode() and bytes.decode().
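As a rough illustration of what the second argument to str.encode() does, here is a Python 3 sketch comparing three of the standard error handlers on a shortened version of the string. Which one you want depends on whether you need to see where characters were dropped.

```python
# Python 3 sketch: how different error handlers treat non-ASCII characters.
x = 'HDCF\xc3\x82\xc2\xae FTAE'

ignored = x.encode('ascii', 'ignore').decode('ascii')             # drop them
replaced = x.encode('ascii', 'replace').decode('ascii')           # one '?' each
escaped = x.encode('ascii', 'backslashreplace').decode('ascii')   # keep as escapes

print(ignored)   # HDCF FTAE
print(replaced)  # HDCF???? FTAE
print(escaped)   # HDCF\xc3\x82\xc2\xae FTAE
```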
Upvotes: 3
Reputation: 7844
You can filter your string using string.printable, a constant containing all the printable ASCII characters, to keep only the characters that can be printed:
import string
s= 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
printable = set(string.printable)
s = "".join(filter(lambda c: c in printable, s))
print(s)
Output:
HDCF FTAE Greater China
Reference to this question.
Upvotes: 2
Reputation: 573
Maybe this helps:
s = 'HDCF\xc3\x82\xc2\xae FTAE\xc3\x82\xc2\xae Greater China'
d = ''.join([i for i in s if ord(i) < 127])
print(d)
# OUTPUT as: HDCF FTAE Greater China
Upvotes: 0