Reputation: 795
While web scraping, after stripping all HTML tags, I was left with the black telephone character \u260e (☎) in my unicode output. Unlike in this response, I do want to get rid of it as well.
I used the following regular expression in Scrapy to eliminate HTML tags:
pattern = re.compile("<.*?>| |&",re.DOTALL|re.M)
Then I tried to match \u260e, and I think I got caught by the backslash plague. I tried these patterns, without success:
pattern = re.compile("<.*?>| |&|\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>| |&|\\u260e",re.DOTALL|re.M)
pattern = re.compile("<.*?>| |&|\\\\u260e",re.DOTALL|re.M)
None of these worked, and I still have \u260e in the output. How can I make it disappear?
Upvotes: 9
Views: 1445
Reputation: 14778
Using Python 2.7.3, the following works fine for me:
import re
pattern = re.compile(u"<.*?>| |&|\u260e",re.DOTALL|re.M)
s = u"bla ble \u260e blo"
re.sub(pattern, "", s)
Output:
u'bla ble blo'
As pointed out by @Zack, this works because the pattern is now a unicode string: the escape sequence \u260e is interpreted when the literal is parsed, so the pattern contains the single character ☎ itself, that little black phone (:
Once both the string to be searched and the regular expression contain the black phone itself, and not the six-character sequence \u260e, they match.
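To make the difference concrete, here is a minimal sketch. It assumes the scraped text is already a unicode string; the names phone, literal and strip_phone_and_tags are made up for illustration and are not part of Scrapy or of the question's code.

```python
import re

# One code point: the escape is interpreted inside a unicode literal.
phone = u"\u260e"
# Six characters: backslash, u, 2, 6, 0, e (the backslash is escaped).
literal = u"\\u260e"

assert len(phone) == 1
assert len(literal) == 6

# Simplified pattern: HTML tags plus the black telephone character itself.
pattern = re.compile(u"<.*?>|\u260e", re.DOTALL)

def strip_phone_and_tags(text):
    """Remove HTML tags and the telephone character from a unicode string."""
    return pattern.sub(u"", text)

cleaned = strip_phone_and_tags(u"<b>call</b> \u260e now")
print(repr(cleaned))
```

The spaces on either side of the removed character are kept, so the result contains two consecutive spaces. Text that holds the six-character sequence rather than the character itself is left untouched by this pattern.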
Upvotes: 7
Reputation: 10260
If your string is already unicode, there are two easy ways. The second one will affect more than just the ☎, obviously.
>>> import string
>>> foo = u"Lorum ☎ Ipsum"
>>> foo.replace(u'☎', '')
u'Lorum  Ipsum'
>>> "".join(s for s in foo if s in string.printable)
u'Lorum  Ipsum'
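As a rough illustration of how the two approaches differ, the sketch below wraps each in a function and runs both on a sample that also contains an accented letter. The function names and the sample string are invented for this example.

```python
import string

def remove_phone(text):
    """Drop only the black telephone character (U+260E)."""
    return text.replace(u"\u260e", u"")

def keep_ascii_printable(text):
    """Keep only characters listed in string.printable, which is ASCII-only."""
    return u"".join(ch for ch in text if ch in string.printable)

# Sample with an accented letter (U+00E9) and the telephone character.
sample = u"caf\u00e9 \u260e 42"

only_phone_removed = remove_phone(sample)
ascii_only = keep_ascii_printable(sample)

print(repr(only_phone_removed))
print(repr(ascii_only))
```

The first function leaves the accented letter in place, while the second drops it together with the ☎, so the string.printable filter is probably only suitable when the text is expected to be plain ASCII.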
Upvotes: 4