Reputation: 1
text=u’<a href="#5" accesskey="5"></a><a href="#1" accesskey="1"><font color="#667755">\ue689</font></a><a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a><a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>‘
I am a python new hand. I wanna get \ue6ec、\ue6f6、\ue6ec,how to fetch these string use re module. Thank you very much!
Upvotes: 0
Views: 134
Reputation: 4883
If you know that the page will always have that format, use BeautifulSoup parser to find what you need in HTML.
However, sometimes BeautifulSoup may break due to malformed HTML. I'd suggest you to use lxml which is python binding of libxml2. It will parse and usually correct the malformed HTML.
Upvotes: 0
Reputation: 39628
>>> from BeautifulSoup import BeautifulSoup
>>> text=u'<a href="#5" accesskey="5"></a><a href="#1" accesskey="1"><font color="#667755">\ue689</font></a><a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a><a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
>>> t = BeautifulSoup(text)
>>> t.findAll(text=True)
[u'\ue689', u'\ue6ec', u'\ue6f6']
Upvotes: 2
Reputation: 131817
Don't use regular expressions to parse HTML. Use BeautifulSoup. Documentation for BeautifulSoup.
Upvotes: 1