Reputation: 21
I am trying to use Python to extract certain information from HTML code. For example:
<a href="#tips">Visit the Useful Tips Section</a>
and I would like to get the result: Visit the Useful Tips Section
<div id="menu" style="background-color:#FFD700;height:200px;width:100px;float:left;">
<b>Menu</b><br />
HTML<br />
CSS<br />
and I would like to get: Menu HTML CSS
In other words, I wish to get everything between <> and <>. I am trying to write a Python function that takes the HTML code as a string and then extracts the information from it. I am stuck at string.split('<').
Upvotes: 1
Views: 144
Reputation: 306
I'd use BeautifulSoup - it gets much less cranky with malformed HTML.
Upvotes: 0
Reputation: 49547
You can use the lxml HTML parser.
>>> import lxml.html as lh
>>> st = ''' load your above html content into a string '''
>>> d = lh.fromstring(st)
>>> d.text_content()
'Visit the Useful Tips Section \nand I would like to get result : Visit the Useful Tips Section\n\n\nMenu\nHTML\nCSS\nand I would like to get Menu HTML CSS\n'
Or you can print each non-empty line:
>>> for content in d.text_content().split("\n"):
...     if content:
...         print(content)
...
Visit the Useful Tips Section
and I would like to get result : Visit the Useful Tips Section
Menu
HTML
CSS
and I would like to get Menu HTML CSS
>>>
Upvotes: 1
Reputation: 2185
I understand you are trying to strip out the HTML tags and keep only the text.
You can define a regular expression that represents the tags. Then substitute all matches with the empty string.
Example:
import re

def remove_html_tags(data):
    # Non-greedy match for anything that looks like a tag: <...>
    p = re.compile(r'<.*?>')
    return p.sub('', data)
References:
Docs about Python regular expressions
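As a rough illustration of the substitution approach applied to the second snippet from the question, here is a minimal sketch. The names TAG_RE and strip_tags are made up for this example, and replacing each tag with a space before collapsing whitespace is my own assumption about the desired output, not part of the answer above. A regex like this can also misfire on comments, script blocks, or a ">" inside an attribute value, so treat it as a quick hack for simple input.

```python
import re

# Hypothetical helper names; the pattern matches anything shaped like <...>.
TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(html):
    # Replace each tag with a space so adjacent words do not run together,
    # then collapse all runs of whitespace (including newlines) to one space.
    text = TAG_RE.sub(" ", html)
    return " ".join(text.split())

snippet = '''<div id="menu" style="background-color:#FFD700;height:200px;width:100px;float:left;">
<b>Menu</b><br />
HTML<br />
CSS<br />'''

print(strip_tags(snippet))  # Menu HTML CSS
```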
Upvotes: 0
Reputation: 10170
import re
string = '<a href="#tips">Visit the Useful Tips Section</a>'
re.findall('<[^>]*>(.*)<[^>]*>', string)  # returns ['Visit the Useful Tips Section']
Upvotes: 1
Reputation: 399753
You should use a proper HTML parsing library, such as the HTMLParser module (html.parser in Python 3).
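A minimal sketch of what that could look like with the standard library, assuming Python 3, where the module lives at html.parser. The class name TextExtractor and the function extract_text are invented for this illustration. Joining the collected pieces with a single space is an assumption made to match the output asked for in the question.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        text = data.strip()
        if text:
            self.chunks.append(text)

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)

print(extract_text('<a href="#tips">Visit the Useful Tips Section</a>'))
print(extract_text('<div id="menu"><b>Menu</b><br />HTML<br />CSS<br /></div>'))
```

The first call prints Visit the Useful Tips Section and the second prints Menu HTML CSS.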
Upvotes: 3