Reputation: 11
I am new to Python and I am having trouble getting the text between the tags. Here is the HTML of the full table:
<div id="menu">
<h4 style="display:none">Horse Photo</h4>
<ul style="margin-top:5px;border-radius:6px">
<li style="padding:0">
<img src="/images/unknown_horse.png" style="width:298px;margin-bottom:-3px;border-radius:5px;">
</li>
</ul>
<h4>Horse Profile</h4>
<ul>
<li>Age<span>3yo</span></li>
<li>Foaled<span>17/11/2014</span></li>
<li>Country<span>New Zealand</span></li>
<li>Location<span>Kembla Grange</span></li>
<li>Sex<span>Filly</span></li>
<li>Colour<span>Grey</span></li>
<li>Sire<span>Mastercraftsman</span></li>
<li>Dam<span>In Essence</span></li>
<li>Trainer
<span>
<a href="/trainer/26970-r-l-price/">R & L Price</a>
</span>
</li>
<li>Earnings<span>$19,795</span></li>
</ul>
<h4>Owners</h4>
<ul>
<li style="font:normal 12px 'Tahoma">Bell View Park Stud (Mgr: A P Mackrell)</li>
</ul>
</div>
Upvotes: 1
Views: 357
Reputation: 195408
For parsing HTML, use the BeautifulSoup package (bs4). It lets you select elements of your HTML document with ease. To print the text inside all <span> tags, you can use this example:
data = """
<div id="menu">
<h4 style="display:none">Horse Photo</h4>
<ul style="margin-top:5px;border-radius:6px">
<li style="padding:0">
<img src="/images/unknown_horse.png" style="width:298px;margin-bottom:-3px;border-radius:5px;">
</li>
</ul>
<h4>Horse Profile</h4>
<ul>
<li>Age<span>3yo</span></li>
<li>Foaled<span>17/11/2014</span></li>
<li>Country<span>New Zealand</span></li>
<li>Location<span>Kembla Grange</span></li>
<li>Sex<span>Filly</span></li>
<li>Colour<span>Grey</span></li>
<li>Sire<span>Mastercraftsman</span></li>
<li>Dam<span>In Essence</span></li>
<li>Trainer
<span>
<a href="/trainer/26970-r-l-price/">R & L Price</a>
</span>
</li>
<li>Earnings<span>$19,795</span></li>
</ul>
<h4>Owners</h4>
<ul>
<li style="font:normal 12px 'Tahoma">Bell View Park Stud (Mgr: A P Mackrell)</li>
</ul>
</div>
"""
from bs4 import BeautifulSoup

# The 'lxml' parser needs the lxml package installed
soup = BeautifulSoup(data, 'lxml')

for span in soup.select('span'):
    text = span.text.strip()
    if not text:
        continue  # skip empty <span> elements
    print(text)
This will print:
3yo
17/11/2014
New Zealand
Kembla Grange
Filly
Grey
Mastercraftsman
In Essence
R & L Price
$19,795
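If you also want to know which label each value belongs to, one option is to build a dictionary from the <li> rows. This is only a sketch: the `profile` name and the shortened HTML are my own, and I use the built-in 'html.parser' here so nothing besides bs4 has to be installed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the profile list from the question (illustration only).
html = """
<div id="menu">
<h4>Horse Profile</h4>
<ul>
<li>Age<span>3yo</span></li>
<li>Sex<span>Filly</span></li>
<li>Trainer
<span>
<a href="/trainer/26970-r-l-price/">R & L Price</a>
</span>
</li>
<li>Earnings<span>$19,795</span></li>
</ul>
<h4>Owners</h4>
<ul>
<li>Bell View Park Stud (Mgr: A P Mackrell)</li>
</ul>
</div>
"""

# 'html.parser' ships with Python, so no extra parser package is required.
soup = BeautifulSoup(html, "html.parser")

profile = {}  # hypothetical name: maps label -> value
for li in soup.select("#menu li"):
    span = li.find("span")
    if span is None:
        continue  # rows without a <span> (e.g. the owners row) have no value
    # The label is the first text node sitting directly inside the <li>.
    label = li.find(string=True, recursive=False).strip()
    profile[label] = span.get_text(strip=True)

print(profile["Age"])       # 3yo
print(profile["Earnings"])  # $19,795
```

Rows that have no <span> are skipped, so the owners entry does not end up in the dictionary.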
Upvotes: 1
Reputation: 3118
There are plenty of options for working with HTML/XML. I prefer the parsel package. You can install it into your environment with the following command:
$ pip install parsel
After that you can use it like this:
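from parsel import Selector

# html is a string holding the markup from the question
sel = Selector(text=html)
# Strip each text node and drop the whitespace-only ones
[t.strip() for t in sel.css('ul li::text').getall() if t.strip()]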
# ['Age',
# 'Foaled',
# 'Country',
# 'Location',
# 'Sex',
# 'Colour',
# 'Sire',
# 'Dam',
# 'Trainer',
# 'Earnings',
# 'Bell View Park Stud (Mgr: A P Mackrell)']
A more detailed description can be found in the parsel documentation.
Upvotes: 0