Reputation: 59
I want to extract only the place name from the following HTML, using Python and bs4.
<div class="results-list" id="theaterlist">
<table>
<tr class="trspacer">
<td>
<a href="theater.aspx?id=4000642">
<h2 class="placename">
Hyde Park
<span class="boldelement">
Richmond Avenue 56 ls61bz
</span>
</h2>
</a>
I'm using the following code, but I get the address too.
mydivs = soup.find("div", {"id": "theaterlist"})
lis = mydivs.select("a[href*=theater.aspx]")
for x in lis:
    theater = x.find('h2', class_='placename')
    # .text joins the text of all descendants, so the address in the <span> is included
    print(theater.text)
Any help would be appreciated.
Upvotes: 1
Views: 625
Reputation: 22440
Try this:
for x in soup.select("a[href*=theater.aspx]"):
    theater = x.find('h2', class_='placename')
    print(theater.contents[0].strip())
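As a rough illustration of why this works: .contents holds only the direct children of the tag, so the first entry is the text node in front of the <span>, while .text joins every descendant string. The sketch below assumes a trimmed copy of the question's HTML and uses the built-in html.parser so that lxml is not required; the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: same structure).
html = """
<h2 class="placename">
Hyde Park
<span class="boldelement">
Richmond Avenue 56 ls61bz
</span>
</h2>
"""

soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2", class_="placename")

# .text concatenates all descendant strings, so the address comes along.
full_text = h2.text

# .contents lists direct children only: text node, <span> tag, text node.
first_child = h2.contents[0]
place = first_child.strip()

print(place)
```

Note that this relies on the place name being the very first child of the h2; if the markup ever starts with another tag, contents[0] would no longer be the text you want.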
Upvotes: 0
Reputation: 12015
soup.find("div", {"id": "theaterlist"}).find('h2', class_='placename').text.strip()
# 'Hyde Park\n \n Richmond Avenue 56 ls61bz'
Upvotes: 0
Reputation: 195438
To get only the text of the element itself (not of its child elements), you can use .find(text=True):
data = """
<div class="results-list" id="theaterlist">
<table>
<tr class="trspacer">
<td>
<a href="theater.aspx?id=4000642">
<h2 class="placename">
Hyde Park
<span class="boldelement">
Richmond Avenue 56 ls61bz
</span>
</h2>
</a>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print(soup.find('h2').find(text=True).strip())
Prints:
Hyde Park
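If the page lists several theaters, the same idea can be applied in a loop. This is only a sketch under a few assumptions: the second entry ("Example Hall") is invented for illustration, html.parser is used instead of lxml, and string=True is used as the newer spelling of text=True (available in Beautiful Soup 4.4.0 and later). Passing recursive=False limits the search to the h2's direct children, so text inside the span is never considered.

```python
from bs4 import BeautifulSoup

# First entry mirrors the question; the second is a made-up example.
html = """
<div class="results-list" id="theaterlist">
<a href="theater.aspx?id=4000642"><h2 class="placename">
Hyde Park
<span class="boldelement">Richmond Avenue 56 ls61bz</span>
</h2></a>
<a href="theater.aspx?id=1"><h2 class="placename">
Example Hall
<span class="boldelement">1 Example Street</span>
</h2></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

names = []
for h2 in soup.select("#theaterlist h2.placename"):
    # First text node that is a direct child of the h2 (skips the <span>).
    own_text = h2.find(string=True, recursive=False)
    if own_text is not None:
        names.append(own_text.strip())

print(names)
```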
Upvotes: 3