Reputation: 9348
Part of an HTML page is structured as below. I want to get the job "title" and "time" from it. I can get them separately, like this:
from bs4 import BeautifulSoup
pages = '<div class="content"> \
<a href="Org"> \
<h3 class="title"> \
Dep. Manager</h3> \
</a> \
<div class="contributor"></div> \
<p>John</p> \
<time class="time"> \
<span class="timestamp">May 02 2016</span> \
</time> \
</div>'
soup = BeautifulSoup(pages, "lxml")
soup.prettify()
s = soup.find_all(class_ = "title")[0]
t = soup.find_all('span', class_ = "timestamp")[0].text.strip()
pp_title = s.text.strip()
print(pp_title)
print(t)
It returns what I want:
Dep. Manager
May 02 2016
For the "time", I want another way to get it, because the "time" always comes after the "title". I tried this line to get the "time", but it doesn't work:
print (s.parent.next_sibling.next_sibling)
What's the right way to get the "time" through its relationship to the "title"? Thank you.
Upvotes: 1
Views: 1202
Reputation: 84465
Select the shared parent, then grab the children by class. This assumes the parent always contains both elements; you can tighten the selector to require both if needed.
from bs4 import BeautifulSoup as bs
html = '''
<div class="content"> \
<a href="Org"> \
<h3 class="title"> \
Dep. Manager</h3> \
</a> \
<div class="contributor"></div> \
<p>John</p> \
<time class="time"> \
<span class="timestamp">May 02 2016</span> \
</time> \
</div>
'''
soup = bs(html, 'lxml')
items = [i.text.strip() for i in soup.select('.content:has(.title) .title, .content:has(.title) .timestamp')]
print(items)
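When a page contains several such blocks, a flat list loses the pairing between each title and its time. A minimal sketch of keeping them together, looping over each `.content` container and selecting inside it. The second block ("Analyst", "Jun 10 2016") is made-up sample data, and `html.parser` is used only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

# Two blocks; the second one is hypothetical data added for illustration.
html = '''
<div class="content">
  <a href="Org"><h3 class="title">Dep. Manager</h3></a>
  <time class="time"><span class="timestamp">May 02 2016</span></time>
</div>
<div class="content">
  <a href="Org"><h3 class="title">Analyst</h3></a>
  <time class="time"><span class="timestamp">Jun 10 2016</span></time>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

pairs = []
for block in soup.select(".content"):
    # select_one returns None when nothing matches, so check before use.
    title = block.select_one(".title")
    stamp = block.select_one(".timestamp")
    if title is None or stamp is None:
        continue
    pairs.append((title.get_text(strip=True), stamp.get_text(strip=True)))

print(pairs)
```

Each tuple holds the title and timestamp taken from the same container, so a block missing either element is skipped rather than shifting the pairing.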
Upvotes: 1
Reputation: 71451
You can use soup.find_all with re:
import re
from bs4 import BeautifulSoup as soup
result = [i.get_text(strip=True) for i in soup(pages, 'html.parser').find_all(re.compile('h3|span'), {'class':re.compile('title|timestamp')})]
Output:
['Dep. Manager', 'May 02 2016']
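One thing to be aware of: Beautiful Soup applies a compiled pattern with `search()`, so `re.compile('title|timestamp')` would also match a class such as `subtitle`. If exact names are wanted, passing lists to `find_all` is a possible alternative. A small sketch, assuming the same markup as in the question (rebuilt here so the snippet runs on its own):

```python
from bs4 import BeautifulSoup

pages = (
    '<div class="content"> '
    '<a href="Org"> <h3 class="title"> Dep. Manager</h3> </a> '
    '<div class="contributor"></div> '
    '<p>John</p> '
    '<time class="time"> <span class="timestamp">May 02 2016</span> </time> '
    '</div>'
)

soup = BeautifulSoup(pages, "html.parser")

# A list matches any of the exact tag names / class names it contains.
result = [
    tag.get_text(strip=True)
    for tag in soup.find_all(["h3", "span"], class_=["title", "timestamp"])
]
print(result)
```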
Upvotes: 1
Reputation: 314
I don't know whether the issue lies in the string you are providing or somewhere else, but every other call to next_sibling returns u' ' (a whitespace-only text node). So I tried this:
s.parent.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.findChildren()[0]
I know it's long, but it gets the job done.
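The whitespace strings appear because `.next_sibling` steps through every node, text included. A shorter route, sketched here on the question's markup, is `find_next_sibling()`, which only considers tags and can filter by name, or `find_next()`, which searches everything after the title in document order. The variable names are just for illustration:

```python
from bs4 import BeautifulSoup

html = (
    '<div class="content"> '
    '<a href="Org"> <h3 class="title"> Dep. Manager</h3> </a> '
    '<div class="contributor"></div> '
    '<p>John</p> '
    '<time class="time"> <span class="timestamp">May 02 2016</span> </time> '
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
title = soup.find(class_="title")

# .next_sibling returns the very next node, here a whitespace-only string.
first_sibling = title.parent.next_sibling

# find_next_sibling() skips text nodes and jumps to the first <time> sibling.
time_tag = title.parent.find_next_sibling("time")
timestamp = time_tag.find("span", class_="timestamp").get_text(strip=True)

# find_next() looks at all later elements, not only siblings of the <a>.
timestamp_via_find_next = title.find_next(
    "span", class_="timestamp"
).get_text(strip=True)

print(repr(first_sibling))
print(timestamp)
print(timestamp_via_find_next)
```

Both lookups keep working if extra elements are inserted between the title and the time, which the fixed chain of `next_sibling` calls would not.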
Upvotes: 1
Reputation: 12255
You can use findParent to go up to the shared container, then search down from it for the timestamp:
t = s.findParent("div", class_='content').find('span', class_='timestamp').text.strip()
Example:
titles = soup.find_all(class_="title")
for title in titles:
    timestamp = title.findParent("div", class_='content').find('span', class_='timestamp').text.strip()
    print(title.text.strip(), timestamp)
Upvotes: 2