Mark K

Reputation: 9348

Python BeautifulSoup to get content from parent/sibling relationship

Part of an HTML page is structured as shown below. I want to extract the job “title” and “time” from it. I can get each of them separately, like this:

from bs4 import BeautifulSoup


pages = '<div class="content"> \
                <a href="Org"> \
                        <h3 class="title"> \
                            Dep. Manager</h3> \
                        </a> \
                <div class="contributor"></div> \
                <p>John</p> \
                <time class="time"> \
                        <span class="timestamp">May 02 2016</span> \
                    </time> \
                </div>'


soup = BeautifulSoup(pages, "lxml")

soup.prettify()

s = soup.find_all(class_="title")[0]
t = soup.find_all('span', class_="timestamp")[0].text.strip()

pp_title = s.text.strip()

print(pp_title)
print(t)

This returns what I want:

Dep. Manager
May 02 2016

For the "time", I would like another way to get it, since the “time” always comes after the “title”. I tried the line below to get the “time”, but it doesn’t work:

print (s.parent.next_sibling.next_sibling)

What’s the right way to get the “time” through its relationship to the “title”? Thank you.

Upvotes: 1

Views: 1202

Answers (4)

QHarr

Reputation: 84465

Select the shared parent, then grab the children by class. This assumes the parent always contains both. If required, you can change the selector to ensure that it has both.

from bs4 import BeautifulSoup as bs

html = '''
<div class="content"> \
    <a href="Org"> \
                        <h3 class="title"> \
                            Dep. Manager</h3> \
                        </a> \
    <div class="contributor"></div> \
    <p>John</p> \
    <time class="time"> \
        <span class="timestamp">May 02 2016</span> \
    </time> \
</div>
'''
soup = bs(html, 'lxml')
items = [i.text.strip() for i in soup.select('.content:has(.title) .title, .content:has(.title) .timestamp')]
print(items)
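
As a rough sketch of the "ensure it has both" variant: the selector below keeps only .content blocks that contain both a .title and a .timestamp, then reads the pair from each block. The second .content block and its "No date here" title are made up for illustration, and html.parser is used only to avoid the lxml dependency. This assumes a BeautifulSoup version with :has() support in select (4.7 or later).

```python
from bs4 import BeautifulSoup

# The second block is a hypothetical entry with no timestamp.
html = """
<div class="content">
    <a href="Org"><h3 class="title">Dep. Manager</h3></a>
    <p>John</p>
    <time class="time"><span class="timestamp">May 02 2016</span></time>
</div>
<div class="content">
    <a href="Org"><h3 class="title">No date here</h3></a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = []
# Only blocks containing both classes match, so select_one never returns None.
for block in soup.select(".content:has(.title):has(.timestamp)"):
    title = block.select_one(".title").get_text(strip=True)
    stamp = block.select_one(".timestamp").get_text(strip=True)
    pairs.append((title, stamp))

print(pairs)
```

Reading the pair per block keeps each title attached to its own timestamp, which a single flat list of texts does not guarantee when some blocks are incomplete.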

Upvotes: 1

Ajax1234

Reputation: 71451

You can use soup.find_all with re:

import re
from bs4 import BeautifulSoup as soup
result = [i.get_text(strip=True) for i in soup(pages, 'html.parser').find_all(re.compile('h3|span'), {'class':re.compile('title|timestamp')})]

Output:

['Dep. Manager', 'May 02 2016']

Upvotes: 1

Tekno

Reputation: 314

I don't know whether the issue lies in the string you are providing or somewhere else, but every other call to next_sibling returns u' '. So I tried this:

s.parent.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.findChildren()[0]

I know it's long, but it gets the job done.
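
The whitespace results are most likely the text nodes that sit between the tags, since .next_sibling returns any node, not only tags. A shorter route, sketched here under that assumption, is find_next_sibling, which skips text nodes, or find_next, which searches forward in document order. The variable names are illustrative and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

html = """
<div class="content">
    <a href="Org">
        <h3 class="title">
            Dep. Manager</h3>
    </a>
    <div class="contributor"></div>
    <p>John</p>
    <time class="time">
        <span class="timestamp">May 02 2016</span>
    </time>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find(class_="title")

# Plain .next_sibling lands on the whitespace between </a> and the next tag.
whitespace_node = title.parent.next_sibling

# find_next_sibling() looks only at tags, so it goes straight to <time>.
time_tag = title.parent.find_next_sibling("time")
timestamp = time_tag.get_text(strip=True)

# find_next() walks forward from the title regardless of nesting.
timestamp_alt = title.find_next("span", class_="timestamp").get_text(strip=True)

print(title.get_text(strip=True), timestamp, timestamp_alt)
```

Neither call depends on how many elements sit between the title and the time, so they keep working if the markup gains or loses a sibling.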

Upvotes: 1

Sers

Reputation: 12255

You can use findParent and specify the details of the parent to match:

t = s.findParent("div", class_='content').find('span', class_='timestamp').text.strip()

Example:

titles = soup.find_all(class_="title")
for title in titles:
    timestamp = title.findParent("div", class_='content').find('span', class_='timestamp').text.strip()
    print(title.text.strip(), timestamp)

Upvotes: 2
