Aman Singh

Reputation: 1241

Python BeautifulSoup extracting the text right after a particular tag

I'm trying to extract information from a webpage using BeautifulSoup and Python. I want to extract the text that comes right after a particular tag. To find the right tag, I would like to compare its text against a known label and then extract the text of the immediately following tag.
For example, suppose the following is part of an HTML page source:

<div class="row">
    ::before
    <div class="four columns">
        <p class="title">Procurement type</p>
        <p class="data strong">Services</p>
    </div>
  <div class="four columns">
      <p class="title">Reference</p>
      <p class="data strong">ANAJSKJD23423-Commission</p>
  </div>
  <div class="four columns">
      <p class="title">Funding Agency</p>
      <p class="data strong">Health Commission</p>
  </div>
  ::after
</div>
<div class="row">
    ::before
    ::after
</div>
<hr>
<div class="row">
    ::before
    <div class="twelve columns">
        <p class="title">Countries</p>
        <p class="data strong">
            <span class>Belgium</span>
            ", "
            <span class>France</span>
            ", "
            <span class>Luxembourg</span>
        </p>
        <p></p>
    </div>
    ::after
</div>

I want to check whether a <p class="title"> has the text Procurement type, and if so print Services.
Similarly, if the <p class="title"> has the text Reference, I want to print ANAJSKJD23423-Commission, and if it has the text Countries, I want to print all the countries, i.e. Belgium,France,Luxembourg.

I know I can extract all the texts from the <p class="data strong"> tags, append them to a list, and later fetch the values by index. But the order in which these <p class="title"> tags occur is not fixed: in some places Countries could appear before Procurement type. I therefore want to check the text of each title and then extract the text of the immediately following tag. I'm still new to BeautifulSoup, so any help is appreciated. Thanks.

Upvotes: 2

Views: 4757

Answers (3)

QHarr

Reputation: 84465

You can also use the :contains pseudo-class, available with bs4 4.7.1 and later. Although I have passed all the conditions as a single selector list, you can separate out each condition.

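from bs4 import BeautifulSoup as bs
import re

html = 'yourHTML'  # placeholder: replace with the page source
soup = bs(html, 'lxml')

# Each "p.title:contains(...) + p" selects the <p> immediately after a matching title
selector = ('p.title:contains("Procurement type") + p, '
            'p.title:contains(Reference) + p, '
            'p.title:contains(Countries) + p')
# Collapse the newlines and indentation inside multi-line values such as Countries
items = [re.sub(r'\n\s+', '', item.text.strip()) for item in soup.select(selector)]
print(items)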
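
If you want one value per title rather than a flat list, the same selector idea can be wrapped in a small helper. This is only a sketch: value_after_title is a hypothetical name, the sample HTML is trimmed from the question, and it assumes a soupsieve release (2.1 or later, as far as I know) that accepts :-soup-contains, the newer spelling of :contains.

```python
from bs4 import BeautifulSoup

# Trimmed sample taken from the question's markup
html = """
<div class="four columns">
    <p class="title">Procurement type</p>
    <p class="data strong">Services</p>
</div>
<div class="four columns">
    <p class="title">Reference</p>
    <p class="data strong">ANAJSKJD23423-Commission</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def value_after_title(soup, title):
    """Hypothetical helper: text of the <p> right after the matching title, or None."""
    # '+' is the adjacent-sibling combinator; ':-soup-contains' assumes a recent soupsieve
    node = soup.select_one(f'p.title:-soup-contains("{title}") + p')
    return node.get_text(strip=True) if node else None


procurement = value_after_title(soup, "Procurement type")
reference = value_after_title(soup, "Reference")
missing = value_after_title(soup, "Deadline")  # a title that is not on the page
print(procurement, reference, missing)
```

Returning None for a missing title lets the caller decide what to do when a field is absent from a given page.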

Upvotes: 2

KunduK

Reputation: 33384

You can do it in many ways. Here is one:

from bs4 import BeautifulSoup
htmldata='''<div class="row">
    ::before
    <div class="four columns">
        <p class="title">Procurement type</p>
        <p class="data strong">Services</p>
    </div>
  <div class="four columns">
      <p class="title">Reference</p>
      <p class="data strong">ANAJSKJD23423-Commission</p>
  </div>
  <div class="four columns">
      <p class="title">Funding Agency</p>
      <p class="data strong">Health Commission</p>
  </div>
  ::after
</div>
<div class="row">
    ::before
    ::after
</div>
<hr>
<div class="row">
    ::before
    <div class="twelve columns">
        <p class="title">Countries</p>
        <p class="data strong">
            <span class>Belgium</span>
            ", "
            <span class>France</span>
            ", "
            <span class>Luxembourg</span>
        </p>
        <p></p>
    </div>
    ::after
</div>'''

soup = BeautifulSoup(htmldata, 'html.parser')

# Find every title paragraph, then print the paragraph that follows a wanted title
items = soup.find_all('p', class_='title')
for item in items:
    if ('Procurement type' in item.text) or ('Reference' in item.text):
        print(item.find_next('p').text)
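
Since the question says the order of the titles is not fixed, another option is to collect every title/value pair into a dict once and look values up by name afterwards. A minimal sketch, assuming each title <p> has its value in the next sibling <p>; the fields name and the reordered sample HTML are mine, not from the page:

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the question's markup, with the blocks deliberately reordered
html = """
<div class="row">
    <div class="four columns">
        <p class="title">Reference</p>
        <p class="data strong">ANAJSKJD23423-Commission</p>
    </div>
    <div class="four columns">
        <p class="title">Funding Agency</p>
        <p class="data strong">Health Commission</p>
    </div>
    <div class="four columns">
        <p class="title">Procurement type</p>
        <p class="data strong">Services</p>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Map each title's text to the text of the <p> that follows it
fields = {}
for title in soup.find_all("p", class_="title"):
    value = title.find_next_sibling("p")  # skips whitespace text nodes between tags
    if value is not None:
        fields[title.get_text(strip=True)] = value.get_text(strip=True)

print(fields["Procurement type"])
print(fields["Reference"])
```

The lookup then no longer depends on where each block appears on the page.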

Upvotes: 5

chitown88

Reputation: 28565

You can add the string argument (called text in older versions of bs4) to match specific text when you use .find() or .find_all(), then use .next_sibling or find_next() to grab the following tags with the content.

For example:

soup.find('p', {'class': 'title'}, string='Procurement type')

Given:

html = '''<div class="row">
    ::before
    <div class="four columns">
        <p class="title">Procurement type</p>
        <p class="data strong">Services</p>
    </div>
  <div class="four columns">
      <p class="title">Reference</p>
      <p class="data strong">ANAJSKJD23423-Commission</p>
  </div>
  <div class="four columns">
      <p class="title">Funding Agency</p>
      <p class="data strong">Health Commission</p>
  </div>
  ::after
</div>
<div class="row">
    ::before
    ::after
</div>
<hr>
<div class="row">
    ::before
    <div class="twelve columns">
        <p class="title">Countries</p>
        <p class="data strong">
            <span class>Belgium</span>
            ", "
            <span class>France</span>
            ", "
            <span class>Luxembourg</span>
        </p>
        <p></p>
    </div>
    ::after
</div>'''

you could do something like:

from bs4 import BeautifulSoup     

soup = BeautifulSoup(html, 'html.parser')

alpha = soup.find('p', {'class': 'title'}, string='Procurement type')
# Iterate over the following <p> siblings only, skipping the whitespace text nodes between tags
for sibling in alpha.find_next_siblings('p'):
    print(sibling.text)

Output:

Services

or

ref = soup.find('p', {'class': 'title'}, string='Reference')
# Same approach: only the <p> siblings that follow the title
for sibling in ref.find_next_siblings('p'):
    print(sibling.text)

Output:

ANAJSKJD23423-Commission    

or

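countries = soup.find('p', {'class': 'title'}, string='Countries')
# Drop the literal '", "' separators, then split the remaining text into lines
names = countries.find_next('p', {'class': 'data strong'}).text.replace('", "', '').strip().split('\n')
# Keep only the non-blank lines, without their indentation
names = [name.strip() for name in names if name.strip()]

for country in names:
    print(country)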

Output:

Belgium
France
Luxembourg
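
As an alternative for the Countries case, you could read the <span> tags directly instead of cleaning up the combined text. This is a sketch on a trimmed copy of the Countries block (with plain <span> tags), assuming every country name sits in its own <span> as in the question:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the Countries block from the question
html = """
<div class="twelve columns">
    <p class="title">Countries</p>
    <p class="data strong">
        <span>Belgium</span>
        ", "
        <span>France</span>
        ", "
        <span>Luxembourg</span>
    </p>
    <p></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.find("p", class_="title", string="Countries")
value = title.find_next_sibling("p")

# Take one name per <span>; the separator text between the spans is ignored
countries = [span.get_text(strip=True) for span in value.find_all("span")]
print(",".join(countries))
```

This avoids depending on the exact separator text between the spans.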

Upvotes: 1
