MBWD

Reputation: 142

How to search for specific word using BS4, then get text in same element immediately after that word?

I'm very new to BeautifulSoup and to Python. I am crawling some pages where sometimes a phone number is given and sometimes it is not. If it's there, I want to scrape it. The HTML is very simple:

<div>
    <p>Email: [email protected]</p>
    <p>Telephone: 1234567890</p>
    <p>Postal code: B3H 2F5</p>

</div>

I am checking to see if the phone number is there like this:

phoneNumber = soup.find(string='Telephone:')
if phoneNumber:
    phoneNumber = # Some code here to get the actual number 
else:
    phoneNumber = ('None')
print (phoneNumber)

There are usually several other p tags in that div, but the same ones aren't always there, so I can't rely on them as reference points. The phone number doesn't always follow the same pattern, either. The best I can do is identify that a phone number is always preceded by 'Telephone:' and is wrapped in a p tag. This seems to be the only surefire way to locate it.

What I don't understand is how to get the actual phone number, that is, anything in the <p> tag after 'Telephone:'.

How do I get the numbers in this element after the word 'Telephone:'?

Upvotes: 0

Views: 1017

Answers (2)

MBWD

Reputation: 142

As it turns out I was ignorant of a better way to search for this string using re.compile. I'll post the answer here in case anyone else is looking for the same thing.

What worked for me is:

import re

phoneNumber = soup.find('p', text=re.compile('Telephone:'))

This gives me the whole p tag the string appears in, which I did not realize at first, so then I can do:

if phoneNumber:
    # Remove the label first, then strip, so the space after the colon is dropped too
    phoneNumber = phoneNumber.get_text().replace('Telephone:', '').strip()
else:
    phoneNumber = 'None'
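To illustrate why the search in the question found nothing while the regex version works, here is a small sketch. It assumes bs4 4.4.0 or later, where `string=` is the current name of the `text=` argument used above, and it uses a trimmed copy of the sample HTML. A plain string has to equal the entire text node, whereas a compiled pattern only has to match somewhere inside it.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the sample markup from the question
html = """<div>
    <p>Telephone: 1234567890</p>
    <p>Postal code: B3H 2F5</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, so this finds nothing.
exact = soup.find(string="Telephone:")

# A compiled pattern only needs to match part of the text.
# Without a tag name, the result is the text node itself.
text_node = soup.find(string=re.compile("Telephone:"))

# With a tag name, the result is the <p> tag whose text matches.
p_tag = soup.find("p", string=re.compile("Telephone:"))

print(exact)            # None
print(repr(text_node))  # the matching text node
print(p_tag.name)       # p
```

Because the last call returns the tag rather than the text node, `get_text()` is available on the result, which is what the snippet above relies on.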

Upvotes: 0

Sebastien D

Reputation: 4482

With a regular expression you can directly find the <p> tag that contains the phone number:

import re
from bs4 import BeautifulSoup

html = """<div>
    <p>Email: [email protected]</p>
    <p></p>
    <p>Postal code: B3H 2F5</p>
    <p>Telephone: 1234567890</p>
</div>"""

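# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')

# Find the <p> tag whose text contains "Telephone:"
phone_tag = soup.find('p', text=re.compile('Telephone:'))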

if phone_tag:
    phone = phone_tag.text.replace('Telephone:','').strip()
else:
    phone = None
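Since the question mentions that the number does not always follow the same pattern, one option is to capture everything after the label instead of removing the label with `replace`. The sketch below uses only the standard `re` module and works on the paragraph text (for example `phone_tag.text`). The helper name `text_after_label` and the second sample number are made up for illustration.

```python
import re

def text_after_label(text, label="Telephone:"):
    """Return whatever follows `label` in `text`, stripped, or None if absent."""
    # re.escape keeps characters in the label from being read as regex syntax
    match = re.search(re.escape(label) + r"\s*(.*)", text, flags=re.DOTALL)
    if match is None:
        return None
    value = match.group(1).strip()
    # Treat a label with nothing after it the same as a missing label
    return value or None

print(text_after_label("Telephone: 1234567890"))
print(text_after_label("  Telephone:   (902) 555-0100 ext. 4 "))
print(text_after_label("Postal code: B3H 2F5"))
```

The helper returns `None` both when the label is missing and when nothing follows it, so the caller only has one "no number" case to handle.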

Upvotes: 1
