Reputation: 142
I'm very new to BeautifulSoup and to Python. I am crawling some pages where sometimes a phone number is given and sometimes it is not. If it's there, I want to scrape it. The HTML is very simple:
<div>
<p>Email: [email protected]</p>
<p>Telephone: 1234567890</p>
<p>Postal code: B3H 2F5</p>
</div>
I am checking to see if the phone number is there like this:
phoneNumber = soup.find(string='Telephone:')
if phoneNumber:
    phoneNumber = ...  # Some code here to get the actual number
else:
    phoneNumber = 'None'
print(phoneNumber)
There are usually several other p tags in that div, but the same ones aren't always there, so I can't rely on them as reference points. The phone number doesn't always follow the same pattern, either. The best I can do is identify that a phone number is always preceded by 'Telephone:' and is wrapped in a p tag. This seems to be the only surefire way to locate it.
What I don't understand is how to get the actual phone number, that is, anything in the <p> tag after 'Telephone:'.
How do I get the numbers in this element after the word 'Telephone:'?
Upvotes: 0
Views: 1017
Reputation: 142
As it turns out, I was unaware of a better way to search for this string: using re.compile. I'll post the answer here in case anyone else is looking for the same thing.
What worked for me is:
phoneNumber = soup.find('p', text = re.compile('Telephone:'))
This gives me the whole p tag the string appears in, which I did not realize at first, so then I can do:
if phoneNumber:
    # Remove the label first, then strip, so no leading space is left behind
    phoneNumber = phoneNumber.get_text().replace('Telephone:', '').strip()
else:
    phoneNumber = 'None'
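For anyone comparing the two searches, here is a minimal sketch of why the exact-string lookup from the question finds nothing while the compiled pattern does. The sample HTML is an assumption modelled on the question (the email address is a placeholder), and `html.parser` is just the built-in parser choice; `string=` is the newer name for the `text=` argument.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the email address is a placeholder.
html = """<div>
<p>Email: someone@example.com</p>
<p>Telephone: 1234567890</p>
<p>Postal code: B3H 2F5</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node, so this finds nothing:
# the node is 'Telephone: 1234567890', not 'Telephone:'.
exact = soup.find(string="Telephone:")

# A compiled pattern is applied as a search, so a partial match is enough.
# Passing 'p' as well makes find() return the tag rather than the text node.
tag = soup.find("p", string=re.compile("Telephone:"))

# Drop the label, then strip the surrounding whitespace.
phone_number = tag.get_text().replace("Telephone:", "").strip() if tag else None

print(exact)
print(phone_number)
```

With this sample, the first search comes back empty and the second yields the bare number.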
Upvotes: 0
Reputation: 4482
With a regular expression you can directly find the <p> tag containing the phone number:
import re
from bs4 import BeautifulSoup
html = """<div>
<p>Email: [email protected]</p>
<p></p>
<p>Postal code: B3H 2F5</p>
<p>Telephone: 1234567890</p>
</div>"""
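# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
# Find the <p> tag whose text contains "Telephone:"
phone_tag = soup.find('p', text=re.compile('Telephone:'))
if phone_tag:
    phone = phone_tag.text.replace('Telephone:', '').strip()
else:
    phone = None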
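If the number can appear in different formats, or the label is sometimes wrapped in another tag such as <b>, one possible variation is to run the pattern against each paragraph's full text and capture whatever follows the label. This is only a sketch: `extract_phone` and `PHONE_RE` are hypothetical names, not part of BeautifulSoup, and it assumes the number sits on the same line as 'Telephone:'.

```python
import re
from bs4 import BeautifulSoup

# Capture everything after the label on the same line (hypothetical pattern name).
PHONE_RE = re.compile(r"Telephone:\s*(.*)")


def extract_phone(html):
    """Return the text following 'Telephone:' in the first matching <p>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        # get_text() joins nested text too, so <p><b>Telephone:</b> 123</p> still works,
        # whereas a string= filter skips tags that have more than one child node.
        match = PHONE_RE.search(p.get_text())
        if match:
            # Treat an empty value after the label as "no phone number".
            return match.group(1).strip() or None
    return None


print(extract_phone("<div><p>Telephone: (902) 555-0100</p></div>"))
print(extract_phone("<div><p>Postal code: B3H 2F5</p></div>"))
```

Returning None rather than the string 'None' lets the caller test for a missing number with a plain `if`.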
Upvotes: 1