Reputation: 73
I have this website https://www.serviceseeking.com.au/profile/106871-cld-electrical?source=
From this page, I am only interested in the name, address, ABN, and the licence number associated with VIC Energy Safe.
All of this information is inside <div class="row"> tags, but I am having trouble extracting each field separately.
Here's what I have been trying so far:
from bs4 import BeautifulSoup  # required to parse html
import requests  # required to make request
import re

html_text = requests.get('https://www.serviceseeking.com.au/profile/35359-baker-s-electrical-services-p-l?source=').text
soup = BeautifulSoup(html_text, 'lxml')

# Electrician Name
name = []
name = soup.find('div', class_="row mt20").text
print(f'Name: {name}')

# Licence Number
res = []
ln = soup.find_all('div', class_='row')
try:
    for item in ln:
        if ('VIC Energy Safe' in item.text):
            licence = item.select_one('div').text
            res = re.findall(r'Safe(\w+)', licence)[0]
            res = int(re.search(r'\d+', res).group(0))
            #print(res)
except:
    print(" ")
print("License Number=", res)
Output:
Name: David Baker
License Number= 29402
I have been using the same technique (as for the licence number) to extract the address and ABN.
The code works for this profile. However, I have 300+ profiles from this website, and it does not work for all of them. For example, it fails for this profile: https://www.serviceseeking.com.au/profile/197521-elcom-electrical-group?source=
Can someone give me a workable solution to extract all of this information with ease?
(PS: I think I should split the regex string, but I just don't know how)
Upvotes: 1
Views: 459
Reputation: 195438
Try:
import requests
from bs4 import BeautifulSoup

urls = [
    "https://www.serviceseeking.com.au/profile/106871-cld-electrical?source=",
    "https://www.serviceseeking.com.au/profile/197521-elcom-electrical-group?source=",
]

for url in urls:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    # name and address sit in the <div> that follows their icon
    name = soup.select_one(".ficon-user").find_next("div").get_text(strip=True)
    addr = (
        soup.select_one(".ficon-coverage").find_next("div").get_text(strip=True)
    )

    # the ABN is the text node right after <strong>ABN</strong>
    abn = soup.select_one('strong:-soup-contains("ABN")').find_next_sibling(
        string=True
    )

    # the licence value is the <div> next to the "VIC Energy Safe" label
    vic = soup.select_one(
        '.license-name:-soup-contains("VIC Energy Safe") + div'
    )
    vic = vic.get_text(strip=True) if vic else "N/A"

    print(name)
    print(addr)
    print(abn)
    print(vic)  # or print(vic.split("-")[-1]) if you want only the number
    print("-" * 80)
Prints:
Chris Donovan
Lilydale, VIC
94385612994
23635
--------------------------------------------------------------------------------
Emre Cekuc
Roxburgh Park, VIC
82689908730
REC-28370
--------------------------------------------------------------------------------
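If you need only the numeric part of the licence (the two profiles above give "23635" and "REC-28370"), a small helper along these lines might be enough. This is only a sketch: the name `licence_number` is made up for illustration, and it assumes the number is always the trailing run of digits in the string, which I have only checked against the two outputs shown here.

```python
import re
from typing import Optional


def licence_number(raw: Optional[str]) -> Optional[int]:
    """Return the trailing run of digits in a licence string, or None.

    Hypothetical helper; assumes the licence number is the last group
    of digits in the text (e.g. "REC-28370" or "23635").
    """
    if not raw:
        return None
    # anchor at the end so a prefix such as "REC-" is ignored
    match = re.search(r"(\d+)\s*$", raw)
    return int(match.group(1)) if match else None


# sample values taken from the output above, plus the "N/A" fallback
for raw in ["23635", "REC-28370", "N/A"]:
    print(raw, "->", licence_number(raw))
```

Returning None for "N/A" (or for a missing value) keeps profiles without a VIC Energy Safe licence easy to filter out when you loop over all 300+ pages.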
Upvotes: 1