Meghna Panda

Reputation: 73

Extracting multiple pieces of information from div tags that share the same class name

I am scraping this profile page: https://www.serviceseeking.com.au/profile/106871-cld-electrical?source=

From it, I am only interested in the name, address, ABN, and the licence number associated with VIC Energy Safe.


All of this information sits inside <div class="row"> tags, but I am having trouble extracting each piece separately.

Here's what I have been trying so far:

from bs4 import BeautifulSoup  # required to parse html
import requests  # required to make request
import re

html_text = requests.get('https://www.serviceseeking.com.au/profile/35359-baker-s-electrical-services-p-l?source=').text
soup = BeautifulSoup(html_text,'lxml')

#Electrician Name
name=[]
name = soup.find('div', class_ = "row mt20").text
print(f'Name: {name}')

#Licence Number
res=[]
ln=soup.find_all('div', class_='row')
try:
    for item in ln:
        if ('VIC Energy Safe' in item.text):
            licence = item.select_one('div').text
            res = re.findall(r'Safe(\w+)', licence)[0]
            res = int(re.search(r'\d+', res).group(0))
            #print(res)
            
except:
    print(" ")

print("License Number=",res)

Output:

Name: David Baker
License Number= 29402

I have been using the same technique as for the licence number to extract the address and ABN.

The code seems to work fine for this profile. However, I have 300+ profiles from this website, and it does not work for all of them. For example, it fails for this profile: https://www.serviceseeking.com.au/profile/197521-elcom-electrical-group?source=

Can someone give me a workable solution to extract all of this information with ease?

(PS: I think I should split the regex string, but I just don't know how)

Upvotes: 1

Views: 459

Answers (1)

Andrej Kesely

Reputation: 195438

Try:

from bs4 import BeautifulSoup
import requests

urls = [
    "https://www.serviceseeking.com.au/profile/106871-cld-electrical?source=",
    "https://www.serviceseeking.com.au/profile/197521-elcom-electrical-group?source=",
]

for url in urls:

    soup = BeautifulSoup(requests.get(url).content, "html.parser")

    name = soup.select_one(".ficon-user").find_next("div").get_text(strip=True)
    addr = (
        soup.select_one(".ficon-coverage").find_next("div").get_text(strip=True)
    )

    abn = soup.select_one('strong:-soup-contains("ABN")').find_next_sibling(
        text=True
    )

    vic = soup.select_one(
        '.license-name:-soup-contains("VIC Energy Safe") + div'
    )
    vic = vic.get_text(strip=True) if vic else "N/A"

    print(name)
    print(addr)
    print(abn)
    print(vic)  # or print(vic.split("-")[-1])  if you want only the number

    print("-" * 80)

Prints:

Chris Donovan
Lilydale, VIC
94385612994
23635
--------------------------------------------------------------------------------
Emre Cekuc
Roxburgh Park, VIC
82689908730
REC-28370
--------------------------------------------------------------------------------
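
If only the numeric part of the licence is wanted, a regular expression may be a little more forgiving than splitting on "-", since the printed values come both with and without a prefix (23635 and REC-28370). A minimal sketch, where the helper name licence_digits is made up for illustration and the "take the last run of digits" rule is an assumption about how licence strings are formatted:

```python
import re


def licence_digits(text):
    """Return the last run of digits in a licence string, or None if there is none.

    Hypothetical helper: assumes the licence number is the final group of digits.
    """
    runs = re.findall(r"\d+", text)
    return runs[-1] if runs else None


print(licence_digits("23635"))      # no prefix
print(licence_digits("REC-28370"))  # prefixed form
print(licence_digits("N/A"))        # placeholder used when the licence is missing
```

The result stays a string so that any leading zeros survive; wrap it in int() if a number is needed.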

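With 300+ profiles, some pages may lack one of the icons, in which case select_one returns None and the chained .find_next call raises AttributeError. One possible guard is sketched below. The helper name text_after and the sample markup are assumptions for illustration only, not the site's actual HTML:

```python
from bs4 import BeautifulSoup


def text_after(soup, selector, default="N/A"):
    """Text of the first <div> that follows the element matched by selector.

    Hypothetical helper: returns default when either element is missing.
    """
    marker = soup.select_one(selector)
    if marker is None:
        return default
    target = marker.find_next("div")
    return target.get_text(strip=True) if target else default


# Made-up markup for illustration; the real profile page is much larger.
sample = '<i class="ficon-user"></i><div> Chris Donovan </div>'
soup = BeautifulSoup(sample, "html.parser")

print(text_after(soup, ".ficon-user"))      # element present
print(text_after(soup, ".ficon-coverage"))  # element absent, falls back to default
```

The same pattern already appears above for the VIC Energy Safe lookup (vic ... if vic else "N/A"); this just applies it to the name and address fields as well.
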
Upvotes: 1
