Zubair Farooq

Reputation: 123

How can I extract data from a website using Beautiful Soup?

I'm trying to scrape data from a specific website, but my attempt fails because the data is wrapped in a complex HTML structure.

Here is my code:

import bs4
import requests

myUrl = "https://www.nbpharmacists.ca/site/findpharmacy"
data=requests.get(myUrl)
soup=bs4.BeautifulSoup(data.text,'html.parser')
records = soup.find('div', class_="col-sm-12")
for dvs in records:
  divs = dvs.find('div')
  print(divs)

Expected Result:

Pharmacy Name: Albert County Pharmacy

Pharmacy Manager: Chelsea Steeves

Certificate of Operation Number: P107

Address: 5883 King Street Riverside-Albert NB E4H 4B5

Phone: (506) 882-2226

Fax: (506) 882-2101

Website: albertcountypharmacy.ca

Conclusion:

My code does not produce the result I want. Please suggest the best possible solution.

Upvotes: 0

Views: 64

Answers (2)

Andrej Kesely

Reputation: 195418

One possible version of a scraping script:

import bs4
import requests

myUrl = "https://www.nbpharmacists.ca/site/findpharmacy"
data=requests.get(myUrl)
soup=bs4.BeautifulSoup(data.text,'html.parser')

# Each row of the roster table holds one pharmacy record.
for i, tr in enumerate(soup.select('.roster_tbl tr'), 1):
    title = tr.h2.strong.text.strip()
    # Each value is the text node right after its <strong>/<span> label.
    # ':-soup-contains' needs soupsieve >= 2.1 and string=True needs bs4 >= 4.4;
    # older releases spell these ':contains' and text=True.
    manager = tr.select_one('strong:-soup-contains("Pharmacy Manager:")').find_next_sibling(string=True).strip()
    certificate = tr.select_one('strong:-soup-contains("Certificate of Operation Number:")').find_next_sibling(string=True).strip()
    address = ' '.join(div.text.strip() for div in tr.select('td:last-child div'))

    # Phone, fax and website are optional, so fall back to '-' when absent.
    phone = tr.select_one('span:-soup-contains("Phone:")')
    if phone:
        phone = phone.find_next_sibling(string=True).strip()
    else:
        phone = '-'

    fax = tr.select_one('span:-soup-contains("Fax:")')
    if fax:
        fax = fax.find_next_sibling(string=True).strip()
    else:
        fax = '-'

    website = tr.select_one('strong:-soup-contains("Website:") + a[href]')
    if website:
        website = website['href']
    else:
        website = '-'

    print('** Pharmacy no.{} **'.format(i))
    print('Title:', title)
    print('Pharmacy Manager:', manager)
    print('Certificate of Operation Number:', certificate)
    print('Address:', address)
    print('Phone:', phone)
    print('Fax:', fax)
    print('Website:', website)
    print('*' * 80)

Prints:

** Pharmacy no.1 **
Title: Albert County Pharmacy
Pharmacy Manager: Chelsea Steeves
Certificate of Operation Number: P107
Address: 5883 King Street Riverside-Albert NB E4H 4B5
Phone: (506) 882-2226
Fax: (506) 882-2101
Website: http://albertcountypharmacy.ca
********************************************************************************
** Pharmacy no.2 **
Title: Bay Pharmacy
Pharmacy Manager: Mark Barry
Certificate of Operation Number: P157
Address: 5447 Route 117 Baie Ste Anne NB E9A 1E5
Phone: (506) 228-3880
Fax: (506) 228-3716
Website: -
********************************************************************************
** Pharmacy no.3 **
Title: Bayshore Pharmacy
Pharmacy Manager: Curtis Saunders
Certificate of Operation Number: P295
Address: 600 Main Street Suite C 150 Saint John NB E2K 1J5
Phone: (506) 799-4920
Fax: (855) 328-4736
Website: http://Bayshore Specialty Pharmacy
********************************************************************************

...and so on.
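
Note that the third record's website comes out as `http://Bayshore Specialty Pharmacy`, which is not a usable URL; the page apparently stores free text in that href. If that matters, one option is to validate the href before keeping it. Below is a minimal sketch using only the standard library. `clean_website` is a hypothetical helper name, and the rule it applies (http/https scheme plus a non-empty host without whitespace) is an assumption about what should count as valid here, not something the site documents.

```python
from urllib.parse import urlparse


def clean_website(href, default="-"):
    """Return href if it looks like a usable http(s) URL, else default.

    Hypothetical helper: the validity rule is an assumption.
    """
    if not href:
        return default
    href = href.strip()
    try:
        parsed = urlparse(href)
    except ValueError:
        # urlparse rejects a few malformed inputs (e.g. bad IPv6 brackets)
        return default
    host = parsed.netloc
    if parsed.scheme not in ("http", "https"):
        return default
    # A real host name never contains whitespace
    if not host or any(ch.isspace() for ch in host):
        return default
    return href


print(clean_website("http://albertcountypharmacy.ca"))
print(clean_website("http://Bayshore Specialty Pharmacy"))
print(clean_website(None))
```

In the loop, `website = clean_website(website['href'])` could then stand in for the raw `website['href']` lookup.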

Upvotes: 0

rrcal

Reputation: 3752

If you explore the page hierarchy, paying particular attention to ids, divs and tables, you should be able to find what you need. One option is shown below.


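import bs4
import requests

myUrl = "https://www.nbpharmacists.ca/site/findpharmacy"
data = requests.get(myUrl)
soup = bs4.BeautifulSoup(data.text, 'html.parser')

# All records sit inside the div with id="rosterRecords", one table per pharmacy
roster = soup.find('div', attrs={'id': 'rosterRecords'})
tables = roster.find_all('table')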

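result = []  # one dict per pharmacy

for table in tables:
    # First cell: manager and certificate number as labelled text
    info = table.find('td').find('p').text.strip()
    certificate = info.split('Certificate of Operation Number:')[-1].strip()
    manager = info.split('Pharmacy Manager:')[1]\
                    .split('Certificate of Operation Number:')[0].strip()
    # Last cell: address, followed by "Phone:" and "Fax:" entries
    addr = table.find_all('td')[-1].text.strip()
    address = addr.split('Phone:')[0].split('Fax:')[0].strip()
    # Guard the lookups in case a record lists no phone or fax number
    phone = addr.split('Phone:')[1].split('Fax:')[0].strip() if 'Phone:' in addr else None
    fax = addr.split('Fax:')[1].strip().split('\n')[0].strip() if 'Fax:' in addr else None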

    res = {
        'Pharmacy Name': table.find('h2').find('span').text.strip(),
        'Certificate of Operation Number': certificate,
        'Pharmacy Manager': manager,
        'Phone Number': phone,
        'Fax Number': fax,
        'Address': address,
    }

    try:
        # The website link is optional; find('a') returns None when it is missing
        res['website'] = table.find_all('td')[-1].find('a').get('href')
    except AttributeError:
        res['website'] = None
    result.append(res)  # append pharmacy info

print(result[0])

Output:
{'Pharmacy Name': 'Albert County Pharmacy',
 'Certificate of Operation Number': 'P107',
 'Pharmacy Manager': 'Chelsea Steeves',
 'Phone Number': '(506) 882-2226',
 'Fax Number': '(506) 882-2101',
 'Address': '5883 King Street \nRiverside-Albert NB E4H 4B5',
 'website': 'http://albertcountypharmacy.ca'}
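
Chained `split` calls are easy to get wrong when a label such as `Fax:` is absent, and the address above still contains a newline, unlike the single-line address in the question's expected result. One way to tidy both up is a small helper built on `str.partition`. This is only a sketch: `normalize` and `extract_field` are hypothetical names, and the sample text assumes the cell reads as address lines followed by `Phone:` and `Fax:` entries, as the output above suggests.

```python
def normalize(value):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(value.split())


def extract_field(text, label, next_labels=()):
    """Return the text after `label`, cut at the earliest of `next_labels`.

    Returns None when `label` does not occur, instead of raising IndexError.
    """
    _, found, rest = text.partition(label)
    if not found:
        return None
    for stop in next_labels:
        # partition leaves `rest` unchanged when `stop` is absent
        rest = rest.partition(stop)[0]
    return normalize(rest)


# Assumed layout of the last table cell, based on the output shown above
addr = ("5883 King Street \nRiverside-Albert NB E4H 4B5\n"
        "Phone: (506) 882-2226\nFax: (506) 882-2101")

address = normalize(addr.partition("Phone:")[0])
phone = extract_field(addr, "Phone:", ("Fax:",))
fax = extract_field(addr, "Fax:")

print(address)
print(phone)
print(fax)
```

A record without a fax number then yields `None` for `fax` rather than an exception.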

Upvotes: 1
