M.Akram

Reputation: 55

Need help with Python web scraping

I wrote a simple script to scrape the title, address, contact person, phone number and website link, but my program only scrapes the title. I don't know how to scrape the other fields because their elements have no classes or IDs.

Here is my code:

import requests
from bs4 import BeautifulSoup
import csv

def get_page(url):
    response = requests.get(url)

    if not response.ok:
        print('server responded:', response.status_code)
        return None  # nothing to parse, so avoid returning an undefined soup
    return BeautifulSoup(response.text, 'html.parser')

def get_detail_data(soup):

    try:
        title = soup.find('a',class_="ListingDetails_Level1_SITELINK",id=False).text
    except:
        title = 'empty'  
    print(title)
    try:
        address = soup.find('div',class_="ListingDetails_Level1_CONTACTINFO",id=False).find_all('span').text
    except:
        address = "address"
    print(address)
    try:
        person_name = soup.find('a',class_="",id=False).find_all('img').text
    except:
        person_name = "empty person"
    print(person_name)
    try:
        phone_no = soup.find('img',class_="",id=False).text
    except:
        phone_no = "empty phone no"
    print(phone_no)
    try:
        website = soup.find('a',class_="",id=False).text
    except:
        website = "empty website"
    print(website)




def main():
    url = "https://secure.kelownachamber.org/Pools-Spas/Rocky%27s-Reel-System-Inc-4751"
    get_detail_data(get_page(url))


if __name__ == '__main__':
    main()

Upvotes: 1

Views: 113

Answers (2)

Pulkit Bansal

Reputation: 106

The following code worked for me (it only shows how you can fetch the data from that website, so I kept it simple):

import requests
from bs4 import BeautifulSoup

url = "https://secure.kelownachamber.org/Pools-Spas/Rocky%27s-Reel-System-Inc-4751"
result = requests.get(url)
soup = BeautifulSoup(result.content, 'html.parser')

# The listing details sit inside divs with this class
divs = soup.find_all("div", attrs={"class": "ListingDetails_Level1_HEADERBOXBOX"})
for tag in divs:
    try:
        title = tag.find("a", attrs={"class": "ListingDetails_Level1_SITELINK"}).text
        address = tag.find("span", attrs={"itemprop": "street-address"}).text
        postal = tag.find("span", attrs={"itemprop": "postal-code"}).text
        maincontact = tag.find("span", attrs={"class": "ListingDetails_Level1_MAINCONTACT"}).text
        site_tag = tag.find("span", attrs={"class": "ListingDetails_Level1_VISITSITE"})
        site = site_tag.find("a").attrs['href']
    except (AttributeError, KeyError):
        # find() returned None or the link has no href: skip this div
        continue
    print(title)
    print(address)
    print(postal)
    print(maincontact)
    print(site)
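
If one missing field should not cause the whole listing to be skipped, a small helper can return a default value whenever find() comes back empty. The following is only a sketch: the class and itemprop names are taken from the code above, but the HTML fragment, its values, and the helper names text_or_default and parse_listing are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: attribute names come from the code above,
# but the nesting and the values are invented for this example.
SAMPLE_HTML = """
<div class="ListingDetails_Level1_HEADERBOXBOX">
  <a class="ListingDetails_Level1_SITELINK">Example Pools Ltd</a>
  <span itemprop="street-address">123 Example Rd</span>
  <span class="ListingDetails_Level1_MAINCONTACT">Jane Doe</span>
</div>
"""


def text_or_default(parent, name, attrs, default="empty"):
    """Return the stripped text of the first match, or default if there is none."""
    found = parent.find(name, attrs=attrs)
    return found.get_text(strip=True) if found is not None else default


def parse_listing(tag):
    """Collect the fields of one listing div into a dict."""
    return {
        "title": text_or_default(tag, "a", {"class": "ListingDetails_Level1_SITELINK"}),
        "address": text_or_default(tag, "span", {"itemprop": "street-address"}),
        # The sample has no postal code, so this field falls back to the default
        "postal": text_or_default(tag, "span", {"itemprop": "postal-code"}),
        "contact": text_or_default(tag, "span", {"class": "ListingDetails_Level1_MAINCONTACT"}),
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
listing = parse_listing(soup.find("div", attrs={"class": "ListingDetails_Level1_HEADERBOXBOX"}))
print(listing)
```

Returning a dict per listing also makes it straightforward to hand the rows to csv.DictWriter later, if that is where the data should end up.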

Upvotes: 4

Dziugas

Reputation: 1570

When the elements you're trying to scrape with Beautiful Soup have no classes or IDs, it can be hard to tell the find() method what you're looking for.

In that case I prefer to use either select() or select_one(), which are documented here. These methods accept a CSS selector: the very same syntax you use to tell your web browser which elements to style in a particular way.

You can find a reference for what selectors are available to you here. I cannot give you the exact CSS expression you'll need for your case because you haven't provided a sample of the HTML you're trying to scrape, but this should get you started.

For example, if the page you're trying to scrape looked like this:

<div id="contact">
    <div>
        <a href="ListingDetails_Level1_SITELINK">Some title</a>
    </div>

    <div>
        <p>1, Sesame St., Address...... </p>
    </div>
</div>

Then to get the address you could use a CSS selector like so:

address = soup.select_one("#contact > div:nth-child(2) > p")

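The above says that the address will be found by looking in the second div immediately within the div that has the id 'contact', and then in the paragraph immediately within that. Note that select_one() returns the matching tag (or None if nothing matches), so read its .text attribute to get the address string itself.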
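
To make this concrete, here is a minimal, self-contained sketch that runs the selector against the sample markup above. The markup is the illustrative sample, not the real page, so the selectors would need adjusting for the actual HTML. It also shows select(), which returns a list of all matches rather than only the first.

```python
from bs4 import BeautifulSoup

# The illustrative sample from above, not the real page's markup
html = """
<div id="contact">
    <div>
        <a href="ListingDetails_Level1_SITELINK">Some title</a>
    </div>

    <div>
        <p>1, Sesame St., Address...... </p>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first matching tag, or None when nothing matches
address_tag = soup.select_one("#contact > div:nth-child(2) > p")
address = address_tag.get_text(strip=True) if address_tag is not None else None
print(address)

# select() returns a list of every match; here, all links inside #contact
titles = [a.get_text(strip=True) for a in soup.select("#contact a")]
print(titles)
```

Checking for None before calling get_text() avoids an AttributeError when the page layout differs from what the selector expects.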

Upvotes: 2
