Sunny Prakash

Reputation: 107

Dynamic Web scraping

I am trying to scrape this page ("http://www.arohan.in/branch-locator.php"): when I select a state and a city, an address is displayed, and I have to write the state, city and address to a CSV/Excel file. I have got as far as the step below, but now I am stuck.

Here is my code:

from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait

chrome_path = r"C:\Users\IBM_ADMIN\Downloads\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.arohan.in/branch-locator.php")
select = Select(driver.find_element_by_name('state'))
select.select_by_visible_text('Bihar')
drop = Select(driver.find_element_by_name('branch'))
city_option = WebDriverWait(driver, 5).until(
    lambda x: x.find_element_by_xpath("//select[@id='city1']/option[text()='Gaya']"))
city_option.click()

Upvotes: 0

Views: 149

Answers (3)

Martin Evans

Reputation: 46779

A better approach would be to avoid using selenium. Selenium is useful when you need JavaScript to run in order to render the HTML, but that is not the case here: the required information is already contained in the HTML the server returns.

What is needed is to first make a request to get the page containing all of the states. Then, for each state, request the list of branches. Then, for each state/branch combination, a URL request can be made to get the HTML containing the address. This happens to be contained in the second <li> entry following a <ul class='address_area'> entry:

from bs4 import BeautifulSoup
import requests
import csv
import time

# Get a list of available states
r = requests.get('http://www.arohan.in/branch-locator.php')
soup = BeautifulSoup(r.text, 'html.parser')
state_select = soup.find('select', id='state1')
states = [option.text for option in state_select.find_all('option')[1:]]

# Open an output CSV file
with open('branch addresses.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['State', 'Branch', 'Address'])

    # For each state determine the available branches
    for state in states:
        r_branches = requests.post('http://www.arohan.in/Ajax/ajax_branch.php', data={'ajax_state':state})
        soup = BeautifulSoup(r_branches.text, 'html.parser')

        # For each branch, request the page containing the address
        for option in soup.find_all('option')[1:]:
            time.sleep(0.5)     # Reduce server loading
            branch = option.text
            print("{}, {}".format(state, branch))
            r_branch = requests.get('http://www.arohan.in/branch-locator.php', params={'state':state, 'branch':branch})
            soup_branch = BeautifulSoup(r_branch.text, 'html.parser')
            ul = soup_branch.find('ul', class_='address_area')

            if ul:
                address = ul.find_all('li')[1].get_text(strip=True)
                row = [state, branch, address]
                csv_output.writerow(row)
            else:
                print(soup_branch.title)

Giving you an output CSV file starting:

State,Branch,Address
West Bengal,Kolkata,"PTI Building, 4th Floor,DP Block, DP-9, Salt Lake CityCalcutta, 700091"
West Bengal,Maheshtala,"Narmada Park, Par Bangla,Baddir Bandh Bus Stop,Opp Lane Kismat Nungi Road,Maheshtala,Kolkata- 700140. (W.B)"
West Bengal,ShyamBazar,"First Floor, 6 F.b.T. Road,Ward No.-6,Kolkata-700002"

The time.sleep(0.5) call slows the script down to avoid placing too much load on the server.

Note: [1:] is used because the first item in the drop-down lists is not a branch or state but a placeholder Select Branch entry.
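As a tiny illustration of that slicing (a hypothetical option list, not the live page data):

```python
# Option texts as they would come out of the drop-down:
# the first entry is a placeholder, not a real state.
options = ['Select State', 'Bihar', 'West Bengal']

# [1:] drops the placeholder so only real values remain
states = options[1:]
print(states)  # ['Bihar', 'West Bengal']
```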

Upvotes: 0

SIM

Reputation: 22440

In a slightly more organized manner:

import requests
from bs4 import BeautifulSoup

link = "http://www.arohan.in/branch-locator.php?"


def get_links(session, url, payload):
    session.headers["User-Agent"] = "Mozilla/5.0"
    res = session.get(url, params=payload)
    soup = BeautifulSoup(res.text, "lxml")
    items = [item.text for item in soup.select(".address_area p")]
    print(items)

if __name__ == '__main__':
    for st, br in zip(['Bihar', 'West Bengal'], ['Gaya', 'Kolkata']):
        payload = {
            'state': st,
            'branch': br
        }
        with requests.Session() as session:
            get_links(session, link, payload)

Output:

['Branch', 'House no -10/12, Ward-18, Holding No-12, Swarajpuri Road, Near Bank of Baroda, Gaya Pin 823001(Bihar)', 'N/A', 'N/A']
['Head Office', 'PTI Building, 4th Floor, DP Block, DP-9, Salt Lake City Calcutta, 700091', '+91 33 40156000', '[email protected]']

Upvotes: 1

OutRideACrisis

Reputation: 185

Is selenium necessary? It looks like you can use URLs to arrive at what you want: http://www.arohan.in/branch-locator.php?state=Assam&branch=Mirza.

Get a list of the state / branch combinations then use the beautiful soup tutorial to get the info from each page.
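A minimal sketch of building those URLs with the standard library (branch_url is a hypothetical helper; the state/branch pair is just the example above):

```python
from urllib.parse import urlencode

BASE = 'http://www.arohan.in/branch-locator.php'

def branch_url(state, branch):
    # Compose the branch-locator URL for one state/branch pair
    return BASE + '?' + urlencode({'state': state, 'branch': branch})

print(branch_url('Assam', 'Mirza'))
# http://www.arohan.in/branch-locator.php?state=Assam&branch=Mirza
```

Each such URL can then be fetched with requests and parsed with BeautifulSoup, as in the other answers.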

Upvotes: 2
