Reputation: 5283
I am trying to crawl data from a website, but there is a "load more" button that reveals the next 50 records, and it has to be clicked repeatedly until the records run out.
I am only able to fetch the first 50 names and addresses; I need to fetch all of them until "load more" is exhausted.
To click the button dynamically I am using Selenium with Python.
I want to find the name, address, and contact number of all the retailers, city wise.
My attempt:
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

url = "https://www.test.in/chemists/medical-store/gujarat/surat"

browser = webdriver.Chrome()
browser.get(url)
time.sleep(1)
html = browser.page_source
soup = BeautifulSoup(html, "lxml")

try:
    for row in soup.find_all("div", {"class": "listing "}):
        #print(row.get_text())
        name = row.h3.a.string
        address = row.p.get_text()
        #contactnumber = need to find (can view after click on retailer name)
        print(name)
        print(address)
        print(contactnumber)
    button = browser.find_element_by_id("loadmore")
    button.click()
except TimeoutException as ex:
    isrunning = 0
    #browser.close()
    #browser.quit()
Upvotes: 1
Views: 345
Reputation: 4596
If you inspect the network calls that are made when you hit "load more", you can see that it is a POST request whose parameters are the city, state, and page number. So instead of driving the page with Selenium, you can do it with the plain requests module. For example, this function performs the "load more" step for you as you iterate through the pages:
import requests

def hitter(page):
    url = "https://www.healthfrog.in/importlisting.html"
    payload = "page=" + str(page) + "&mcatid=chemists&keyword=medical-store&state=gujarat&city=surat"
    headers = {
        'content-type': "application/x-www-form-urlencoded",
        'connection': "keep-alive",
        'cache-control': "no-cache"
    }
    response = requests.request("POST", url, data=payload, headers=headers)
    return response.text
The above function fetches the HTML of a listing page, which contains the names and addresses. Now you can iterate through the pages until you hit one that returns no content. For example, if you try the state Karnataka with the city Mysore, you will notice the difference between the third and fourth pages; that tells you where to stop.
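A minimal sketch of that stop condition, assuming (as above) that an empty or whitespace-only response body means the listing is exhausted:

page = 1
pages = []
while True:
    html = hitter(page)
    if not html.strip():  # blank response: no more records for this city
        break
    pages.append(html)
    page += 1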
To get the phone numbers, you need to request the pages linked from the <h3> tags of the bulk listing response (the previous response). Example HTML:
<div class="listing">
    <h3>
        <a href="https://www.healthfrog.in/chemists/sunny-medical-store-surat-v8vcr3alr.html">Sunny Medical Store</a>
    </h3>
    <p>
        <i class="fa fa-map-marker"></i>155 - Shiv Shakti Society, Punagam, , Surat, Gujarat- 394210,India
    </p>
</div>
You will need to parse that page's HTML to find where the phone number sits, and then you can populate it. You can request this example page using:

html = requests.get('https://www.healthfrog.in/chemists/sunny-medical-store-surat-v8vcr3alr.html').text

You can then parse the html with BeautifulSoup, as you have done earlier.
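As a minimal sketch of that extraction, assuming the detail page puts the number right after an <i class="fa fa-mobile"> icon (the same assumption the working scraper below makes); extract_phone is a hypothetical helper name:

import re
import requests
from bs4 import BeautifulSoup

def extract_phone(link):
    # Assumption: a 10-digit number follows the fa-mobile icon on the detail page.
    soup = BeautifulSoup(requests.get(link).text, "html.parser")
    icon = soup.find('i', class_='fa fa-mobile')
    if icon is None:
        return None
    match = re.search(r'(\d{10})', str(icon.next_element))
    return match.group(1) if match else None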
Doing it with requests instead of Selenium has many advantages here: you need not open and close browser windows each time you need a phone number, and you avoid stale elements every time you hit "load more". It is also much faster.
Please note: if you are scraping like this, abide by the rules set by the site. Do not crash it by sending too many requests.
Edit: Working scraper.
import requests, time, re
from bs4 import BeautifulSoup

def hitter(page, state="Gujarat", city="Surat"):
    # POST the same parameters the "load more" button sends.
    url = "https://www.healthfrog.in/importlisting.html"
    payload = "page=" + str(page) + "&mcatid=chemists&keyword=medical-store&state=" + state + "&city=" + city
    headers = {
        'content-type': "application/x-www-form-urlencoded",
        'connection': "keep-alive",
        'cache-control': "no-cache"
    }
    response = requests.request("POST", url, data=payload, headers=headers)
    return response.text

def getPhoneNo(link):
    # Fetch the detail page and pull the 10-digit number after the mobile icon.
    time.sleep(3)  # throttle requests to the detail pages
    soup1 = BeautifulSoup(requests.get(link).text, "html.parser")
    f = soup1.find('i', class_='fa fa-mobile').next_element
    try:
        phone = re.search(r'(\d{10})', f).group(1)
    except AttributeError:
        phone = None  # no 10-digit number found on the page
    return phone

def getChemists(soup):
    # Collect the name, address and phone for every listing on one page.
    stores = []
    for row in soup.find_all("div", {"class": "listing"}):
        dummy = {
            'name': row.h3.string,
            'address': row.p.get_text(),
            'phone': getPhoneNo(row.h3.a.get_attribute_list('href')[0])
        }
        print(dummy)
        stores.append(dummy)
    return stores

if __name__ == '__main__':
    page, chemists = 1, []
    city, state = 'Gulbarga', 'Karnataka'
    html = hitter(page, state, city)
    condition = not re.match(r'\A\s*\Z', html)  # stop on a blank response
    while condition:
        soup = BeautifulSoup(html, 'html.parser')
        chemists += getChemists(soup)
        page += 1
        html = hitter(page, state, city)
        condition = not re.match(r'\A\s*\Z', html)
    print(chemists)
Upvotes: 1