Reputation: 65
I am using Selenium WebDriver and Beautiful Soup to scrape a website that has a variable number of pages. I am doing it crudely through XPath: the pager shows five page links at a time, and once the count reaches five I click the Next button and reset the XPath index to get the next five pages. To do this properly I need the total number of pages on the website through the code, or a better way of navigating between the pages.
I think the page uses AngularJS for navigation. The code is the following:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.maximize_window()
url = "https://www.bseindia.com/corporates/ann.html"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
html = soup.prettify()
with open('bseann.txt', 'w', encoding='utf-8') as f:
    f.write(html)
time.sleep(1)

i = 1  # index of pages navigated; kept at a maximum of 31 at present
k = 1  # goes up to 5, the maximum number of page links shown at one time
while i < 31:
    next_pg = 9  # xpath li index pointing to the "Next" button
    if i > 5:
        next_pg = 10  # in the next sets of pages there is an additional option
    if i == 6 or i == 11 or i == 16:  # reset the xpath index for each new set of pages
        k = 2
        snext_pg = str(next_pg).strip()
        path = '/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/ul/li[' + snext_pg + ']/a'
        next_page_btn = driver.find_elements_by_xpath(path)[0]
        next_page_btn.click()  # click the "Next" button
        time.sleep(1)
    pg_index = k + 2
    spg_index = str(pg_index).strip()
    path = '/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/ul/li[' + spg_index + ']/a'
    next_page_btn = driver.find_elements_by_xpath(path)[0]
    next_page_btn.click()  # click the specific page number
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    html = soup.prettify()
    i = i + 1
    k = k + 1
    with open('bseann.txt', 'a', encoding='utf-8') as f:
        f.write(html)
Upvotes: 0
Views: 283
Reputation: 28565
No need to use Selenium here as you can access the info from the API. This pulled 247 announcements:
import requests
from pandas.io.json import json_normalize

url = 'https://api.bseindia.com/BseIndiaAPI/api/AnnGetData/w'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
payload = {
    'strCat': '-1',
    'strPrevDate': '20190423',
    'strScrip': '',
    'strSearch': 'P',
    'strToDate': '20190423',
    'strType': 'C'}

jsonData = requests.get(url, headers=headers, params=payload).json()
df = json_normalize(jsonData['Table'])
# Wrap the attachment name in an Excel HYPERLINK formula so the CSV opens with clickable links
df['ATTACHMENTNAME'] = '=HYPERLINK("https://www.bseindia.com/xml-data/corpfiling/AttachLive/' + df['ATTACHMENTNAME'] + '")'
df.to_csv('C:/filename.csv', index=False)
Output:
...
GYSCOAL ALLOYS LTD. - 533275 - Announcement under Regulation 30 (LODR)-Code of Conduct under SEBI (PIT) Regulations, 2015
https://www.bseindia.com/xml-data/corpfiling/AttachLive/82f18673-de98-4a88-bbea-7d8499f25009.pdf
INDIAN SUCROSE LTD. - 500319 - Certificate Under Regulation 40(9) Of Listing Regulation For The Half Year Ended 31.03.2019
https://www.bseindia.com/xml-data/corpfiling/AttachLive/2539d209-50f6-4e56-a123-8562067d896e.pdf
Dhanvarsha Finvest Ltd - 540268 - Reply To Clarification Sought From The Company
https://www.bseindia.com/xml-data/corpfiling/AttachLive/f8d80466-af58-4336-b251-a9232db597cf.pdf
Prabhat Telecoms (India) Ltd - 540027 - Signing Of Framework Supply Agreement With METRO Cash & Carry India Private Limited
https://www.bseindia.com/xml-data/corpfiling/AttachLive/acfb1f72-efd3-4515-a583-2616d2942e78.pdf
...
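If you need more than one day, here is a minimal sketch of the same request over a wider window, continuing from the snippet above. Assumption: strPrevDate/strToDate act as a from/to range in YYYYMMDD format, which is not confirmed above (both were the same day in the example), and the start date 20190401 is hypothetical:
payload['strPrevDate'] = '20190401'  # hypothetical start of the window (assumption)
payload['strToDate'] = '20190423'    # end of the window
jsonData = requests.get(url, headers=headers, params=payload).json()
df = json_normalize(jsonData['Table'])
print(len(df), 'announcements')  # sanity check on the wider pull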
Upvotes: 1
Reputation: 193088
A bit more information about your use case would have helped to answer your question. However, to extract the total number of pages on the website you can open the site, click on the element with the text Last, and read the number of the then-active page, using the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--disable-extensions")
# options.add_argument('disable-infobars')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("https://www.bseindia.com/corporates/ann.html")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='Disclaimer']//following::div[1]//li[@class='pagination-last ng-scope']/a[@class='ng-binding' and text()='Last']"))).click()
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//a[text()='Disclaimer']//following::div[1]//li[@class='pagination-page ng-scope active']/a[@class='ng-binding']"))).get_attribute("innerHTML"))
Console Output:
17
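With that count in hand, a minimal sketch of walking every page, continuing from the code block above. Assumption: the pager also exposes First and Next items with pagination-first/pagination-next classes, named by analogy with the pagination-last and pagination-page classes used above; this is not confirmed by the markup shown:
total_pages = int(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//a[text()='Disclaimer']//following::div[1]//li[@class='pagination-page ng-scope active']/a[@class='ng-binding']"))).get_attribute("innerHTML"))
# Jump back to the first page (class name assumed by analogy)
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='Disclaimer']//following::div[1]//li[@class='pagination-first ng-scope']/a"))).click()
for page in range(total_pages):
    # parse or save driver.page_source for the current page here
    if page < total_pages - 1:
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='Disclaimer']//following::div[1]//li[@class='pagination-next ng-scope']/a"))).click()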
Upvotes: 0