Ravi

Reputation: 65

I am scraping a site using Selenium and BeautifulSoup. I need the total number of pages on the website, or another way to navigate between pages

I am using Selenium WebDriver and Beautiful Soup to scrape a website that has a variable number of pages. I am doing it crudely through XPath: the page shows five page links at a time, and once my count reaches five I press the Next button and reset the XPath count to get the next five pages. For this I need the total number of pages on the website from the code, or a better way of navigating between the pages.

I think the page uses AngularJS for navigation. The code is the following:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.maximize_window()
url = "https://www.bseindia.com/corporates/ann.html"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
html = soup.prettify()
with open('bseann.txt', 'w', encoding='utf-8') as f:
    f.write(html)
time.sleep(1)

i = 1  # index of the page being navigated; kept at a maximum of 31 at present
k = 1  # position within the set of five page links shown at one time
while i < 31:
    next_pg = 9  # li index of the "next" link in the pagination list
    if i > 5:
        next_pg = 10  # later sets of pages show an additional option, shifting the index
        if (i - 1) % 5 == 0:  # first page of a new set of five: click "next", reset k
            k = 2
            path = ('/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/ul/li['
                    + str(next_pg) + ']/a')
            next_page_btn = driver.find_elements_by_xpath(path)[0]
            next_page_btn.click()  # click the "next" button
            time.sleep(1)
    pg_index = k + 2  # li index of the specific page-number link
    path = ('/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/ul/li['
            + str(pg_index) + ']/a')
    next_page_btn = driver.find_elements_by_xpath(path)[0]
    next_page_btn.click()  # click the specific page number
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    html = soup.prettify()
    i = i + 1
    k = k + 1
    with open('bseann.txt', 'a', encoding='utf-8') as f:
        f.write(html)

Upvotes: 0

Views: 283

Answers (2)

chitown88

Reputation: 28565

No need to use Selenium here, as you can access the info from the API. This pulled 247 announcements:

import requests
import pandas as pd

url = 'https://api.bseindia.com/BseIndiaAPI/api/AnnGetData/w'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

payload = {
    'strCat': '-1',
    'strPrevDate': '20190423',  # dates are in YYYYMMDD format
    'strScrip': '',
    'strSearch': 'P',
    'strToDate': '20190423',
    'strType': 'C'}

jsonData = requests.get(url, headers=headers, params=payload).json()

df = pd.json_normalize(jsonData['Table'])
# Wrap each attachment name in an Excel HYPERLINK formula pointing at the PDF
df['ATTACHMENTNAME'] = '=HYPERLINK("https://www.bseindia.com/xml-data/corpfiling/AttachLive/' + df['ATTACHMENTNAME'] + '")'

df.to_csv('C:/filename.csv', index=False)

Output:

...

GYSCOAL ALLOYS LTD. - 533275 - Announcement under Regulation 30 (LODR)-Code of Conduct under SEBI (PIT) Regulations, 2015
https://www.bseindia.com/xml-data/corpfiling/AttachLive/82f18673-de98-4a88-bbea-7d8499f25009.pdf

INDIAN SUCROSE LTD. - 500319 - Certificate Under Regulation 40(9) Of Listing Regulation For The Half Year Ended 31.03.2019
https://www.bseindia.com/xml-data/corpfiling/AttachLive/2539d209-50f6-4e56-a123-8562067d896e.pdf

Dhanvarsha Finvest Ltd - 540268 - Reply To Clarification Sought From The Company
https://www.bseindia.com/xml-data/corpfiling/AttachLive/f8d80466-af58-4336-b251-a9232db597cf.pdf

Prabhat Telecoms (India) Ltd - 540027 - Signing Of Framework Supply Agreement With METRO Cash & Carry India Private Limited
https://www.bseindia.com/xml-data/corpfiling/AttachLive/acfb1f72-efd3-4515-a583-2616d2942e78.pdf

...
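
If you need announcements for more than one day, note that strPrevDate and strToDate in the payload look like a from/to date range in YYYYMMDD format. Here is a minimal sketch that queries one day at a time and concatenates the results (whether the endpoint accepts a wider range in a single call is untested, so the per-day loop is an assumption):

import requests
import pandas as pd
from datetime import date, timedelta

url = 'https://api.bseindia.com/BseIndiaAPI/api/AnnGetData/w'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}

frames = []
day = date(2019, 4, 22)
while day <= date(2019, 4, 23):
    stamp = day.strftime('%Y%m%d')  # the API expects YYYYMMDD strings
    payload = {'strCat': '-1', 'strPrevDate': stamp, 'strScrip': '',
               'strSearch': 'P', 'strToDate': stamp, 'strType': 'C'}
    jsonData = requests.get(url, headers=headers, params=payload).json()
    frames.append(pd.json_normalize(jsonData['Table']))
    day += timedelta(days=1)

df = pd.concat(frames, ignore_index=True)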

Upvotes: 1

undetected Selenium

Reputation: 193088

A bit more information about your use case would have helped to answer your question. However, to extract the total number of pages on the website, you can open the site, click the element with the text Last, and then read the number of the page that becomes active. You can use the following solution:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    options = webdriver.ChromeOptions() 
    options.add_argument("start-maximized")
    options.add_argument("--disable-extensions")
    # options.add_argument('disable-infobars')
    driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://www.bseindia.com/corporates/ann.html")
    # Click on "Last" to jump to the final page of results
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[text()='Disclaimer']//following::div[1]//li[@class='pagination-last ng-scope']/a[@class='ng-binding' and text()='Last']"))).click()
    # The page number that is now active is the total page count
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//a[text()='Disclaimer']//following::div[1]//li[@class='pagination-page ng-scope active']/a[@class='ng-binding']"))).get_attribute("innerHTML"))
    
  • Console Output:

    17
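
Once you have the total page count, you could iterate through the pages and parse each one, e.g. with BeautifulSoup. A minimal sketch building on the snippet above; the XPath for the Next link is an assumption based on the pagination-last / pagination-page class names seen in the real XPaths:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()
    driver.get("https://www.bseindia.com/corporates/ann.html")
    wait = WebDriverWait(driver, 20)
    # wait until the (assumed) pagination list has rendered
    wait.until(EC.presence_of_element_located((By.XPATH, "//li[contains(@class, 'pagination-page')]")))

    total_pages = 17  # e.g. the value printed by the snippet above
    for page in range(1, total_pages + 1):
        # parse the announcements on the current page
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # ... extract the fields you need from soup here ...
        if page < total_pages:
            # assumed XPath: a Next link inside an li with a
            # pagination-next class, analogous to pagination-last above
            wait.until(EC.element_to_be_clickable((By.XPATH,
                "//li[contains(@class, 'pagination-next')]/a"))).click()
            # (a short wait for the table to refresh may be needed here)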
    

Upvotes: 0
