Creating POST request to scrape website with python where no network form data changes

Question

I am scraping a website that renders dynamically with JavaScript. The URL doesn't change when I click the > button, so I have been looking at the Network tab of the browser inspector: the "Request URL" and "Request Method" in the "General" section, and the "Form Data" section, hoping to find some ID that distinguishes each successive page. However, when I record a log while clicking the > button from page to page, the "Form Data" appears to be the same every time (see images).

Currently my code doesn't use this approach, because I can't see it helping until I find a unique identifier in the "Form Data" section. I can show my code if that helps. In essence, it pulls the first page of data over and over in my while loop, even though I'm using a Selenium driver and calling driver.find_elements_by_xpath("xpath of > button").click() before trying to get the data with BeautifulSoup.

(Updated code, see comments)

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd

masters_list = []


def extract_info(html_source):
    # html_source is the inner HTML of the table
    soup = BeautifulSoup(html_source, 'html.parser')
    # note: [0] keeps only the first row of each page
    lst = soup.find('tbody').find_all('tr')[0]
    masters_list.append(lst)


chrome_driver_path = '/Users/Justin/Desktop/Python/chromedriver'
driver = webdriver.Chrome(executable_path=chrome_driver_path)
url = 'https://cryptoli.st/lists/fixed-supply'
driver.get(url)
loop = True

while loop:  # loop for extracting all 120 pages
    crypto_table = driver.find_element(By.ID, 'DataTables_Table_0').get_attribute(
        'innerHTML')  # this is for crypto data table

    extract_info(crypto_table)

    paginate = driver.find_element(
        By.ID, "DataTables_Table_0_paginate")  # all table pagination
    pages_list = paginate.find_elements(By.TAG_NAME, 'li')
    # click the "next" arrow (the last item), not the numbered 2, 3, ... links
    next_page_link = pages_list[-1].find_element(By.TAG_NAME, 'a')

    # check whether a next page is available
    if "disabled" in next_page_link.get_attribute('class'):
        loop = False

    pages_list[-1].click()  # go to the next page, if there is one
df = pd.DataFrame(masters_list)
print(df)
df.to_csv("crypto_list.csv")
driver.quit()

gaurav · Accepted Answer

I am using my own code to show how I get the table; I added explanations as comments on the important lines.

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

def extract_info(html_source):
    # html_source is the inner HTML of the table
    soup = BeautifulSoup(html_source, 'html.parser')
    lst = soup.find('tbody').find_all('tr')
    for i in lst:
        # print only the row id, which is set to the crypto name;
        # more scraping is needed to get further details
        print(i.get('id'))



driver = webdriver.Chrome()
url = 'https://cryptoli.st/lists/fixed-supply'
driver.get(url)
loop = True

while loop:  # loop for extracting all 120 pages
    # inner HTML of the crypto data table
    crypto_table = driver.find_element(By.ID, 'DataTables_Table_0').get_attribute('innerHTML')

    extract_info(crypto_table)  # prints the row ids itself and returns nothing

    paginate = driver.find_element(By.ID, "DataTables_Table_0_paginate")  # table pagination
    pages_list = paginate.find_elements(By.TAG_NAME, 'li')
    # the last item is the "next" arrow, not the numbered 2, 3, ... links
    next_page_link = pages_list[-1].find_element(By.TAG_NAME, 'a')

    if "disabled" in next_page_link.get_attribute('class'):  # no next page left
        loop = False
    else:
        pages_list[-1].click()  # otherwise go to the next page

driver.quit()

So the main answer to your question is this: when you click the button, Selenium works against the updated page, and you can then use driver.page_source to get the updated HTML. Sometimes (not on this URL) a page loads its data through an AJAX request that takes some time, so you have to wait until the page has finished loading before you read it.
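
For pages that do load their rows through a slow AJAX request, the waiting can be done with Selenium's explicit waits. The following is only a sketch under a few assumptions: the helper names is_disabled and scrape_all_pages are made up for illustration, the 10-second timeout is arbitrary, and it assumes the table swaps out its tr elements when the page changes, so the old first row goes stale. Because it is not certain whether the "disabled" class sits on the li or on the a inside it, the sketch checks both.

```python
def is_disabled(class_attr):
    """Return True if 'disabled' is one of the CSS classes in class_attr."""
    # Split into tokens so that a class such as "undisabled" does not match.
    return "disabled" in (class_attr or "").split()


def scrape_all_pages(url, timeout=10):
    """Collect the id of every table row across all pages (hypothetical helper)."""
    # Selenium is imported here so is_disabled() can be reused without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, timeout)
    row_ids = []
    try:
        driver.get(url)
        while True:
            # Block until at least one body row is present in the table.
            rows = wait.until(EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, "#DataTables_Table_0 tbody tr")))
            row_ids.extend(row.get_attribute("id") for row in rows)

            # The last pagination item is the "next" arrow.
            next_item = driver.find_elements(
                By.CSS_SELECTOR, "#DataTables_Table_0_paginate li")[-1]
            next_link = next_item.find_element(By.TAG_NAME, "a")
            if (is_disabled(next_item.get_attribute("class"))
                    or is_disabled(next_link.get_attribute("class"))):
                break  # no further page

            first_row = rows[0]
            next_link.click()
            # Assumption: the old rows leave the DOM once the new page is drawn,
            # so wait for the previous first row to become stale.
            wait.until(EC.staleness_of(first_row))
    finally:
        driver.quit()
    return row_ids
```

Calling scrape_all_pages('https://cryptoli.st/lists/fixed-supply') would open Chrome and walk through the pages. If the site reuses the same row elements between pages, the staleness wait would time out and a different condition would be needed.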
