Reputation: 483
I am scraping a website that dynamically renders with javascript. The urls don't change when hitting the >
button So I have been trying to look at the inspector in the network section and more specifically the "General" section for the "Request Url" and the "Request Method" as well as in the "Form Data" section looking for any sort of ID that could be unique to distinguish each successive page. However when recording a log of clicking the >
button from page to page the "Form Data" data seems to be the same each time (See images):
Currently my code doesn't incorporate this method because I can't see it helping until I can find a unique identifier in the "Form Data" section. However, I can show my code if helpful. In essence it just pulls the first page of data over and over again in my while loop even though I'm using a driver with selenium and using driver.find_elements_by_xpath("xpath of > button").click()
before trying to get the data with BeautifulSoup.
(Updated code see comments)
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd
from pandas import *
masters_list = []
def extract_info(html_source):
# html_source will be inner HTMl of table
global lst
soup = BeautifulSoup(html_source, 'html.parser')
lst = soup.find('tbody').find_all('tr')[0]
masters_list.append(lst)
# i am printing just id because it's id set as crypto name you have to do more scraping to get more info
chrome_driver_path = '/Users/Justin/Desktop/Python/chromedriver'
driver = webdriver.Chrome(executable_path=chrome_driver_path)
url = 'https://cryptoli.st/lists/fixed-supply'
driver.get(url)
loop = True
while loop: # loop for extrcting all 120 pages
crypto_table = driver.find_element(By.ID, 'DataTables_Table_0').get_attribute(
'innerHTML') # this is for crypto data table
extract_info(crypto_table)
paginate = driver.find_element(
By.ID, "DataTables_Table_0_paginate") # all table pagination
pages_list = paginate.find_elements(By.TAG_NAME, 'li')
# we clicking on next arrow sign at last not on 2,3,.. etc anchor link
next_page_link = pages_list[-1].find_element(By.TAG_NAME, 'a')
# checking is there next page available
if "disabled" in next_page_link.get_attribute('class'):
loop = False
pages_list[-1].click() # if there next page available then click on it
df = pd.DataFrame(masters_list)
print(df)
df.to_csv("crypto_list.csv")
driver.quit()
Upvotes: 0
Views: 592
Reputation: 1337
I am using my own code to show how i am getting the table i add explanation as comment for important line
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
def extract_info(html_source):
soup = BeautifulSoup(html_source,'html.parser') #html_source will be inner HTMl of table
lst = soup.find('tbody').find_all('tr')
for i in lst:
print(i.get('id')) # i am printing just id because it's id set as crypto name you have to do more scraping to get more info
driver = webdriver.Chrome()
url = 'https://cryptoli.st/lists/fixed-supply'
driver.get(url)
loop = True
while loop: #loop for extrcting all 120 pages
crypto_table = driver.find_element(By.ID,'DataTables_Table_0').get_attribute('innerHTML') # this is for crypto data table
print(extract_info(crypto_table))
paginate = driver.find_element(By.ID, "DataTables_Table_0_paginate") # all table pagination
pages_list = paginate.find_elements(By.TAG_NAME,'li')
next_page_link = pages_list[-1].find_element(By.TAG_NAME,'a') # we clicking on next arrow sign at last not on 2,3,.. etc anchor link
if "disabled" in next_page_link.get_attribute('class'): # checking is there next page available
loop = False
pages_list[-1].click() # if there next page available then click on it
so main answer of your question is when you click on button, selenium update the page then you can use driver.page_source
to get updated html. some times (*not this url) page can have ajax request which can take some time so you have to wait till the selenium load the full page.
Upvotes: 1