Reputation: 41
I am trying to scrape part of a website so I can transfer the data to Excel for easier manipulation.
The website is this link.
My code works fine for the first page of data but, as you can see, the list spans several pages, and to access those pages, &page=#number of page
needs to be added to the address. I was thinking I could iterate my code and append the results to the pandas DataFrame. However, I can't figure out how to detect the last page.
Is this how it is usually done when data is split across several pages? Thanks for your help.
import requests
import pandas as pd
from bs4 import BeautifulSoup

pd.set_option('display.max_colwidth', -1)
pd.options.display.float_format = '{:,.2f}'.format

url = "https://www.boursorama.com/bourse/produits-de-bourse/levier/warrants/resultats?\
warrant_filter%5Bnature%5D=1&\
warrant_filter%5BunderlyingType%5D=&\
warrant_filter%5BunderlyingName%5D=TESLA&\
warrant_filter%5Bmaturity%5D=0&\
warrant_filter%5BdeltaMin%5D=&\
warrant_filter%5BdeltaMax%5D=&\
warrant_filter%5Bissuer%5D=&\
warrant_filter%5Bsearch%5D="

def parse_html_table(table):
    n_columns = 0
    n_rows = 0
    column_names = []

    # Find number of rows and columns
    # we also find the column titles if we can
    for row in table.find_all('tr'):
        # Determine the number of rows in the table
        td_tags = row.find_all('td')
        if len(td_tags) > 0:
            n_rows += 1
            if n_columns == 0:
                # Set the number of columns for our table
                n_columns = len(td_tags)

        # Handle column names if we find them
        th_tags = row.find_all('th')
        if len(th_tags) > 0 and len(column_names) == 0:
            for th in th_tags:
                column_names.append(th.get_text())

    # Safeguard on Column Titles
    if len(column_names) > 0 and len(column_names) != n_columns:
        raise Exception("Column titles do not match the number of columns")

    columns = column_names if len(column_names) > 0 else range(0, n_columns)
    df = pd.DataFrame(columns=columns,
                      index=range(0, n_rows))

    row_marker = 0
    for row in table.find_all('tr'):
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            df.iat[row_marker, column_marker] = column.get_text()
            column_marker += 1
        if len(columns) > 0:
            row_marker += 1

    # Convert to float if possible
    for col in df:
        try:
            df[col] = df[col].astype(float)
        except ValueError:
            pass

    return df

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
#import pdb; pdb.set_trace()
table = soup.find_all('table')[0]
df = parse_html_table(table)
df = df.replace({'\n': ''}, regex=True)
Upvotes: 0
Views: 89
Reputation: 35205
Normally I would detect the last page first and then fetch every page, but this site didn't let me retrieve the last-page number, so the page range below has to be adjusted once you have checked the last page yourself. With pandas.read_html the table extraction itself is very easy.
import pandas as pd
import requests

url = "https://www.boursorama.com/bourse/produits-de-bourse/levier/warrants/resultats?\
warrant_filter%5Bnature%5D=1&\
warrant_filter%5BunderlyingType%5D=&\
warrant_filter%5BunderlyingName%5D=TESLA&\
warrant_filter%5Bmaturity%5D=0&\
warrant_filter%5BdeltaMin%5D=&\
warrant_filter%5BdeltaMax%5D=&\
warrant_filter%5Bissuer%5D=&\
warrant_filter%5Bsearch%5D="

frames = []
# adjust the upper bound once you know the last page number
for i in range(1, 20):
    # the page number goes in as an extra query-string parameter
    r = requests.get(url + '&page={}'.format(i))
    df_list = pd.read_html(r.text)  # parses every <table> on the page
    df = df_list[0]                 # the results table is the first one
    frames.append(df)

res = pd.concat(frames, ignore_index=True)
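Since the end goal is Excel, you can then write the combined frame out directly; a minimal sketch, assuming openpyxl is installed and using warrants.xlsx as an example filename:

# export the concatenated table to an Excel file (requires openpyxl)
res.to_excel('warrants.xlsx', index=False)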
Upvotes: 1
Reputation: 1329
Why don't you get the last pagination link (either >> or, in your example URL, 8) and extract the final page number from its href attribute? Like that:
# the last pagination link carries the highest page number in its href
pagination_links = soup.find_all("a", {"class": "c-pagination__link"})
last_page = pagination_links[-1]['href'].split('page=')[-1]
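Combining this with a page loop gives a complete sketch; it assumes the same filtered URL from the question and that the href of the last pagination link really ends in the highest page number:

import pandas as pd
import requests
from bs4 import BeautifulSoup

# filtered results URL from the question
url = "https://www.boursorama.com/bourse/produits-de-bourse/levier/warrants/resultats?\
warrant_filter%5Bnature%5D=1&\
warrant_filter%5BunderlyingType%5D=&\
warrant_filter%5BunderlyingName%5D=TESLA&\
warrant_filter%5Bmaturity%5D=0&\
warrant_filter%5BdeltaMin%5D=&\
warrant_filter%5BdeltaMax%5D=&\
warrant_filter%5Bissuer%5D=&\
warrant_filter%5Bsearch%5D="

# read the first page once to find the highest page number
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
pagination_links = soup.find_all("a", {"class": "c-pagination__link"})
last_page = int(pagination_links[-1]['href'].split('page=')[-1])

# fetch every page and stack the tables
frames = []
for i in range(1, last_page + 1):
    r = requests.get(url + '&page={}'.format(i))
    frames.append(pd.read_html(r.text)[0])

res = pd.concat(frames, ignore_index=True)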
Upvotes: 1