nomore

Reputation: 41

Scraping with BeautifulSoup when data is split across pages

I am trying to scrape part of a website so I can transfer the data to Excel for easier manipulation.

The website is this link

My code works fine for the first page of data, but as you can see the list spans several pages, and to access those pages &page=#number has to be appended to the address. I was thinking I could iterate my code and append the results from each page to the pandas DataFrame. However, I can't figure out how to detect the last page.

Is this the right way to do it when the data is split across several pages? Thanks for your help.

import requests
import pandas as pd
from bs4 import BeautifulSoup

pd.set_option('display.max_colwidth', None)  # None means unlimited; -1 is deprecated in newer pandas
pd.options.display.float_format = '{:,.2f}'.format

url = "https://www.boursorama.com/bourse/produits-de-bourse/levier/warrants/resultats?\
warrant_filter%5Bnature%5D=1&\
warrant_filter%5BunderlyingType%5D=&\
warrant_filter%5BunderlyingName%5D=TESLA&\
warrant_filter%5Bmaturity%5D=0&\
warrant_filter%5BdeltaMin%5D=&\
warrant_filter%5BdeltaMax%5D=&\
warrant_filter%5Bissuer%5D=&\
warrant_filter%5Bsearch%5D="

def parse_html_table(table):
    n_columns = 0
    n_rows = 0
    column_names = []

    # Find number of rows and columns
    # we also find the column titles if we can
    for row in table.find_all('tr'):

        # Determine the number of rows in the table
        td_tags = row.find_all('td')
        if len(td_tags) > 0:
            n_rows += 1
            if n_columns == 0:
                # Set the number of columns for our table
                n_columns = len(td_tags)

        # Handle column names if we find them
        th_tags = row.find_all('th')
        if len(th_tags) > 0 and len(column_names) == 0:
            for th in th_tags:
                column_names.append(th.get_text())

    # Safeguard on column titles
    if len(column_names) > 0 and len(column_names) != n_columns:
        raise Exception("Column titles do not match the number of columns")

    columns = column_names if len(column_names) > 0 else range(0, n_columns)
    df = pd.DataFrame(columns=columns, index=range(0, n_rows))
    row_marker = 0
    for row in table.find_all('tr'):
        column_marker = 0
        columns = row.find_all('td')
        for column in columns:
            df.iat[row_marker, column_marker] = column.get_text()
            column_marker += 1
        if len(columns) > 0:
            row_marker += 1

    # Convert to float if possible
    for col in df:
        try:
            df[col] = df[col].astype(float)
        except ValueError:
            pass

    return df


response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
#import pdb; pdb.set_trace()

table = soup.find_all('table')[0]
df = parse_html_table(table)
df = df.replace({'\n': ''}, regex=True)

Upvotes: 0

Views: 89

Answers (2)

r-beginners

Reputation: 35205

Normally I would detect the last page first and then fetch all the pages, but this site didn't let me get the last page, so check the last page yourself and adjust the range below. pandas.read_html makes the parsing very easy.

import pandas as pd
import requests

url = "https://www.boursorama.com/bourse/produits-de-bourse/levier/warrants/resultats?\
warrant_filter%5Bnature%5D=1&\
warrant_filter%5BunderlyingType%5D=&\
warrant_filter%5BunderlyingName%5D=TESLA&\
warrant_filter%5Bmaturity%5D=0&\
warrant_filter%5BdeltaMin%5D=&\
warrant_filter%5BdeltaMax%5D=&\
warrant_filter%5Bissuer%5D=&\
warrant_filter%5Bsearch%5D="

frames = []
for i in range(1, 20):
    # the question notes that &page=<number> must be appended to the address
    r = requests.get(url + '&page={}'.format(i))
    df_list = pd.read_html(r.text)
    frames.append(df_list[0])
res = pd.concat(frames, ignore_index=True)
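
Since the goal in the question is to move the data into Excel, the combined frame can then be written out directly. A minimal follow-up, continuing from res above; the filename is just an example, and pandas needs an Excel writer such as openpyxl installed:

res.to_excel('warrants_tesla.xlsx', index=False)  # example filename, not from the question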

Upvotes: 1

ilyankou

Reputation: 1329

Why don't you get the last pagination link (either >>, or in your example URL, 8) and extract the final page number from its href attribute? Like this:

# grab all pagination links and read the page number off the last one's href
pagination_links = soup.find_all("a", {"class": "c-pagination__link"})
last_page = pagination_links[-1]['href'].split('page=')[-1]
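
Putting the two answers together, here is a minimal sketch of the full loop. Assumptions: the base url from the question, lxml installed for BeautifulSoup and pandas.read_html, and that the pagination links are present on the first page:

import pandas as pd
import requests
from bs4 import BeautifulSoup

# base query from the question (pagination is added with &page=<n>)
url = (
    "https://www.boursorama.com/bourse/produits-de-bourse/levier/warrants/resultats?"
    "warrant_filter%5Bnature%5D=1&warrant_filter%5BunderlyingType%5D=&"
    "warrant_filter%5BunderlyingName%5D=TESLA&warrant_filter%5Bmaturity%5D=0&"
    "warrant_filter%5BdeltaMin%5D=&warrant_filter%5BdeltaMax%5D=&"
    "warrant_filter%5Bissuer%5D=&warrant_filter%5Bsearch%5D="
)

# read the final page number from the last pagination link on the first page
soup = BeautifulSoup(requests.get(url).text, 'lxml')
pagination_links = soup.find_all("a", {"class": "c-pagination__link"})
last_page = int(pagination_links[-1]['href'].split('page=')[-1])

# fetch every page and concatenate the first table of each one
frames = []
for page in range(1, last_page + 1):
    r = requests.get(url + '&page={}'.format(page))
    frames.append(pd.read_html(r.text)[0])
res = pd.concat(frames, ignore_index=True)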

Upvotes: 1
