BD12

Reputation: 107

Web scraping with BeautifulSoup - table body content is not returned when finding the table

I am trying to scrape a website for a table but only the header is being returned.

I am new to Python and web scraping, and followed this very helpful tutorial: https://medium.com/analytics-vidhya/how-to-scrape-a-table-from-website-using-python-ce90d0cfb607.

However, the following code only returns the header and not the body of the table.

import requests
from bs4 import BeautifulSoup

# Create a URL object
url = 'https://www.dividendmax.com/dividends/declared'

# Request the page (a User-Agent header so the request looks like a browser)
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.text, "html.parser")

# Obtain information from tag <table> (find_all returns a list of matches)
table1 = soup.find_all('table')
table1

Output:

[<table aria-label="Declared Dividends" class="mdc-data-table__table">
 <thead>
 <tr class="mdc-data-table__header-row">
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Company</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Ticker</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Country</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Exchange</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Share Price</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Prev. Dividend</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Next Dividend</th>
 <th class="mdc-data-table__header-cell" role="columnheader" scope="col">Next Ex-date</th>
 </tr>
 </thead>
 <tbody></tbody>
 </table>]

I need to retrieve the tbody content, but as the output shows, the `<tbody>` element comes back empty (in the browser the rows are there when the element is expanded in the inspector).

Just as an FYI, the following code will then be used to create the dataframe.

import pandas as pd

# find_all returned a list, so take the first (and only) table
table = table1[0]

# Obtain every column title from the <th> tags
headers = []
for i in table.find_all('th'):
    headers.append(i.text)

# Create a dataframe
mydata = pd.DataFrame(columns=headers)

# Fill mydata one row at a time, skipping the header row
for j in table.find_all('tr')[1:]:
    row = [i.text for i in j.find_all('td')]
    mydata.loc[len(mydata)] = row
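As a sanity check, the header/row loop above does work once `<tbody>` is populated. A minimal sketch against a static HTML snippet (the snippet and its values are made up here for illustration):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table with a populated <tbody>, mimicking the real page's structure
html = """
<table>
  <thead><tr><th>Company</th><th>Ticker</th></tr></thead>
  <tbody>
    <tr><td>3i Group plc</td><td>III</td></tr>
    <tr><td>Yougov</td><td>YOU</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table1 = soup.find("table")

# Same pattern as the question: headers from <th>, rows from <td>
headers = [th.text for th in table1.find_all("th")]
mydata = pd.DataFrame(columns=headers)
for tr in table1.find_all("tr")[1:]:
    mydata.loc[len(mydata)] = [td.text for td in tr.find_all("td")]

print(mydata)
```

This confirms the loop itself is fine; the problem is that the live page's `<tbody>` is empty in the raw HTML.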

Upvotes: 0

Views: 250

Answers (2)

tbrk

Reputation: 283

The website you're scraping populates the table from an API. Your request returns the page's HTML skeleton before the table data has been filled in, which is why `<tbody>` is empty.

If you open your browser's developer tools, go to the Network tab, and watch for Fetch/XHR requests, you should see a request to https://www.dividendmax.com/dividends/declared.json?region=1. Previewing that response shows it contains the data you want, and you can query it directly:

import requests

page = requests.get("https://www.dividendmax.com/dividends/declared.json?region=1")
page.json()
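If the JSON contains nested objects, `pd.json_normalize` flattens them into dotted column names. A minimal sketch using a hypothetical sample (the real field names from declared.json are not shown in this thread):

```python
import pandas as pd

# Hypothetical records shaped like a typical JSON API response;
# the field names here are illustrative, not taken from the real endpoint
sample = [
    {"name": "3i Group plc", "ticker": "III", "country": {"code": "GB"}},
    {"name": "Yougov", "ticker": "YOU", "country": {"code": "GB"}},
]

# Nested dicts become dotted columns, e.g. 'country.code'
df = pd.json_normalize(sample)
print(df.columns.tolist())
```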

Upvotes: 0

chitown88

Reputation: 28565

The page you are after is not the same as the one in the tutorial, so it's probably not the best site if you're trying to learn/practice with BeautifulSoup. The data, however, comes back to me in a nice JSON format:

import requests
import pandas as pd

# Create an URL object
url = 'https://www.dividendmax.com/dividends/declared'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}

jsonData = requests.get(url, headers=headers).json()
df = pd.DataFrame(jsonData)

Output:

print(df)
                                            name  ...                 ind
0                                   3i Group plc  ...  [22, 25, 23, 3, 5]
1                          3I Infrastructure Plc  ...              [4, 5]
2                                AB Dynamics plc  ...                  []
3    Aberdeen Smaller Companies Income Trust plc  ...                  []
4      Aberdeen Standard Equity Income Trust plc  ...                  []
..                                           ...  ...                 ...
146                              Workspace Group  ...      [25, 4, 24, 5]
147                          Wynnstay Properties  ...                  []
148                                 XP Power Ltd  ...              [5, 4]
149                           Yew Grove REIT Plc  ...                  []
150                                       Yougov  ...                  []

[151 rows x 11 columns]
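One wrinkle in the output above is the list-valued `ind` column. If you need one row per list element, `DataFrame.explode` handles that; a small sketch with illustrative values (only `name` and `ind` are taken from the output, the values are made up):

```python
import pandas as pd

# Tiny frame shaped like the output above: 'name' plus a list-valued 'ind'
df = pd.DataFrame({
    "name": ["3i Group plc", "AB Dynamics plc"],
    "ind": [[22, 25], []],
})

# explode() emits one row per list element; empty lists become a NaN row
long = df.explode("ind")
print(long)
```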

Upvotes: 2
