Reputation: 107
I am trying to scrape a website for a table but only the header is being returned.
I am new to Python and web scraping and have been following this tutorial, which was very helpful: https://medium.com/analytics-vidhya/how-to-scrape-a-table-from-website-using-python-ce90d0cfb607.
However, the following code only returns the header and not the body of the table.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
# Create a URL object
url = 'https://www.dividendmax.com/dividends/declared'
# Request the page with a browser-like User-Agent
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
# Obtain information from tag <table>
table1 = soup.find_all('table')
table1
Output:
[<table aria-label="Declared Dividends" class="mdc-data-table__table">
<thead>
<tr class="mdc-data-table__header-row">
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Company</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Ticker</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Country</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Exchange</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Share Price</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Prev. Dividend</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Next Dividend</th>
<th class="mdc-data-table__header-cell" role="columnheader" scope="col">Next Ex-date</th>
</tr>
</thead>
<tbody></tbody>
</table>]
I need to retrieve the tbody content (found when expanding the penultimate row of output).
Just as an FYI, the following code will be used to create the dataframe.
import pandas as pd
# find_all returns a list of tables, so take the first one
table = table1[0]
# Obtain every title of columns with tag <th>
headers = []
for i in table.find_all('th'):
    title = i.text
    headers.append(title)
# Create a dataframe
mydata = pd.DataFrame(columns=headers)
# Create a for loop to fill mydata
for j in table.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row
Upvotes: 0
Views: 250
Reputation: 283
The website you're scraping has an API that supplies the data for the table. Your request returns only the HTML skeleton of the page, before the table has been populated.
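One way to confirm this is to check whether the parsed table contains any data cells at all. The snippet below is a minimal sketch: `SKELETON` is a hand-written stand-in shaped like the output in the question, and `table_is_populated` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML shaped like the skeleton in the question:
# the header row is present, but <tbody> has no rows.
SKELETON = """
<table aria-label="Declared Dividends">
  <thead><tr><th>Company</th><th>Ticker</th></tr></thead>
  <tbody></tbody>
</table>
"""


def table_is_populated(html: str) -> bool:
    """Return True if the first <table> contains at least one <td> cell."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return False
    return table.find("td") is not None


print(table_is_populated(SKELETON))  # False
```

If this returns False for the HTML you downloaded, the rows are most likely being loaded separately and are not present in the page source.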
If you open your browser's developer tools and go to the Network tab, watch for Fetch/XHR requests. You should see a request to: https://www.dividendmax.com/dividends/declared.json?region=1 If you preview the response to that request, you'll see it contains the data you want. You can query that data directly by sending a request to that URL:
import requests
page = requests.get("https://www.dividendmax.com/dividends/declared.json?region=1")
page.json()
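The same request can also be made with only the standard library, reusing the `User-Agent` header from the question. This is a rough sketch under assumptions: the response is assumed to be a JSON array of objects, `fetch_records` and `pick_fields` are hypothetical helper names, and the sample record is made up so the example runs offline.

```python
import json
from urllib.request import Request, urlopen

# Endpoint taken from this answer. The response shape (a JSON array of
# objects) is an assumption, not something verified here.
URL = "https://www.dividendmax.com/dividends/declared.json?region=1"


def fetch_records(url: str = URL, timeout: float = 10.0) -> list:
    """Download the JSON payload and return it as Python objects."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def pick_fields(records: list, fields: list) -> list:
    """Keep only the requested keys from each record (missing keys become None)."""
    return [{f: rec.get(f) for f in fields} for rec in records]


# Offline demonstration with a made-up record; "ticker" and "extra" are
# hypothetical keys used purely for illustration.
sample = json.loads('[{"name": "Example plc", "ticker": "EXM", "extra": 1}]')
print(pick_fields(sample, ["name", "ticker"]))
```

Calling `fetch_records()` performs the live request; `pick_fields` then trims each record down to the keys you care about before building a table.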
Upvotes: 0
Reputation: 28565
The page you are after does not work the same way as the one in the tutorial, so it's probably not the best site if you're trying to learn or practice with BeautifulSoup. For me, though, the data comes back in a nice JSON format.
import requests
import pandas as pd
# JSON endpoint that populates the table
url = 'https://www.dividendmax.com/dividends/declared.json?region=1'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
jsonData = requests.get(url, headers=headers).json()
df = pd.DataFrame(jsonData)
Output:
print(df)
                                            name  ...                 ind
0                                   3i Group plc  ...  [22, 25, 23, 3, 5]
1                          3I Infrastructure Plc  ...              [4, 5]
2                                AB Dynamics plc  ...                  []
3    Aberdeen Smaller Companies Income Trust plc  ...                  []
4      Aberdeen Standard Equity Income Trust plc  ...                  []
..                                           ...  ...                 ...
146                              Workspace Group  ...      [25, 4, 24, 5]
147                          Wynnstay Properties  ...                  []
148                                 XP Power Ltd  ...              [5, 4]
149                           Yew Grove REIT Plc  ...                  []
150                                       Yougov  ...                  []

[151 rows x 11 columns]
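The printout above hides most of the 11 columns behind "...". A small sketch of how to display every column, assuming the payload is a list of records: the two records here are made up, and only the `name` and `ind` keys come from the output shown.

```python
import pandas as pd

# Made-up records standing in for the JSON payload. Only the "name" and
# "ind" keys appear in the output above; the values are illustrative.
records = [
    {"name": "Example plc", "ind": [4, 5]},
    {"name": "Sample Group", "ind": []},
]

df = pd.DataFrame(records)

# Temporarily lift the column and width limits so no column is elided
with pd.option_context("display.max_columns", None, "display.width", None):
    print(df)

# List the column names without printing the data
print(list(df.columns))
```

Using `pd.option_context` keeps the display change local to the `with` block, so the default truncation returns afterwards.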
Upvotes: 2