Reputation: 11
I'm trying to scrape the table called "Fuel Mix Graph" on https://www.iso-ne.com/isoexpress/ with BeautifulSoup. I can locate the table shown below in the parsed HTML, but when I read the contents of tbody, it comes back empty.
Here is my code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
pullPage = 'https://www.iso-ne.com/isoexpress/'
# Query the website and assign the response to page
page = urlopen(pullPage)
# Parse the HTML into soup
soup = BeautifulSoup(page, 'html.parser')
# Find the portlet <div> by id, then the table inside it
fuelMix = soup.find('div', id='p_p_id_fuelmixgraphportlet_WAR_isoneportlet_INSTANCE_ZXnKx0ygssKj_')
fuelMixData = fuelMix.find('table', id='_fuelmixgraphportlet_WAR_isoneportlet_INSTANCE_ZXnKx0ygssKj_table')
tbody = fuelMixData.find_all('tbody')
#for row in rows:
# data = row.find_all('td')
#FMData.append(str(row.find_all('tr')[0].text))
print(tbody)
and here is the relevant section of the HTML:
<table id="_fuelmixgraphportlet_WAR_isoneportlet_INSTANCE_ZXnKx0ygssKj_table" align="left">
<thead>
<tr>
<th style="text-align:left;">Date/Time</th>
<th style="text-align:left;">Fuel</th>
<th>MW</th> </tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">06/02/2019 00:01</td>
<td style="text-align:left;">NaturalGas</td>
<td>2581</td>
</tr>
<tr>
<td style="text-align:left;">06/02/2019 00:01</td>
<td style="text-align:left;">Nuclear</td>
<td>3339</td>
</tr>
</tbody>
</table>
For now, my expected results are to simply print all of the data in tbody. Eventually I will read 'tr' and 'td' to create arrays of the data (any ideas as to how to clean up the other strings that are not the date/time, fuel type, and value would be appreciated as well!)
When I run the current code, it will only return
[<tbody></tbody>]
If I find_all('tr'), it only returns the values from thead:
[<tr> <th style="text-align:left;">Date/Time</th> <th style="text-align:left;">Fuel</th> <th>MW</th> </tr>]
And if I find_all('td'), an empty array is returned.
Thank you for your help in advance.
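For the eventual 'tr'/'td' step, here is a minimal sketch of reading rows into a list. It assumes the rows are actually present in the parsed HTML, as in the snippet above, which is not the case for the live page. The inline `html` string is a trimmed copy of that snippet, and the three-cell check is one simple way to skip anything that is not a date/time, fuel, MW row.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question, used here as static input.
html = """
<table>
<thead><tr><th>Date/Time</th><th>Fuel</th><th>MW</th></tr></thead>
<tbody>
<tr><td>06/02/2019 00:01</td><td>NaturalGas</td><td>2581</td></tr>
<tr><td>06/02/2019 00:01</td><td>Nuclear</td><td>3339</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.find('tbody').find_all('tr'):
    # get_text(strip=True) drops surrounding whitespace and newlines
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if len(cells) == 3:  # keep only date/time, fuel, MW rows
        date_time, fuel, mw = cells
        rows.append((date_time, fuel, int(mw)))

print(rows)
```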
Upvotes: 1
Views: 1018
Reputation: 84465
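The tbody is empty because the rows are filled in by JavaScript after the page loads, so they are not in the HTML that urlopen returns. Mimic the POST request the page makes and you get all of that data in JSON format: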
import requests
import time

# Form fields sent with the POST request
params = {
    '_nstmp_formDate': int(time.time()),
    '_nstmp_startDate': '06/02/2019',
    '_nstmp_endDate': '06/02/2019',
    '_nstmp_twodays': 'false',
    '_nstmp_chartTitle': 'Fuel Mix Graph',
    '_nstmp_requestType': 'genfuelmix',
    '_nstmp_fuelType': 'all',
    '_nstmp_height': 250,
    '_nstmp_showtwodays': 'false'
}
# Send the form and decode the JSON response
r = requests.post('https://www.iso-ne.com/ws/wsclient', data=params).json()
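The dates in the request are hard-coded. As a rough sketch, the form fields could be built for any day with a small helper. `build_params` is a hypothetical name, and the code assumes the endpoint expects MM/DD/YYYY, which is only inferred from the '06/02/2019' value above.

```python
import time
from datetime import date


def build_params(day):
    """Build the form fields for one day of fuel-mix data (hypothetical helper)."""
    # Assumption: MM/DD/YYYY, matching the shape of '06/02/2019' in the answer
    day_str = day.strftime('%m/%d/%Y')
    return {
        '_nstmp_formDate': int(time.time()),
        '_nstmp_startDate': day_str,
        '_nstmp_endDate': day_str,
        '_nstmp_twodays': 'false',
        '_nstmp_chartTitle': 'Fuel Mix Graph',
        '_nstmp_requestType': 'genfuelmix',
        '_nstmp_fuelType': 'all',
        '_nstmp_height': 250,
        '_nstmp_showtwodays': 'false',
    }


params = build_params(date(2019, 6, 2))
print(params['_nstmp_startDate'])
```

The resulting dict can be passed as `data=params` to the same `requests.post` call.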
Writing out to a pandas DataFrame, for example:
import requests
import time
import pandas as pd

# Form fields sent with the POST request
params = {
    '_nstmp_formDate': int(time.time()),
    '_nstmp_startDate': '06/02/2019',
    '_nstmp_endDate': '06/02/2019',
    '_nstmp_twodays': 'false',
    '_nstmp_chartTitle': 'Fuel Mix Graph',
    '_nstmp_requestType': 'genfuelmix',
    '_nstmp_fuelType': 'all',
    '_nstmp_height': 250,
    '_nstmp_showtwodays': 'false'
}
r = requests.post('https://www.iso-ne.com/ws/wsclient', data=params).json()

result = []
headers = ['NaturalGas', 'Wind', 'Nuclear', 'Solar', 'Wood', 'Refuse', 'LandfillGas', 'BeginDateMs', 'Renewables', 'BeginDate', 'Hydro', 'Other']
# One output row per item; keys missing from an item become ''
for item in r[0]['data']:
    row = {}
    for header in headers:
        row[header] = item.get(header, '')
    result.append(row)

df = pd.DataFrame(result, columns=headers)
print(df.head())
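If the goal is the same Date/Time, Fuel, MW layout as the table on the page, the records can be flattened into one row per fuel. This is only a sketch: `to_long_rows` is a hypothetical helper, and `sample` is made-up data shaped like the items in `r[0]['data']`, using key names from the `headers` list; the real response may differ.

```python
# Keys that describe the timestamp rather than a fuel (from the headers list above)
TIME_KEYS = {'BeginDate', 'BeginDateMs'}


def to_long_rows(records):
    """Turn one-dict-per-timestamp records into (date, fuel, mw) tuples."""
    rows = []
    for record in records:
        when = record.get('BeginDate', '')
        for key, value in record.items():
            # Skip the timestamp fields and any empty values
            if key in TIME_KEYS or value in ('', None):
                continue
            rows.append((when, key, value))
    return rows


# Made-up sample for illustration only, not real response data
sample = [
    {'BeginDate': '06/02/2019 00:01', 'BeginDateMs': 0,
     'NaturalGas': 2581, 'Nuclear': 3339},
]

long_rows = to_long_rows(sample)
for row in long_rows:
    print(row)
```

With the real response, `to_long_rows(r[0]['data'])` would be the equivalent call, assuming each item is a flat dict as the loop above implies.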
Upvotes: 2