Reputation: 394
I'm trying to get data from this URL into a format suitable for Excel, but I'm stuck. With the code below I've managed to get the data into rows, but for some reason they don't correspond to the row numbers. Can anyone help?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
#--------------------------------------------------------------------------------------------------------------------------------------------------#
url = 'http://rotoguru1.com/cgi-bin/hoopstat-daterange.pl?startdate=20181021&date=20181021&saldate=20181021&g=0&ha=&min=&tmptmin=0&tmptmax=999&opptmin=0&opptmax=999&gmptmin=0&gmptmax=999&gameday=&sd=0'
#--------------------------------------------------------------------------------------------------------------------------------------------------#
page_request = requests.get(url)
soup = BeautifulSoup(page_request.text,'lxml')
data = []
for br in soup.findAll('br')[3:][:-1]:
    data.append(br.nextSibling)
data_df = pd.DataFrame(data)
print(data_df)
print results:
                                                   0
0  4943;Abrines, Alex;0;Abrines, Alex;okc;1;0;5....
1  5709;Adams, Jaylen;0;Adams, Jaylen;atl;1;0;0....
2  4574;Adams, Steven;2991235;Adams, Steven;okc;...
3  5696;Akoon-Purcell, DeVaughn;0;Akoon-Purcell,...
4  4860;Anderson, Justin;0;Anderson, Justin;atl;...
5  3510;Anthony, Carmelo;1975;Anthony, Carmelo;h...
Upvotes: 0
Views: 50
Reputation: 774
I believe the last row of your DataFrame is empty because of your parser: at the last `<br>` in the list it still reads the next sibling, which is just a newline, and appends that empty string to your DataFrame. This would do the trick:
for br in soup.findAll('br')[3:][:-1]:
    contents = br.nextSibling
    if contents != "\n":
        data.append(contents)
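To get the data the rest of the way into an Excel-friendly shape, the semicolon-delimited strings can be split into columns before building the DataFrame. A minimal sketch, using a couple of the rows from your output (the field count is truncated here for illustration, and the filename is just an example):

```python
import pandas as pd

# Sample rows in the same semicolon-delimited format the page returns
# (truncated to a few fields for illustration).
rows = [
    "4943;Abrines, Alex;0;Abrines, Alex;okc;1;0",
    "5709;Adams, Jaylen;0;Adams, Jaylen;atl;1;0",
]

# Split each string on ';' so every field becomes its own column.
data = [r.split(';') for r in rows]
df = pd.DataFrame(data)

# Write to CSV, which Excel opens directly; index=False drops the
# pandas row numbers from the file.
df.to_csv('players.csv', index=False, header=False)
```

With the full scraped list in place of `rows`, each stat lands in its own cell instead of one long string per row.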
Upvotes: 1