Reputation: 251
I've been learning the basics of Python for a short while, and thought I'd go ahead and try to put something together, but appear to have hit a stumbling block (despite looking just about everywhere to see where I may be going wrong).
I'm trying to grab a table, for example the one here: https://www.oddschecker.com/horse-racing/2020-09-10-chelmsford-city/20:30/winner
I realize the table isn't laid out like a typical HTML table, so trying to read it with Pandas wouldn't yield results. I therefore turned to BeautifulSoup to try to get at the data.
It seems all the data I need is in the rows with the class 'diff-row evTabRow bc', so I wrote the following:
import requests
from bs4 import BeautifulSoup
url = requests.get('https://www.oddschecker.com/horse-racing/2020-09-10-haydock/14:00/winner')
soup = BeautifulSoup(url.content, 'lxml')
table = soup.find_all("tr", class_="diff-row evTabRow bc")
This seems to put each horse, along with all the data I'd need for it, into a list. From each entry I only need certain bits, namely "data-name" for the horse name and "data-odig" for the current odds.
I thought there may be some way I could then extract the data from the list to build a list of lists, and then construct a data frame in Pandas, but I may be going about this all wrong.
Upvotes: 5
Views: 435
Reputation: 369
The data you are looking for is both in the row tag <tr> and in the cell tags <td>.
The issue is that not all of the <td>'s are useful, so you have to skip those.
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.oddschecker.com/horse-racing/thirsk/13:00/winner')
soup = BeautifulSoup(url.content, 'lxml')
rows = soup.find_all("tr", class_="diff-row evTabRow bc")

my_data = []
for row in rows:
    horse = row.attrs['data-bname']
    # Iterate over <td> tags only; looping over `row` directly can also
    # yield whitespace text nodes, which have no attributes.
    for td in row.find_all('td'):
        # Keep only cells whose first class is 'np'; skip everything else.
        classes = td.get('class') or []
        if not classes or classes[0] != 'np':
            continue
        my_data.append(dict(
            horse=horse,
            bookie=td['data-bk'],
            odds=td['data-odig'],
        ))

df = pd.DataFrame(my_data)
print(df)
This will give you what you are looking for:
          horse  bookie  odds
0    Just Frank      B3  3.75
1    Just Frank      SK  4.33
2    Just Frank      WH  4.33
3    Just Frank      EE  4.33
4    Just Frank      FB   4.2
..          ...     ...   ...
268     Tommy R      RZ    29
269     Tommy R      SX    26
270     Tommy R      BF  10.8
271     Tommy R      MK    41
272     Tommy R      MA    98

[273 rows x 3 columns]
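The same row-and-cell walk can be tried offline before pointing it at the live site. The sketch below is only an illustration: `SAMPLE_HTML` is made-up markup that imitates the structure described above (the cell class names other than `np` are assumptions), and it uses the built-in `html.parser` so that `lxml` is not required. It also converts the odds to numbers with `pd.to_numeric`, which the original answer leaves as strings.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page may be structured differently.
SAMPLE_HTML = """
<table>
  <tr class="diff-row evTabRow bc" data-bname="Just Frank">
    <td class="sel nm">Just Frank</td>
    <td class="np o" data-bk="B3" data-odig="3.75">11/4</td>
    <td class="np o" data-bk="SK" data-odig="4.33">10/3</td>
  </tr>
  <tr class="diff-row evTabRow bc" data-bname="Tommy R">
    <td class="sel nm">Tommy R</td>
    <td class="np o" data-bk="B3" data-odig="29">28/1</td>
    <td class="np o" data-bk="SK" data-odig="26">25/1</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

records = []
for row in soup.find_all("tr", class_="diff-row evTabRow bc"):
    horse = row.get("data-bname")
    for td in row.find_all("td"):
        # Skip cells that do not carry the 'np' class (here: the name cell).
        if "np" not in td.get("class", []):
            continue
        records.append({
            "horse": horse,
            "bookie": td.get("data-bk"),
            "odds": td.get("data-odig"),
        })

df = pd.DataFrame(records)
# Odds arrive as strings; coerce them to floats (unparseable values become NaN).
df["odds"] = pd.to_numeric(df["odds"], errors="coerce")
print(df)
```

Using `.get()` instead of indexing means a missing attribute yields `None` rather than raising a `KeyError`, which may be more forgiving if the markup varies between rows.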
Upvotes: 1
Reputation: 21264
You can access any of the <tr> attributes with the BeautifulSoup object's .attrs property.
Once you have table, loop over each entry and pull out the attributes you want as a list of dicts. Then initialize a Pandas data frame with the resulting list.
horse_attrs = list()
for entry in table:
    attrs = dict(name=entry.attrs['data-bname'], dig=entry.attrs['data-best-dig'])
    horse_attrs.append(attrs)
df = pd.DataFrame(horse_attrs)
df
                name   dig
0         Las Farras  9999
1         Heat Miami  9999
2        Martin Beck  9999
3             Litran  9999
4      Ritmo Capanga  9999
5      Perfect Score  9999
6   Simplemente Tuyo  9999
7            Anpacai  9999
8          Colt Fast  9999
9         Cacharpari  9999
10        Don Leparc  9999
11   Curioso Seattle  9999
12       Golpe Final  9999
13       El Acosador  9999
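To check which attributes a row actually carries before choosing columns, the `.attrs` dict can be inspected directly. This is a minimal offline sketch on a single made-up row; only `data-bname` and `data-best-dig` come from the answer above, and the markup itself is an assumption.

```python
from bs4 import BeautifulSoup

# One hypothetical row, imitating the attributes used in this answer.
html = (
    '<table><tr class="diff-row evTabRow bc" '
    'data-bname="Litran" data-best-dig="9999"><td>Litran</td></tr></table>'
)
row = BeautifulSoup(html, "html.parser").find("tr")

# .attrs is a plain dict, so its keys list every attribute on the tag.
available = sorted(row.attrs)
print(available)

# .get() returns None for an attribute the tag does not have,
# whereas row.attrs['data-name'] would raise a KeyError.
name = row.get("data-bname")
missing = row.get("data-name")
print(name, missing)
```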
Notes:
I couldn't find the exact attributes (data-name and data-odig) you mentioned, so I used ones with similar names. I don't know enough about horse racing to know if these are useful, but the method in this answer should allow you to choose any of the attributes that are available.
Upvotes: 5
Reputation: 5331
When web scraping, one approach is to extract each value into its own variable and collect the results as a list of dicts:
l = []
for thing in elements:
    var1 = ...  # however you extract it
    var2 = ...
    l.append({'column1_name': var1, 'column2_name': var2})
df = pd.DataFrame(l)
How you select the data out of the HTML element is up to you (consider selecting td?).
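One possible way to select the td cells directly is a CSS selector via `select()`, then walking back up to the parent row for the horse name. The sketch below fills in the template above under assumptions: `SAMPLE_HTML` is made-up markup, the selector `tr.evTabRow td[data-odig]` is a guess at what would match on the real page, and `records` plays the role of `l`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the live page may be structured differently.
SAMPLE_HTML = """
<table>
  <tr class="diff-row evTabRow bc" data-bname="Just Frank">
    <td class="sel nm">Just Frank</td>
    <td class="np o" data-bk="B3" data-odig="3.75">11/4</td>
    <td class="np o" data-bk="SK" data-odig="4.33">10/3</td>
  </tr>
  <tr class="diff-row evTabRow bc" data-bname="Tommy R">
    <td class="sel nm">Tommy R</td>
    <td class="np o" data-bk="B3" data-odig="29">28/1</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

records = []
# Select only <td> cells that have a data-odig attribute, inside matching rows.
for td in soup.select("tr.evTabRow td[data-odig]"):
    row = td.find_parent("tr")  # the enclosing <tr> holds the horse name
    records.append({
        "horse": row["data-bname"],
        "bookie": td["data-bk"],
        "odds": td["data-odig"],
    })

df = pd.DataFrame(records)
print(df)
```

Filtering on the presence of an attribute rather than on a class name avoids a separate skip step, at the cost of depending on that attribute being present on every cell of interest.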
Upvotes: 0