user994319

Reputation: 251

Extracting data from list in Python, after BeautifulSoup scrape, and creating Pandas table

I've been learning the basics of Python for a short while and thought I'd try to put something together, but I appear to have hit a stumbling block (despite looking just about everywhere to see where I may be going wrong).

I'm trying to grab a table, for example the one here: https://www.oddschecker.com/horse-racing/2020-09-10-chelmsford-city/20:30/winner

I realize that the table isn't laid out the way a normal HTML table would be, so trying to read it directly with Pandas wouldn't yield results. I therefore turned to BeautifulSoup to try to get at the data.

It seems all the data I need is in the rows with the class 'diff-row evTabRow bc', so I wrote the following:

import requests
from bs4 import BeautifulSoup

url = requests.get('https://www.oddschecker.com/horse-racing/2020-09-10-haydock/14:00/winner')
soup = BeautifulSoup(url.content, 'lxml')
table = soup.find_all("tr", class_="diff-row evTabRow bc")

This seems to put each horse, along with all the corresponding data I'd need for it, into a list. From each item in this list I only need certain bits, i.e. "data-name" for the horse name and "data-odig" for the current odds.

I thought there might be some way to extract that data into a list of lists and then construct a data frame in Pandas, but I may be going about this all wrong.

Upvotes: 5

Views: 435

Answers (3)

TKK

Reputation: 369

The data you are looking for is split between the row tag <tr> (the horse name) and the cell tags <td> (the bookie and the odds).

The issue is that not all of the <td> cells are useful, so you have to skip the ones you don't need.

import pandas as pd
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.oddschecker.com/horse-racing/thirsk/13:00/winner')
soup = BeautifulSoup(response.content, 'lxml')
rows = soup.find_all("tr", class_="diff-row evTabRow bc")

my_data = []
for row in rows:
    # The horse name is stored as an attribute on the <tr> itself
    horse = row.attrs['data-bname']

    # Only look at direct <td> children; this ignores stray text nodes between cells
    for td in row.find_all("td", recursive=False):
        classes = td.get('class', [])
        # Keep only cells whose first class is 'np'; skip everything else
        if not classes or classes[0] != 'np':
            continue

        my_data.append(dict(
            horse  = horse,
            bookie = td['data-bk'],
            odds   = td['data-odig']
        ))

df = pd.DataFrame(my_data)
print(df)

This will give you what you are looking for:

          horse bookie  odds
0    Just Frank     B3  3.75
1    Just Frank     SK  4.33
2    Just Frank     WH  4.33
3    Just Frank     EE  4.33
4    Just Frank     FB   4.2
..          ...    ...   ...
268     Tommy R     RZ    29
269     Tommy R     SX    26
270     Tommy R     BF  10.8
271     Tommy R     MK    41
272     Tommy R     MA    98

[273 rows x 3 columns]
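If it helps, here is a minimal sketch of what you might do with a frame in that shape. The rows below are made-up sample data (not scraped), and the names wide and best are just ones I picked for illustration. Since HTML attribute values arrive as strings, the odds are converted to numbers first.

```python
import pandas as pd

# Made-up sample rows in the same long format as the output above
# (one row per horse/bookie pair, odds still stored as strings).
my_data = [
    {"horse": "Just Frank", "bookie": "B3", "odds": "3.75"},
    {"horse": "Just Frank", "bookie": "SK", "odds": "4.33"},
    {"horse": "Tommy R", "bookie": "B3", "odds": "29"},
    {"horse": "Tommy R", "bookie": "SK", "odds": "26"},
]
df = pd.DataFrame(my_data)

# Attribute values are strings, so convert them before sorting or comparing.
# errors="coerce" turns anything unparseable into NaN instead of raising.
df["odds"] = pd.to_numeric(df["odds"], errors="coerce")

# Wide layout: one row per horse, one column per bookie.
wide = df.pivot(index="horse", columns="bookie", values="odds")

# Highest decimal odds on offer for each horse.
best = df.groupby("horse")["odds"].max()

print(wide)
print(best)
```

Note that pivot raises an error if the same horse/bookie pair appears more than once, so on real data you may need to drop duplicates first.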

Upvotes: 1

andrew_reece

Reputation: 21264

You can access any of the <tr> attributes through the .attrs property of each BeautifulSoup tag, which behaves like a dict.

Once you have table (the list returned by find_all in your question), loop over each entry and pull out the attributes you want into a list of dicts. Then initialize a Pandas data frame from the resulting list.

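import pandas as pd

horse_attrs = []

# table is the list of <tr> tags from the question
for entry in table:
    # One dict per horse; the keys become the column names
    attrs = dict(name=entry.attrs['data-bname'], dig=entry.attrs['data-best-dig'])
    horse_attrs.append(attrs)

df = pd.DataFrame(horse_attrs)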

df
                name   dig
0         Las Farras  9999
1         Heat Miami  9999
2        Martin Beck  9999
3             Litran  9999
4      Ritmo Capanga  9999
5      Perfect Score  9999
6   Simplemente Tuyo  9999
7            Anpacai  9999
8          Colt Fast  9999
9         Cacharpari  9999
10        Don Leparc  9999
11   Curioso Seattle  9999
12       Golpe Final  9999
13       El Acosador  9999

Notes:

  • The URL you provided didn't work for me, but this similar one did: https://www.oddschecker.com/horse-racing/palermo-arg/21:00/winner
  • I didn't see the exact attributes (data-name and data-odig) you mentioned on the <tr> tags, so I used ones with similar names. I don't know enough about horse racing to know if these are useful, but the method in this answer should let you choose any of the attributes that are available.
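Since the attributes seem to vary from page to page, a slightly more defensive version of the same idea might be useful. This is only a sketch: the HTML below is made up to imitate the shape of the real rows, and I use the built-in html.parser so it runs without lxml. Tag.get() returns None for a missing attribute, where entry.attrs[...] would raise a KeyError.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up markup that only imitates the shape of the real rows.
# The second row deliberately lacks data-best-dig.
html = """
<table>
  <tr class="diff-row evTabRow bc" data-bname="Horse A" data-best-dig="4.5"><td>x</td></tr>
  <tr class="diff-row evTabRow bc" data-bname="Horse B"><td>x</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("tr", class_="diff-row evTabRow bc")

# .get() gives None when an attribute is absent instead of raising KeyError
horse_attrs = [
    {"name": entry.get("data-bname"), "dig": entry.get("data-best-dig")}
    for entry in table
]

df = pd.DataFrame(horse_attrs)
print(df)
```

Rows with a missing attribute then show up as empty values in the data frame, which you can filter out with df.dropna() if you don't want them.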

Upvotes: 5

ifly6

Reputation: 5331

When web-scraping, a common approach is to extract each value into a variable, collect one dict per element, and build the data frame from the list of dicts at the end:

records = []
for thing in elements:
    var1 = ...  # however you extract it
    var2 = ...

    # One dict per element; the keys become the column names
    records.append({'column1_name': var1, 'column2_name': var2})

df = pd.DataFrame(records)

How you select the data out of each HTML element is up to you (consider selecting the td cells?).
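As a rough illustration of selecting at the td level, here is a sketch that uses a CSS attribute selector. The HTML is made up, and the attribute names (data-bname, data-bk, data-odig) are assumed from the question and the other answers rather than checked against the live page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up markup; attribute names are assumptions taken from this thread.
html = """
<table>
  <tr class="diff-row" data-bname="Horse A">
    <td class="sel">Horse A</td>
    <td data-bk="B3" data-odig="3.75">11/4</td>
    <td data-bk="SK" data-odig="4.33">10/3</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
# td[data-odig] matches only the cells that actually carry an odds attribute,
# so the name cell is skipped without any manual class checks.
for td in soup.select("td[data-odig]"):
    row = td.find_parent("tr")  # the enclosing row holds the horse name
    records.append({
        "horse": row.get("data-bname"),
        "bookie": td.get("data-bk"),
        "odds": float(td["data-odig"]),
    })

df = pd.DataFrame(records)
print(df)
```

Selecting by attribute presence rather than by class tends to be less brittle if the site shuffles its class names, though that is a judgment call.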

Upvotes: 0
