Andrija_Grozdanovic

Reputation: 7

Beautiful soup parsing web page

I am trying to scrape https://www.racingpost.com with BeautifulSoup (BS). For example, I want to extract all the course names, which appear under this tag:

<span class="rh-cardsMatrix__courseName">Wincanton</span>

My code is here:

from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.racingpost.com"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, "html.parser")
pages = soup.find_all('span',{'class':'rh-cardsMatrix__courseName'})
for page in pages:
    print(page.text)

This prints nothing. I suspect a parsing issue, but I have tried all the parsers available for BS with the same result. Could someone advise? Is this even possible with BS?

Upvotes: 0

Views: 122

Answers (3)

Andrija_Grozdanovic

Reputation: 7

Thanks, mattbasta, for your answer. It directed me to this question, which solved my problem:

PyQt4 to PyQt5 -> mainFrame() deprecated, need fix to load web pages

The parsing lines are the same as in my question:

soup = BeautifulSoup(data, "html.parser")
pages = soup.find_all('span',{'class':'rh-cardsMatrix__courseName'})

Upvotes: 0

petezurich

Reputation: 10214

The data you are looking for seems to be hidden in a script block at the end of the raw HTML.

You can try something like this:

import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
from pandas import json_normalize

url = 'https://www.racingpost.com'
res = requests.get(url).text

raw = res.split('cardsMatrix":{"courses":')[1].split(',"date":"2020-03-06","heading":"Tomorrow\'s races"')[0]
data = json.loads(raw)
df = json_normalize(data)

Output:

   id    abandoned  allWeather  surfaceType  colour  name           countryCode  meetingUrl                                  hashName          meetingTypeCode  races
0  1083  False      True        Polytrack    3       Chelmsford     GB           /racecards/1083/chelmsford-aw/2020-03-06    chelmsford-aw     Flat             [{'id': 753047, 'abandoned': False, 'result': ...
1  1212  False      False                    4       Ffos Las       GB           /racecards/1212/ffos-las/2020-03-06         ffos-las          Jumps            [{'id': 750498, 'abandoned': False, 'result': ...
2  1138  False      True        Polytrack    11      Dundalk        IRE          /racecards/1138/dundalk-aw/2020-03-06       dundalk-aw        Flat             [{'id': 753023, 'abandoned': False, 'result': ...
3  513   False      True        Tapeta       5       Wolverhampton  GB           /racecards/513/wolverhampton-aw/2020-03-06  wolverhampton-aw  Flat             [{'id': 750658, 'abandoned': False, 'result': ...
4  565   False      False                    0       Jebel Ali      UAE          /racecards/565/jebel-ali/2020-03-06         jebel-ali         Flat             [{'id': 753155, 'abandoned': False, 'result': ...
5  206   False      False                    0       Deauville      FR           /racecards/206/deauville/2020-03-06         deauville         Flat             [{'id': 753186, 'abandoned': False, 'result': ...
6  54    True       False                    1       Sandown        GB           /racecards/54/sandown/2020-03-06            sandown           Jumps            [{'id': 750510, 'abandoned': True, 'result': F...
7  30    True       False                    2       Leicester      GB           /racecards/30/leicester/2020-03-06          leicester         Jumps            [{'id': 750501, 'abandoned': True, 'result': F...

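Caveat: The end delimiter in the second split contains a hard-coded date and heading ("2020-03-06", "Tomorrow's races"), so you have to look up the current string in the raw HTML manually for res to be split properly at the end.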
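One way to avoid looking up the end string by hand is to let the JSON decoder itself find where the value ends. The following is only a sketch under assumptions: the helper name extract_json_after and the sample string are made up for illustration, and the real page text may be structured differently. It relies on json.JSONDecoder.raw_decode, which parses one complete JSON value from a given position and ignores whatever follows it.

```python
import json


def extract_json_after(text, marker):
    """Decode the first JSON value that starts right after `marker` in `text`.

    Hypothetical helper for illustration; raises ValueError if the marker
    is missing or no valid JSON value follows it.
    """
    start = text.index(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so do that manually
    while start < len(text) and text[start].isspace():
        start += 1
    # Parses exactly one JSON value; trailing text is left untouched
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value


# Made-up stand-in for the raw page text; the real payload is much larger.
sample = (
    '<script>window.__PRELOADED_STATE = {"cardsMatrix":{"courses":'
    '[{"id":1083,"name":"Chelmsford"},{"id":1212,"name":"Ffos Las"}]'
    ',"date":"2020-03-06","heading":"Tomorrow\'s races"}};\n</script>'
)

courses = extract_json_after(sample, 'cardsMatrix":{"courses":')
course_names = [course["name"] for course in courses]
print(course_names)
```

With this approach only the start marker has to match the page, so a change of date or heading in the text after the list should not matter.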

Edit: More robust solution.

To grab the whole script block and parse from there, try this code:

url = 'https://www.racingpost.com'
res = requests.get(url).content
soup = BeautifulSoup(res, "html.parser")  # name the parser explicitly to avoid a warning

# salient data seems to be in 20th script block 
data = soup.find_all("script")[19].text
clean = data.split('window.__PRELOADED_STATE = ')[1].split(";\n")[0]
clean = json.loads(clean)
clean.keys()

Output:

['stories', 'bookmakers', 'panelTemplate', 'cardsMatrix', 'advertisement']

Then retrieve, for example, the data stored under the key cardsMatrix:

parsed = json_normalize(clean["cardsMatrix"]).courses.values[0]
pd.DataFrame(parsed)

Output is the same as above (but obtained with the more robust solution):

   id    abandoned  allWeather  surfaceType  colour  name           countryCode  meetingUrl                                  hashName          meetingTypeCode  races
0  1083  False      True        Polytrack    3       Chelmsford     GB           /racecards/1083/chelmsford-aw/2020-03-06    chelmsford-aw     Flat             [{'id': 753047, 'abandoned': False, 'result': ...
1  1212  False      False                    4       Ffos Las       GB           /racecards/1212/ffos-las/2020-03-06         ffos-las          Jumps            [{'id': 750498, 'abandoned': False, 'result': ...
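The script block could also be picked by its content instead of its position, since the index 19 may shift whenever the page layout changes. This is a rough sketch, not a tested scraper: the function name load_preloaded_state and the sample HTML are invented for illustration, and it assumes the marker window.__PRELOADED_STATE = still appears in exactly that form.

```python
import json

from bs4 import BeautifulSoup

# Marker taken from the code above; assumed to be unchanged on the live page
MARKER = "window.__PRELOADED_STATE = "


def load_preloaded_state(html):
    """Return the parsed state from the first script block containing MARKER."""
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        content = script.string or ""  # .string is None for empty script tags
        if MARKER in content:
            payload = content.split(MARKER)[1].split(";\n")[0]
            return json.loads(payload)
    raise ValueError("no script block contains the preloaded state marker")


# Made-up stand-in for the downloaded page
sample_html = """<html><body>
<script>var unrelated = 1;</script>
<script>window.__PRELOADED_STATE = {"cardsMatrix": {"courses": [{"name": "Chelmsford"}, {"name": "Ffos Las"}]}};
</script>
</body></html>"""

state = load_preloaded_state(sample_html)
names = [course["name"] for course in state["cardsMatrix"]["courses"]]
print(names)
```

The final list comprehension gives the course names asked for in the question, assuming each entry under courses keeps a name field as in the output above.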

Upvotes: 1

mattbasta

Reputation: 13709

In the raw source code of https://www.racingpost.com, no elements have the class name rh-cardsMatrix__courseName, yet querying for it on the rendered page shows that it does exist. This suggests that the elements with that class name are generated with JavaScript, which BeautifulSoup doesn't support (it only parses the HTML it is given and doesn't run JavaScript).

You'll instead want to find the endpoints on the webpage that return the data that create those elements (e.g., look for XHRs for data) and use those to get the data that you need.
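A quick way to confirm this diagnosis is to run the same lookup against the raw HTML and against a copy of the rendered markup. The snippet below is a minimal sketch: find_course_spans is a hypothetical helper, and both HTML strings are simplified stand-ins rather than the real page.

```python
from bs4 import BeautifulSoup


def find_course_spans(html, class_name="rh-cardsMatrix__courseName"):
    """Return the text of every <span> with the given class in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        span.get_text(strip=True)
        for span in soup.find_all("span", class_=class_name)
    ]


# Stand-in for what requests.get(url).text returns: the span is not there yet
raw_html = '<div id="app"></div><script>/* builds the cards matrix */</script>'

# Stand-in for the DOM after the browser has run the page's JavaScript
rendered_html = (
    '<div id="app">'
    '<span class="rh-cardsMatrix__courseName">Wincanton</span>'
    "</div>"
)

in_raw = find_course_spans(raw_html)
in_rendered = find_course_spans(rendered_html)
print(in_raw, in_rendered)
```

If the lookup comes back empty for the downloaded text but the element is visible in the browser's inspector, the parser is probably not at fault and the data has to come from another source, such as an embedded script block or an XHR endpoint.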

Upvotes: 0
