Reputation: 7
I am trying to scrape the following web page with BeautifulSoup: https://www.racingpost.com
For example, I want to extract all the course names. They are under this tag:
<span class="rh-cardsMatrix__courseName">Wincanton</span>
My code is here:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://www.racingpost.com"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, "html.parser")
pages = soup.find_all('span',{'class':'rh-cardsMatrix__courseName'})
for page in pages:
    print(page.text)
I don't get any output. I think it has some issue with parsing, and I have tried all the parsers available for BeautifulSoup. Could someone advise here? Is it even possible to do this with BeautifulSoup?
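One quick check (a minimal diagnostic sketch based on the code above) is whether the class name shows up in the downloaded HTML at all:
import requests

url = "https://www.racingpost.com"
data = requests.get(url).text
# if this prints False, the span is added by JavaScript after the page loads
# and BeautifulSoup will never see it in the raw HTML
print('rh-cardsMatrix__courseName' in data)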
Upvotes: 0
Views: 122
Reputation: 7
Thanks mattbasta for your answer; it directed me to this question, which solved my problem: PyQt4 to PyQt5 -> mainFrame() deprecated, need fix to load web pages
Upvotes: 0
Reputation: 10214
The data you are looking for seems to be hidden in a script block at the end of the raw HTML.
You can try something like this:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
from pandas import json_normalize
url = 'https://www.racingpost.com'
res = requests.get(url).text
# cut out the JSON array that sits between these two marker strings in the raw HTML
raw = res.split('cardsMatrix":{"courses":')[1].split(',"date":"2020-03-06","heading":"Tomorrow\'s races"')[0]
data = json.loads(raw)
df = json_normalize(data)
Output:
id abandoned allWeather surfaceType colour name countryCode meetingUrl hashName meetingTypeCode races
0 1083 False True Polytrack 3 Chelmsford GB /racecards/1083/chelmsford-aw/2020-03-06 chelmsford-aw Flat [{'id': 753047, 'abandoned': False, 'result': ...
1 1212 False False 4 Ffos Las GB /racecards/1212/ffos-las/2020-03-06 ffos-las Jumps [{'id': 750498, 'abandoned': False, 'result': ...
2 1138 False True Polytrack 11 Dundalk IRE /racecards/1138/dundalk-aw/2020-03-06 dundalk-aw Flat [{'id': 753023, 'abandoned': False, 'result': ...
3 513 False True Tapeta 5 Wolverhampton GB /racecards/513/wolverhampton-aw/2020-03-06 wolverhampton-aw Flat [{'id': 750658, 'abandoned': False, 'result': ...
4 565 False False 0 Jebel Ali UAE /racecards/565/jebel-ali/2020-03-06 jebel-ali Flat [{'id': 753155, 'abandoned': False, 'result': ...
5 206 False False 0 Deauville FR /racecards/206/deauville/2020-03-06 deauville Flat [{'id': 753186, 'abandoned': False, 'result': ...
6 54 True False 1 Sandown GB /racecards/54/sandown/2020-03-06 sandown Jumps [{'id': 750510, 'abandoned': True, 'result': F...
7 30 True False 2 Leicester GB /racecards/30/leicester/2020-03-06 leicester Jumps [{'id': 750501, 'abandoned': True, 'result': F...
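To get just the course names the question asked about, you can then read the name column from this DataFrame (a small usage sketch, assuming df has the columns shown above):
# course names are a plain column in the normalized DataFrame
print(df['name'].tolist())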
Caveat: Be aware that you have to manually look up the strings used to split res at the right places (e.g. the date in the heading will change).
Edit: More robust solution.
To get the whole script block and parse from there, try this code:
url = 'https://www.racingpost.com'
res = requests.get(url).content
soup = BeautifulSoup(res, "html.parser")
# salient data seems to be in 20th script block
data = soup.find_all("script")[19].text
clean = data.split('window.__PRELOADED_STATE = ')[1].split(";\n")[0]
clean = json.loads(clean)
clean.keys()
Output:
['stories', 'bookmakers', 'panelTemplate', 'cardsMatrix', 'advertisement']
Then retrieve e.g. the data saved under the key cardsMatrix:
parsed = json_normalize(clean["cardsMatrix"]).courses.values[0]
pd.DataFrame(parsed)
The output is again the table above (but obtained with the more robust approach):
id abandoned allWeather surfaceType colour name countryCode meetingUrl hashName meetingTypeCode races
0 1083 False True Polytrack 3 Chelmsford GB /racecards/1083/chelmsford-aw/2020-03-06 chelmsford-aw Flat [{'id': 753047, 'abandoned': False, 'result': ...
1 1212 False False 4 Ffos Las GB /racecards/1212/ffos-las/2020-03-06 ffos-las Jumps [{'id': 750498, 'abandoned': False, 'result': ...
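If the position of the script block ever shifts, a less brittle variant is to pick the block by its content instead of a hard-coded index (a sketch, assuming the window.__PRELOADED_STATE marker is still present in the page source):
import json
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.racingpost.com').content
soup = BeautifulSoup(res, "html.parser")
marker = 'window.__PRELOADED_STATE = '
# find the script block that actually contains the preloaded state
block = next(s.text for s in soup.find_all("script") if marker in s.text)
clean = json.loads(block.split(marker)[1].split(";\n")[0])
print(clean.keys())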
Upvotes: 1
Reputation: 13709
Viewing the source code of https://www.racingpost.com, no elements have the class name rh-cardsMatrix__courseName. Querying for it on the page shows that it does exist when the page is rendered. This suggests that the elements with that class name are generated with JavaScript, which BeautifulSoup doesn't support (it doesn't run JavaScript).
You'll instead want to find the endpoints on the site that return the data used to create those elements (e.g., look for XHR requests in the browser's network tab) and use those to get the data you need.
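For example, once such an endpoint is spotted in the network tab, it can usually be called directly (a sketch with a hypothetical URL, not a real Racing Post endpoint):
import requests

# hypothetical endpoint: the real URL and parameters must be copied
# from the actual XHR request seen in the browser's dev tools
api_url = 'https://www.racingpost.com/some-json-endpoint'
payload = requests.get(api_url, headers={'User-Agent': 'Mozilla/5.0'}).json()
# then pull the course names out of whatever JSON structure comes back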
Upvotes: 0