Reputation: 11
I am trying to scrape this table https://momentranks.com/topshot/account/mariodustice?limit=250
I have tried this:
import requests
from bs4 import BeautifulSoup
url = 'https://momentranks.com/topshot/account/mariodustice?limit=250'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
table = soup.find_all('table', attrs={'class':'Table_tr__1JI4P'})
But it returns an empty list. Can someone give advice on how to approach this?
Upvotes: 1
Views: 169
Reputation: 46759
The HTML you see in your browser is rendered by JavaScript. When you use requests
this does not happen, which is why your script returns an empty list. If you print the HTML that requests returns, it will not contain the table you are looking for.
All of the information is, however, available via the API that your browser calls to build the page. You will need to take a detailed look at the JSON structure returned to decide which information you wish to extract.
The following example shows how to get a list of the names and MRvalue of each player:
import json

import requests
from bs4 import BeautifulSoup

s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
url = 'https://momentranks.com/topshot/account/mariodustice?limit=250'
req_main = s.get(url, headers=headers)
soup = BeautifulSoup(req_main.content, 'lxml')

# The flowAddress is embedded in the __NEXT_DATA__ <script> tag
data = soup.find('script', id='__NEXT_DATA__')
json_data = json.loads(data.string)
account = json_data['props']['pageProps']['account']['flowAddress']

# The API expects a JSON body, so use json= rather than data=
post_data = {"flowAddress": account, "filters": {"page": 0, "limit": "250", "type": "moments"}}
req_json = s.post('https://momentranks.com/api/account/details', headers=headers, json=post_data)
player_data = req_json.json()

for player in player_data['data']:
    name = player['moment']['playerName']
    mrvalue = player['MRvalue']
    print(f"{name:30} ${mrvalue:.02f}")
Giving you output starting:
Scottie Barnes $672.38
Cade Cunningham $549.00
Josh Giddey $527.11
Franz Wagner $439.26
Trae Young $429.51
A'ja Wilson $387.07
Ja Morant $386.00
The flowAddress is needed from the first page request to allow the API to be used correctly. It happens to be embedded in a <script> section at the bottom of the HTML.
All of this was worked out by using the browser's network tools to watch how the actual webpage made requests to the server to build its page.
Upvotes: 0
Reputation: 28564
Selenium is a bit overkill when there is an available API. Just get the data directly:
import requests
import pandas as pd
url = 'https://momentranks.com/api/account/details'
rows = []
page = 0
while True:
    payload = {
        'filters': {'page': '%s' % page, 'limit': "250", 'type': "moments"},
        'flowAddress': "f64f1763e61e4087"}
    jsonData = requests.post(url, json=payload).json()
    data = jsonData['data']
    rows += data
    print('%s of %s' % (len(rows), jsonData['totalCount']))
    if len(rows) == jsonData['totalCount']:
        break
    page += 1
df = pd.DataFrame(rows)
Output:
print(df)
_id flowId ... challenges priceFloor
0 619d2f82fda908ecbe74b607 24001245 ... NaN NaN
1 61ba30837c1f070eadc0f8e4 25651781 ... NaN NaN
2 618d87b290209c5a51128516 21958292 ... NaN NaN
3 61aea763fda908ecbe9e8fbf 25201655 ... NaN NaN
4 60c38188e245f89daf7c4383 15153366 ... NaN NaN
... ... ... ... ...
1787 61d0a2c37c1f070ead6b10a8 27014524 ... NaN NaN
1788 61d0a2c37c1f070ead6b10a8 27025557 ... NaN NaN
1789 61e9fafcd8acfcf57792dc5d 28711771 ... NaN NaN
1790 61ef40fcd8acfcf577273709 28723650 ... NaN NaN
1791 616a6dcb14bfee6c9aba30f9 18394076 ... NaN NaN
[1792 rows x 40 columns]
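The rows returned by the API nest the player details under a moment key, so the raw DataFrame above keeps them as dict columns. If you want a flat table of, say, player names and MRvalue, pandas.json_normalize can expand the nested dicts into dotted column names. A minimal sketch on hand-made records shaped like the API's data entries (the field names moment.playerName and MRvalue come from the answer above; the sample values here are made up):

```python
import pandas as pd

# Sample records shaped like the API's 'data' entries (values are illustrative)
rows = [
    {"MRvalue": 672.38, "moment": {"playerName": "Scottie Barnes", "playerId": "p1"}},
    {"MRvalue": 549.00, "moment": {"playerName": "Cade Cunningham", "playerId": "p2"}},
]

# json_normalize flattens nested dicts into columns like 'moment.playerName'
df = pd.json_normalize(rows)
print(df[["moment.playerName", "MRvalue"]])
```

In the real script you would pass the accumulated rows list straight to pd.json_normalize(rows) instead of pd.DataFrame(rows).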
Upvotes: 1
Reputation: 11
The data is loaded into the page by JavaScript code, so you can't use requests alone. You can, however, use Selenium. Keep in mind that Selenium's driver.get doesn't wait for the page to load completely, which means you need to wait yourself.
Here is something to get you started with Selenium:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
url = 'https://momentranks.com/topshot/account/mariodustice?limit=250'
driver.get(url)
time.sleep(5)  # adjust the wait to your case (in seconds); WebDriverWait is a more robust alternative
soup = BeautifulSoup(driver.page_source, 'lxml')
table = soup.find_all('table', attrs={'class': 'Table_tr__1JI4P'})
Upvotes: 0