Reputation: 6496
I am trying to scrape a JavaScript-rendered table from a website into a dataframe. The soup contains only the script location, not the table itself. The MWE and soup output are given below. Is it possible to scrape the table from https://iborrowdesk.com into a dataframe, and if so, how?
MWE
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/72.0.3626.28 Safari/537.36'}
session = requests.Session()
website = session.get('https://iborrowdesk.com', headers=headers, timeout=10)
website.raise_for_status()
soup = BeautifulSoup(website.text, 'lxml')
table = soup.find('table', class_='table table-condensed table-hover')
data = pd.read_html(str(table))[0]
Soup output
<html><head><link href="/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="/site.webmanifest" rel="manifest"/>
<link color="#5bbad5" href="/safari-pinned-tab.svg" rel="mask-icon"/>
<meta content="#da532c" name="msapplication-TileColor"/>
<meta content="#ffffff" name="theme-color"/>
<link href="https://maxcdn.bootstrapcdn.com/bootswatch/3.3.6/flatly/bootstrap.min.css" rel="stylesheet"/>
<meta charset="utf-8"/><meta content="width=device-width,initial-scale=1" name="viewport"/>
<title>IBorrowDesk</title><script src="//cdn.thisiswaldo.com/static/js/9754.js"></script>
</head><body><div class="container"></div><script src="/static/main.bundle.js?39ed89dd02e44899ebb4">
</script></body></html>
Upvotes: 0
Views: 563
Reputation: 51
As Jason Baker mentioned in his post, you can use the API that the site provides. Alternatively, you can use Selenium to scrape the data. The question "Python webscraping: BeautifulSoup not showing all html source content" is relevant here: it explains why requests.Session().get(url) cannot retrieve every element in the DOM. The table is created by JavaScript after the page loads, so the HTML that requests downloads does not contain it. The answers to that question also include a code snippet, which I've adapted to your case:
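from io import StringIO
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
browser = webdriver.Firefox()
try:
    browser.get('https://iborrowdesk.com/')
    # The table is built by JavaScript, so wait until it is present in the DOM
    table = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'table')))
    # Recent pandas versions expect a file-like object rather than a raw HTML string
    data = pd.read_html(StringIO(table.get_attribute("outerHTML")))[0]
finally:
    browser.quit()
print(data)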
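If you want to confirm for yourself that the table is missing from the downloaded HTML, here is a minimal offline sketch. It uses only the standard library, and PAGE_SOURCE is an abbreviated copy of the soup output from the question. TagCounter is a helper name I made up for this example.

```python
from html.parser import HTMLParser

# Abbreviated copy of the page source shown in the question.
PAGE_SOURCE = """<html><head><title>IBorrowDesk</title></head>
<body><div class="container"></div>
<script src="/static/main.bundle.js?39ed89dd02e44899ebb4"></script>
</body></html>"""


class TagCounter(HTMLParser):
    """Hypothetical helper: counts how often each start tag appears."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(PAGE_SOURCE)
counter.close()

# Number of <table> tags in the static HTML
print(counter.counts.get("table", 0))
# Number of <script> tags, i.e. the bundle that builds the table later
print(counter.counts.get("script", 0))
```

The static source has an empty container div and a script tag but no table, which is why soup.find('table', ...) returns None in the MWE.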
Upvotes: 0
Reputation: 3706
You can use requests, since the site exposes an API.
import json
import pandas as pd
import requests
def get_data() -> pd.DataFrame:
    url = "https://iborrowdesk.com/api/most_expensive"
    with requests.Session() as session:
        response = session.get(url, timeout=10)
        # Raise an HTTPError for any 4xx/5xx response
        response.raise_for_status()
        data = json.loads(response.text)
    return pd.json_normalize(data=data["results"])

df = get_data()
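To show what pd.json_normalize does with the "results" list, here is a small offline sketch. The payload is made up: only the top-level "results" key comes from the code above, and the field names (symbol, fee, available) are assumptions, not the real schema of the API.

```python
import pandas as pd

# Hypothetical payload: only the top-level "results" key comes from the
# answer above; the field names and values are invented for illustration.
sample = {
    "results": [
        {"symbol": "AAA", "fee": 12.5, "available": 1000},
        {"symbol": "BBB", "fee": 3.0, "available": 250},
    ]
}

# Each dict in the list becomes one row; its keys become column names.
sample_df = pd.json_normalize(sample["results"])
print(sample_df.columns.tolist())
print(len(sample_df))
```

If the real response nests objects inside each result, json_normalize flattens them into dotted column names, so it is worth printing the columns once before relying on them.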
Upvotes: 2