sfactor

Reputation: 13062

How to scrape html table only after data loads using Python Requests?

I am trying to learn data scraping using Python and have been using the Requests and BeautifulSoup4 libraries. They work well for normal websites, but when I tried to get data out of websites where the table data loads only after some delay, I found that I get an empty table. An example would be this webpage

The script I've tried is a fairly routine one.

import requests
from bs4 import BeautifulSoup

response = requests.get("http://www.oddsportal.com/soccer/england/premier-league/everton-arsenal-tnWxil2o#over-under;2")
soup = BeautifulSoup(response.text, "html.parser")

content = soup.find('div', {'id': 'odds-data-portal'})

The data loads into the odds-data-portal table on the page, but the code doesn't give me that. How can I make sure the table is loaded with data before I grab it?

Upvotes: 2

Views: 4896

Answers (2)

Martin Evans

Reputation: 46759

You will need to use something like selenium to get the HTML. You can, though, continue to use BeautifulSoup to parse it, as follows:

from bs4 import BeautifulSoup
from operator import itemgetter
from selenium import webdriver

url = "http://www.oddsportal.com/soccer/england/premier-league/everton-arsenal-tnWxil2o#over-under;2"
browser = webdriver.Firefox()

# Let the browser render the page, including the JavaScript-loaded table.
browser.get(url)
soup = BeautifulSoup(browser.page_source, "html.parser")
data_table = soup.find('div', {'id': 'odds-data-table'})

for div in data_table.find_all_next('div', class_='table-container'):
    # Each row's cells are held in <span> and <strong> elements.
    row = div.find_all(['span', 'strong'])

    if len(row):
        # Reorder the cells before joining them into a CSV-style line.
        print(','.join(cell.get_text(strip=True) for cell in itemgetter(0, 4, 3, 2, 1)(row)))

This would display:

Over/Under +0.5,(8),1.04,11.91,95.5%
Over/Under +0.75,(1),1.04,10.00,94.2%
Over/Under +1,(1),1.04,11.00,95.0%
Over/Under +1.25,(2),1.13,5.88,94.8%
Over/Under +1.5,(9),1.21,4.31,94.7%
Over/Under +1.75,(2),1.25,3.93,94.8%
Over/Under +2,(2),1.31,3.58,95.9%
Over/Under +2.25,(4),1.52,2.59,95.7%   
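
Since the odds are filled in by JavaScript, there is a chance page_source is read before the table has finished loading. An explicit wait guards against that; a minimal sketch to slot in right after browser.get(url), assuming the container keeps the id odds-data-table:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for the odds container to appear in the DOM
# before reading page_source.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, 'odds-data-table')))
soup = BeautifulSoup(browser.page_source, "html.parser")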

Update - as suggested by @JRodDynamite, to run headless, PhantomJS can be used instead of Firefox. To do this:

  1. Download the PhantomJS Windows binary.

  2. Extract the phantomjs.exe executable and ensure it is in your PATH.

  3. Change the following line: browser = webdriver.PhantomJS()
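
With that change in place, the start of the script looks like this; the rest is unchanged (a minimal sketch, assuming phantomjs.exe is already on your PATH):

from bs4 import BeautifulSoup
from selenium import webdriver

url = "http://www.oddsportal.com/soccer/england/premier-league/everton-arsenal-tnWxil2o#over-under;2"

# PhantomJS runs headless, so no browser window is opened.
browser = webdriver.PhantomJS()
browser.get(url)
soup = BeautifulSoup(browser.page_source, "html.parser")

# ... parse exactly as above, then shut the browser down.
browser.quit()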

Upvotes: 2

JRodDynamite

Reputation: 12613

Sorry, I can't open the link. But the table is probably generated in one of 2 ways:

  1. Purely by JavaScript with no AJAX call.
  2. Using an AJAX call and some JavaScript for DOM manipulation.

If it is the first case, then you have no option but to use selenium-webdriver in Python. Also, you can have a look at the example in this answer.

If it is the second case, then you can find out the URL and the data sent, and then use the requests module to send a similar request to fetch the data. The data may come back as JSON or HTML (depending on how the developer built it); you'll have to parse it accordingly.
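
For that second case, the general shape with requests is sketched below; the endpoint, parameters and headers are purely hypothetical placeholders for whatever the browser's developer tools (Network tab) show the page actually requesting:

import requests

# Hypothetical endpoint and parameters - replace them with the real ones
# captured from the request the page fires in the Network tab.
ajax_url = "http://www.example.com/ajax/odds"
params = {"match-id": "tnWxil2o"}
headers = {
    "X-Requested-With": "XMLHttpRequest",  # many endpoints check for this
    "Referer": "http://www.oddsportal.com/",
}

response = requests.get(ajax_url, params=params, headers=headers)

# A JSON response can be parsed directly; an HTML fragment would go
# through BeautifulSoup instead.
data = response.json()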

Sometimes the AJAX call may require a CSRF token or a cookie as part of the data; in that case, you'll have to fall back to the solution for the first case.

Upvotes: 4
