Reputation: 571
I need to get the source from a page to use with BS4. However, the middle of the page takes about a second (maybe less) to load its content, and requests.get captures the source of the page before that section loads. How can I wait a second before getting the data?
r = requests.get(URL + self.search, headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(r.content, 'html.parser')
a = soup.find_all('section', 'wrapper')
<section class="wrapper" id="resultado_busca">
Upvotes: 47
Views: 126451
Reputation: 424
I had the same problem, and none of the submitted answers really worked for me. But after long research, I found a solution:
from requests_html import HTMLSession

s = HTMLSession()
url = 'http://legendas.tv/busca/walking%20dead%20s03e02'
response = s.get(url)
response.html.render()  # executes the page's JavaScript in a headless browser

print(response.html.html)
# prints the content of the fully loaded page,
# which can then be parsed with, for example, bs4
The requests_html
package (docs) is distributed under the Python Software Foundation's umbrella (it is hosted in the PSF's GitHub organization). It has some additional JavaScript capabilities, such as the ability to wait until the JS of a page has finished loading.
The package only supports Python 3.6 and above at the moment, so it might not work with earlier versions.
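If the page needs extra time even after rendering, render() accepts timing parameters. A minimal sketch; the sleep and timeout values below are illustrative assumptions, not part of the original answer:

from requests_html import HTMLSession

s = HTMLSession()
response = s.get('http://legendas.tv/busca/walking%20dead%20s03e02')
# sleep: seconds to pause after the initial render
# timeout: maximum seconds to wait for Chromium to load the page
response.html.render(sleep=1, timeout=10)
print(response.html.html)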
Upvotes: 26
Reputation: 1210
Selenium is a good way to solve this, but the accepted answer is quite deprecated. As @Seth mentioned in the comments, the headless mode of Firefox/Chrome (or possibly other browsers) should be used instead of PhantomJS.
First of all, you need to download the specific driver:
Geckodriver for Firefox
ChromeDriver for Chrome
Next, you can add the path to the downloaded driver to your system PATH variable. That's not necessary, though; you can also specify in code where the executable lies.
Firefox:
from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
# executable_path param is not needed if you updated PATH
browser = webdriver.Firefox(options=options, executable_path='YOUR_PATH/geckodriver.exe')
browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
html = browser.page_source
soup = BeautifulSoup(html, features="html.parser")
print(soup)
browser.quit()
Similarly for Chrome:
from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
# executable_path param is not needed if you updated PATH
browser = webdriver.Chrome(options=options, executable_path='YOUR_PATH/chromedriver.exe')
browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
html = browser.page_source
soup = BeautifulSoup(html, features="html.parser")
print(soup)
browser.quit()
It's good to remember to call browser.quit()
to avoid hanging processes after code execution. If you're worried that your code may fail before the browser is disposed of, you can wrap it in a try...except
block and put browser.quit()
in the finally
part to ensure it will be called.
Additionally, if part of the source is still not loaded using that method, you can ask Selenium to wait until a specific element is present:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
browser = webdriver.Firefox(options=options, executable_path='YOUR_PATH/geckodriver.exe')

try:
    browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
    timeout_in_seconds = 10
    # block until the element with id="resultado_busca" is present in the DOM
    WebDriverWait(browser, timeout_in_seconds).until(
        ec.presence_of_element_located((By.ID, 'resultado_busca')))
    html = browser.page_source
    soup = BeautifulSoup(html, features="html.parser")
    print(soup)
except TimeoutException:
    print("I give up...")
finally:
    browser.quit()
If you're interested in drivers other than Firefox or Chrome, check the docs.
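For example, Edge follows the same pattern. A minimal sketch, assuming Selenium's bundled Edge driver class and a downloaded msedgedriver executable (the path is a placeholder):

from selenium import webdriver

# executable_path is a placeholder; not needed if the driver is on PATH
browser = webdriver.Edge(executable_path='YOUR_PATH/msedgedriver.exe')
browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
print(browser.page_source)
browser.quit()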
Upvotes: 14
Reputation: 412
Just to list my way of doing it; maybe it can be of value for someone:
import time
import requests

max_retries = 5    # some int
retry_delay = 2    # some int (seconds to wait between attempts)

n = 1
ready = 0
while n < max_retries:
    try:
        response = requests.get('https://github.com')
        if response.ok:
            ready = 1
            break
    except requests.exceptions.RequestException:
        print("Website not available...")
    n += 1
    time.sleep(retry_delay)

if ready != 1:
    print("Problem")
else:
    print("All good")
Upvotes: -5
Reputation: 618
I found a way to do that!
r = requests.get('https://github.com', timeout=(3.05, 27))
In this, timeout takes two values: the first is the connection timeout (how long requests will wait while establishing the connection) and the second is the read timeout (how long it will wait for the server to send data once connected). If the server takes a while to populate the content, a longer read timeout gives it that time before requests gives up. You can estimate how long the page takes to populate and size the read timeout accordingly, then print the data out.
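A short sketch of how the two values behave; the exception classes are requests' standard timeout exceptions, and the URL and durations are taken from the answer:

import requests

try:
    # (connect timeout, read timeout) in seconds
    r = requests.get('https://github.com', timeout=(3.05, 27))
    print(r.status_code)
except requests.exceptions.ConnectTimeout:
    print("Could not establish a connection within 3.05 seconds")
except requests.exceptions.ReadTimeout:
    print("The server sent no data for 27 seconds")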
Upvotes: 11
Reputation: 99
In Python 3, using the urllib
module in practice works better when loading dynamic webpages than the requests
module, e.g.:
import urllib.request

url = "http://legendas.tv/busca/walking%20dead%20s03e02"

try:
    with urllib.request.urlopen(url) as response:
        # use whatever encoding the webpage declares
        html = response.read().decode('utf-8')
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(f"{url} is not found")
    elif e.code == 503:
        print(f'{url} base webservices are not available')
        ## can add authentication here
    else:
        print('http error', e)
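As a side note on the ## can add authentication here branch: request headers (for example a User-Agent, since some sites reject urllib's default one) can be attached with urllib.request.Request. A minimal sketch, where the header value is an arbitrary example:

import urllib.request

url = "http://legendas.tv/busca/walking%20dead%20s03e02"
# the User-Agent string is an illustrative placeholder, not a required value
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req, timeout=5) as response:
    html = response.read().decode('utf-8')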
Upvotes: 5
Reputation: 6518
It doesn't look like a problem of waiting; it looks like the element is being created by JavaScript, and requests
can't handle elements that are dynamically generated by JavaScript. A suggestion is to use selenium
together with PhantomJS
to get the page source; then you can use BeautifulSoup
for your parsing. The code shown below will do exactly that:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "http://legendas.tv/busca/walking%20dead%20s03e02"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
a = soup.find('section', 'wrapper')
Also, there's no need to use .findAll
if you are only looking for one element.
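To illustrate that point on the sample tag from the question (a minimal, self-contained sketch):

from bs4 import BeautifulSoup

html = '<section class="wrapper" id="resultado_busca"></section>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('section', 'wrapper'))      # first matching element, or None
print(soup.find_all('section', 'wrapper'))  # always a list of all matches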
Upvotes: 73