silentcobra

Reputation: 1

Unable to scrape via selenium in python because of infinite page load

I am trying to extract the contents of some news articles. Some of the URLs require logging in to access the full content, so I decided to use Selenium to automate the login. However, I am not able to extract anything because the first URL takes forever to load, and the script never reaches the point where the actual text extraction is done; it ends up throwing a TimeoutException.

Here is my code:

for url in url_list:
    chrome_options = Options()
    ua = UserAgent()
    userAgent = ua.random
    chrome_options.add_argument(f'user-agent={userAgent}')
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
    driver.get(url)
    time.sleep(5)
    frame = driver.find_elements_by_xpath('//iframe[@id="wallIframe"]')
    # Some articles require going through a paywall and some don't
    if len(frame) == 0:
        text_elements = driver.find_elements_by_xpath('//section[@id="main-content"]//article//p')
        text = " ".join(x.text for x in text_elements)
    else:
        text = log_in(frame)
    driver.quit()

Although the code never reaches it, here is my log_in method:

def log_in(frame):
    driver.switch_to.frame(frame[0])
    driver.find_element_by_id("PAYWALL_V2_SIGN_IN").click()
    time.sleep(2)
    driver.find_elements_by_id("username")[0].send_keys(username)
    time.sleep(2)
    driver.find_elements_by_xpath('//button[text()="Continue"]')[0].click()
    time.sleep(1)
    driver.find_elements_by_id("password")[0].send_keys(password)
    time.sleep(1)
    element = driver.find_elements_by_xpath('//button[@type="submit"]')[0]
    element.click()
    time.sleep(1)
    text = parse_text(element)
    return text

How can I get around this?

Upvotes: 0

Views: 118

Answers (2)

Max Shouman

Reputation: 1331

Instead of hard-coding pauses with time.sleep, you should use WebDriverWait along with expected_conditions; this way the action on your element is performed only once a certain condition is satisfied (for example, when the element is visible or clickable).

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

try:
    frame = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.XPATH, '//iframe[@id="wallIframe"]'))
    )
except TimeoutException:
    print("Element not found.")
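Applied to the loop in the question, the explicit wait can be wrapped in a small helper that returns the paywall iframe or None, replacing both the time.sleep(5) and the find_elements check. This is a sketch: the locator is taken from the question, and the 30-second timeout and helper name are arbitrary choices.

```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException


def wait_for_wall_iframe(driver, timeout=30):
    """Return the paywall iframe once it appears, or None if it never does."""
    try:
        return WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.XPATH, '//iframe[@id="wallIframe"]'))
        )
    except TimeoutException:
        return None  # no paywall within `timeout` seconds
```

In the loop you would then write `frame = wait_for_wall_iframe(driver)` and branch on `frame is None` instead of `len(frame) == 0`.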

Upvotes: 1

Mate Mrše

Reputation: 8444

Add this to your chrome options:

options.page_load_strategy = 'eager'

"Eager" page load strategy will wait until the HTML has been loaded, but won't wait for loading of CSS, images and such.

There is more on page load strategies in the Selenium documentation.
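A minimal sketch of where the option goes, using the Selenium 4 syntax (the strategy must be set on the options object before the driver is created):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.page_load_strategy = 'eager'  # return from get() once the DOM is ready,
                                      # without waiting for images, CSS, etc.

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
```

With 'eager', driver.get() returns as soon as document.readyState reaches "interactive", which avoids hanging on pages that keep loading resources indefinitely.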

Upvotes: 0
