VenuBhaskar

Reputation: 54

Selenium in Python is skipping articles while scraping data

I'm trying to extract data from articles using Selenium in Python. The code identifies the articles, but while the loop runs, a few articles are skipped at random. Any help resolving this issue would be appreciated.

#Importing libraries
import requests
import os
import json
import pandas as pd
from bs4 import BeautifulSoup
import time
from time import sleep
import traceback
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager

#opening a chrome instance
options = webdriver.ChromeOptions() 
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(options=options, executable_path=r"C:/selenium/chromedriver.exe")

#getting into the website
driver.get('https://academic.oup.com/rof/issue/2/2')

#getting the articles
articles = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, '/html/body/div[3]/main/section/div/div/div[1]/div/div[3]/div[2]/div[3]/div/div/div/div/h5')))

#loop to get in and out of articles
for article in articles:
    try:
        ActionChains(driver).key_down(Keys.CONTROL).click(article).key_up(Keys.CONTROL).perform()
        WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
        window1 = driver.window_handles[1]
        driver.switch_to_window(window1)
        driver.close()
        driver.switch_to_window(window0)
    except:
        print("couldnt get the article")

Upvotes: 0

Views: 397

Answers (1)

frianH

Reputation: 7563

First, to collect all the article elements, you can use this CSS selector:

articles = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.customLink.item-title a')))

Second, this is the wrong method:

driver.switch_to_window(window1)

It should be:

driver.switch_to.window(window1)

Note the difference: switch_to.window uses a dot (.), not an underscore (_).

Third, you forgot to initialize the window0 variable:

window0 = driver.window_handles[0]
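As a minimal sketch of the same idea, the original handle can be recorded once before the loop and the new tab found by exclusion rather than by list index. The helper name `close_new_tab_and_return` is hypothetical; it only relies on `window_handles`, `switch_to.window()` and `close()`, which a Selenium WebDriver provides.

```python
def close_new_tab_and_return(driver, original_handle):
    """Switch to the one tab that is not `original_handle`, close it, switch back.

    `driver` is assumed to be a Selenium WebDriver (anything exposing
    `window_handles`, `switch_to.window()` and `close()` will do).
    Returns the handle of the tab that was closed.
    """
    # Find the new tab by exclusion instead of relying on list order.
    new_handles = [h for h in driver.window_handles if h != original_handle]
    if len(new_handles) != 1:
        raise RuntimeError(
            f"expected exactly one new tab, found {len(new_handles)}"
        )
    new_handle = new_handles[0]

    driver.switch_to.window(new_handle)       # focus the article tab
    driver.close()                            # close it
    driver.switch_to.window(original_handle)  # return to the issue page
    return new_handle


# Possible usage inside the loop (assumes `driver` already exists):
# window0 = driver.current_window_handle   # record once, before the loop
# ... Ctrl+click the article, wait for EC.number_of_windows_to_be(2) ...
# close_new_tab_and_return(driver, window0)
```

Raising when the number of extra tabs is not exactly one makes a failed click visible instead of silently closing the wrong window.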

Finally, try the following code:

#getting into the website
driver.get('https://academic.oup.com/rof/issue/2/2')

#getting the articles
articles = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.customLink.item-title a')))

#loop to get in and out of articles
for article in articles:
    try:
        #Ctrl+click opens the article in a new tab
        ActionChains(driver).key_down(Keys.CONTROL).click(article).key_up(Keys.CONTROL).perform()
        WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
        #switch to the new tab, then close it
        window1 = driver.window_handles[1]
        driver.switch_to.window(window1)
        driver.close()
        #switch back to the original tab
        window0 = driver.window_handles[0]
        driver.switch_to.window(window0)
    except Exception as exc:
        #report the cause instead of hiding it
        print("couldn't get the article:", exc)

driver.quit()
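If the tab switching itself keeps causing trouble, one alternative (not part of the code above, only a sketch) is to read each link's href first and then load the pages one by one in the same tab. The helper names `collect_links` and `visit_each` and the `handle_page` callback are hypothetical; the only Selenium calls assumed are `element.get_attribute("href")` and `driver.get(url)`.

```python
def collect_links(elements):
    """Return the unique, non-empty href values of `elements`, in page order."""
    seen = set()
    urls = []
    for element in elements:
        href = element.get_attribute("href")
        if href and href not in seen:
            seen.add(href)
            urls.append(href)
    return urls


def visit_each(driver, urls, handle_page):
    """Load every URL in the current tab and call `handle_page(driver)` on it.

    Returns the URLs that raised an exception, so that no article is
    skipped silently.
    """
    failed = []
    for url in urls:
        try:
            driver.get(url)
            handle_page(driver)
        except Exception as exc:
            print(f"could not process {url}: {exc!r}")
            failed.append(url)
    return failed


# Possible usage (assumes `driver` and `articles` from the code above):
# urls = collect_links(articles)
# failed = visit_each(driver, urls, lambda d: print(d.title))
```

Collecting the URLs up front means the loop no longer depends on elements from the issue page or on which tab has focus, and the returned list shows exactly which articles failed.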

Upvotes: 1
