dil-emma
dil-emma

Reputation: 11

Selenium Python, parsing through website, opening a new tab, and scraping

I'm new at Python and Selenium. I'm trying to do something--which im sure im going in a very round-about way--any help is greatly appreciated.

The page im trying to parse through has different cards that need to be clicked on, i need to go to each card, and from there grab the name (h1) and the url. I havent gotten very far, and this is what i have so far.

I go through the first page, grab all the urls, add them to a list. Then i want to go through the list, and go to each url (opening a new tab) and from there grabbing the h1 and url. It doesn't seem like I'm even able to grab the h1, and it opens a new tab, then hangs, then opens the same tab.

Thank you in advance!

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()

driver.get('https://zdb.pedaily.cn/enterprise//') #main URL

title_links = driver.find_elements_by_css_selector('ul.n4 a')
urls = [] #list of URLs

# main = driver.find_elements_by_id('enterprise-list')

for item in title_links:
    urls.append(item.get_attribute('href'))

# print(urls)

for url in urls:
    driver.execute_script("window.open('');")
    driver.switch_to.window(driver.window_handles[1])
    driver.get(url)
    print(driver.find_element_by_css_selector('div.info h1'))

Upvotes: 0

Views: 625

Answers (1)

Boomer
Boomer

Reputation: 479

Well, there are a few issues here:

  • You should be much more specific with your tag for grabbing urls. This is leading to multiple copies of the same url--that's why it is opening the same pages again.
  • You should give the site enough time to load before trying to grab objects, that may be why it's timing out but always good to be on the safe side before grabbing objects.
  • You have to shift focus back to the original page to continue iterating the list
  • You don't need to inject JS to open a new tab and use a py call to open , and JS formatting could be cleaner
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise/')  # main URL

# Be much more specific or you'll get multiple returns of the same link
urls = driver.find_elements(By.TAG_NAME, 'ul.n4 li div.img a')

for url in urls:
    # get href to print
    print(url.get_attribute('href'))
    # Inject JS to open new tab
    driver.execute_script("window.open(arguments[0])", url)
    # Switch focus to new tab
    driver.switch_to.window(driver.window_handles[1])
    # Make sure what we want has time to load and exists before trying to grab it
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.info h1')))
    # Grab it and print it's contents
    print(driver.find_element(By.CSS_SELECTOR, 'div.info h1').text)
    # Uncomment the next line to do one tab at a time. Will reduce speed but not use so much ram.
    #driver.close()
    # Focus back on first window
    driver.switch_to.window(driver.window_handles[0])
# Close window
driver.quit()

Upvotes: 1

Related Questions