laosnd

Reputation: 127

How to handle two extracted elements respecting the order they appear - Python Selenium

I have two elements that need to be extracted in the order they appear in the page hierarchy. 'Name' repeats across the page 20 times, but the 'URL' only exists in 5 of them. I still need to know, for example, that the 'URL' found at position 2 belongs to the 'Name' found at position 10.

The problem is the structure: when there is no 'URL', there is no <a> tag with an href at all.

Structure:

When I have a 'URL':

<div class="middle">
   <h5>
      <p>
         <a href="www.example.com" target="_blank" class="entry-name clickable">
            <span class>NAME</span>
         </a>
      </p>
   </h5>
</div>

When I don't have a 'URL':

<div class="middle">
   <h5>
      <p>
         <span class="entry-author">
            <span class>NAME</span>
         </span>
      </p>
   </h5>
</div>

My code currently works this way:

name = driver.find_elements_by_xpath("//*[@class='middle']/h5/p")

name_list = []
for n in name:
    name_list.append(n.text)

url = driver.find_elements_by_xpath("//*[@class='middle']/h5/p/a")

url_list = []
for u in url:
    url_list.append(u.get_attribute('href'))

In 'name_list' I have 20 results and in 'url_list' I have only 5. The problem is that I don't know which URL belongs to which name.

Ideally, in this example 'url' would return 20 elements, with None where there is no URL.
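
To illustrate the shape I'm after: a rough sketch (using the same locators as above, and looking for the <a> relative to each <p>) that pads the missing URLs with None would be something like this:

# Sketch only: build two index-aligned lists, padding missing URLs with None.
# Assumes the same "//*[@class='middle']/h5/p" containers shown above.
rows = driver.find_elements_by_xpath("//*[@class='middle']/h5/p")

name_list = []
url_list = []
for row in rows:
    name_list.append(row.text)
    links = row.find_elements_by_xpath(".//a")   # relative search inside this row; [] if no link
    url_list.append(links[0].get_attribute('href') if links else None)

# name_list and url_list now both have 20 entries, aligned by index.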

Upvotes: 2

Views: 78

Answers (1)

Muhteva

Reputation: 2832

This should work:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Chrome()
driver.get(url)     # 'url' is the address of the page you are scraping
time.sleep(10)      # the page takes time to load

##### Scroll down #####
time.sleep(2)
last_height = driver.execute_script("return document.body.scrollHeight")
time.sleep(1)
last_height = last_height / 2
time.sleep(1)
driver.execute_script("window.scrollTo(0,{})".format(last_height))  # scroll to the middle of the page
time.sleep(1)

name = driver.find_elements_by_xpath("//*[@class='middle']/h5/p")

list_ = []
for n in name:
    try:    # no link: the name sits in span.entry-author
        name_ = n.find_element_by_xpath(".//span[@class='entry-author']").text
        url_ = ""
    except NoSuchElementException:   # there is a link: take name and href from the <a>
        a = n.find_element_by_xpath(".//a")
        name_ = a.text
        url_ = a.get_attribute('href')

    list_.append((name_, url_))

driver.quit()
print(list_)

The output is a list of tuples, each consisting of a name and its URL (an empty string when the entry has no link).
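
If you want the two parallel lists from your question back, you can unpack the tuples afterwards (empty strings mark the entries without a link):

# Recover the two index-aligned lists from the question
name_list = [name_ for name_, _ in list_]
url_list = [url_ for _, url_ in list_]

for i, (name_, url_) in enumerate(list_):
    print(i, name_, url_ or "no URL")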

Upvotes: 3
