Reputation: 127
I have two elements that need to be extracted in the same hierarchical order. 'Name' appears 20 times on the page, but a 'URL' exists in only 5 of those entries. I need to know which 'Name' each 'URL' belongs to, for example that the 'URL' found in position 2 belongs to the 'Name' found in position 10.
The problem is the structure: when there is no 'URL', there is no 'a' tag with an 'href' at all.
Structure:
When I have a 'URL':
<div class="middle">
  <h5>
    <p>
      <a href="www.example.com" target="_blank" class="entry-name clickable">
        <span class>NAME</span>
      </a>
    </p>
  </h5>
</div>
When I don't have a 'URL':
<div class="middle">
  <h5>
    <p>
      <span class="entry-author">
        <span class>NAME</span>
      </span>
    </p>
  </h5>
</div>
My code currently works like this:
names = driver.find_elements_by_xpath("//*[@class='middle']/h5/p")
name_list = []
for n in names:
    name_list.append(n.text)

urls = driver.find_elements_by_xpath("//*[@class='middle']/h5/p/a")
url_list = []
for u in urls:
    url_list.append(u.get_attribute('href'))
In name_list I get 20 results, but in url_list only 5, and I have no way of telling which URL belongs to which name.
Ideally, url_list would also contain 20 elements, with null values (None) wherever an entry has no URL, roughly like the sketch below.
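Something like this per-entry loop is what I have in mind, assuming find_elements_by_xpath can also be searched relative to each p element (I haven't verified this is the right approach):
names = driver.find_elements_by_xpath("//*[@class='middle']/h5/p")
name_list = []
url_list = []
for n in names:
    # ".//a" is relative to this <p>, so it only matches a link inside this entry
    links = n.find_elements_by_xpath(".//a")
    name_list.append(n.text)
    # keep both lists 20 elements long by appending None when there is no link
    url_list.append(links[0].get_attribute('href') if links else None)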
Upvotes: 2
Views: 78
Reputation: 2832
This should work:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Chrome()
driver.get(url)  # 'url' is the address of the page being scraped
time.sleep(10)  # the page takes time to load

##### Scroll down #####
time.sleep(2)
last_height = driver.execute_script("return document.body.scrollHeight")
time.sleep(1)
last_height = last_height / 2
time.sleep(1)
driver.execute_script("window.scrollTo(0,{})".format(last_height))
time.sleep(1)

names = driver.find_elements_by_xpath("//*[@class='middle']/h5/p")
list_ = []
for n in names:
    try:
        # no link: the name sits inside <span class="entry-author">
        name_ = n.find_element_by_xpath(".//span[@class='entry-author']").text
        url_ = ""
    except NoSuchElementException:
        # there is a link: take the name and href from the <a> tag
        a = n.find_element_by_xpath(".//a")
        name_ = a.text
        url_ = a.get_attribute('href')
    list_.append((name_, url_))

driver.quit()
print(list_)
The output is a list of tuples, each consisting of a name and a URL (an empty string when the entry has no link). The try/except keeps the pairs aligned because both values are read from the same p element on every iteration.
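If you still want the two parallel 20-element lists from your question, a minimal follow-up sketch (assuming the list_ built above, and treating an empty string as "no URL") would be:
name_list = [name for name, link in list_]
url_list = [link if link else None for name, link in list_]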
Upvotes: 3