Reputation: 1728
Here is the link to the website I want to extract data from.
I'm trying to get the text of every anchor (<a>) tag that has an href attribute.
Here is the sample HTML:
<div id="borderForGrid" class="border">
<h5 class="">
<a href="/products/product-details/?prod=30AD">A/D TC-55 SEALER</a>
</h5>
</div>
<div id="borderForGrid" class="border">
<h5 class="">
<a href="/products/product-details/?prod=P380">Carbocrylic 3356-1</a>
</h5>
</div>
I want to extract all the text values, e.g. ['A/D TC-55 SEALER', 'Carbocrylic 3356-1'].
I tried:
target = driver.find_element_by_class_name('border')
anchorElement = target.find_element_by_tag_name('a')
anchorElement.text
but it gives an empty string ('').
Any suggestion on how this can be achieved?
PS - Select first value of radio button under PRODUCT TYPE
Upvotes: 0
Views: 2842
Reputation: 897
Looks like all products are loaded when the website first loads. The pagination at the bottom does not actually switch to different pages, so you can extract all products from the very first request to http://www.carboline.com/products/. I used Python requests to fetch the website's HTML and lxml.html to parse it.
I would stay away from Selenium etc. if possible (sometimes you have no choice). But if the website is as simple as the one in your question, I recommend just making a request. This avoids the overhead of running a full browser, because you request only what you need.
I updated my answer to also show how you can extract the href and the text at the same time.
import requests
from lxml import html

BASE_URL = 'http://www.carboline.com'

def extract_data(tree):
    # cssselect() requires the cssselect package to be installed
    elements = [
        e
        for e in tree.cssselect('div.border h5 a')
        if e.text is not None
    ]
    return elements

def build_data(data):
    # Pair each anchor's absolute link with its text
    dataset = []
    for d in data:
        link = BASE_URL + d.get('href')
        title = d.text
        dataset.append(
            {
                'link': link,
                'title': title
            }
        )
    return dataset

def request_website(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    return r.text

response = request_website('http://www.carboline.com/products/')
tree = html.fromstring(response)
data = extract_data(tree)
dataset = build_data(data)
print(dataset)
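If lxml isn't available, the same extraction logic can be sketched with the standard library alone, run here against the sample HTML from the question. This is a minimal offline illustration of the idea, not the lxml approach above; ProductLinkParser is a hypothetical helper name.

```python
from html.parser import HTMLParser

BASE_URL = 'http://www.carboline.com'

class ProductLinkParser(HTMLParser):
    """Collects the text and href of every <a> tag encountered
    (the sample markup only contains product anchors)."""
    def __init__(self):
        super().__init__()
        self.dataset = []
        self._href = None        # href of the currently open <a>, if any
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_anchor = True
            self._href = dict(attrs).get('href')

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_anchor = False

    def handle_data(self, data):
        # Only keep non-whitespace text that sits inside an anchor
        if self._in_anchor and data.strip():
            self.dataset.append({'link': BASE_URL + self._href,
                                 'title': data.strip()})

# Sample HTML from the question
sample = '''
<div id="borderForGrid" class="border">
  <h5 class=""><a href="/products/product-details/?prod=30AD">A/D TC-55 SEALER</a></h5>
</div>
<div id="borderForGrid" class="border">
  <h5 class=""><a href="/products/product-details/?prod=P380">Carbocrylic 3356-1</a></h5>
</div>
'''

parser = ProductLinkParser()
parser.feed(sample)
print([d['title'] for d in parser.dataset])
# -> ['A/D TC-55 SEALER', 'Carbocrylic 3356-1']
```

The same structure scales to the full page: feed() the downloaded HTML instead of the sample string.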
Upvotes: 2
Reputation: 193058
To extract all the text values within the <a> tags, e.g. ['A/D TC-55 SEALER', 'Carbocrylic 3356-1'], you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following solutions:
Using CSS_SELECTOR:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "li.topLevel[data-types='Acrylics'] h5>a[href^='/products/product-details/?prod=']")))])
Using XPATH
:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//li[@class='topLevel' and @data-types='Acrylics']//h5[@class]/a[starts-with(@href, '/products/product-details/?prod=')]")))])
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Upvotes: 1
Reputation: 168002
If you need the values of all the links, you should use the find_elements_... functions, not the find_element_... functions, as the latter return only the first match.
Recommended update to your code:
driver.get("http://www.carboline.com/products/")
for link in driver.find_elements_by_xpath("//ul[@id='productList']/descendant::*/a"):
    if link.is_displayed():
        print(link.text)
Upvotes: 1