Reputation: 161
I am trying to scrape this site:
https://www.lanebryant.com/perfect-sleeve-swing-tunic-top/prd-356831#color/0000009320
I want to get the type of clothing, i.e. its category.
There is a script on the page:
How can I collect this text and get the category of the clothing which I have highlighted in the image? I have tried the following code but it returns nothing.
type = d.find_element_by_xpath("//script[@type='text/javascript']").text
print("hiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii"+type)
Here, d is the driver.
Upvotes: 0
Views: 1161
Reputation: 33384
Here you go:
1. Get the innerHTML of the script tag.
2. Convert it into JSON format.
3. Use the pageName parameter and then get the value tops.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import json
driver = webdriver.Chrome()
driver.get('https://www.lanebryant.com/perfect-sleeve-swing-tunic-top/prd-356831')
# Wait for the script that defines window.lanebryantDLLite, then read its source
item = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.XPATH, "//script[@type='text/javascript'][contains(.,'window.lanebryantDLLite')]")
    )
).get_attribute('innerHTML')
itemtext = item.split("=")[1].split(";")[0]  # Right-hand side of the assignment, still a string
itemjson = json.loads(itemtext.strip())  # Parse the string into a Python dict
itemtop = itemjson['page']['pageName']  # Look up the page name
print(itemtop.split(':')[1].strip())  # Keep only the part after ":", i.e. the category
driver.quit()  # Close the browser when done
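The split("=") and split(";") step above only works while the JSON payload itself contains no "=" or ";" characters. Below is a minimal sketch of a more tolerant alternative, assuming the script text has the form window.lanebryantDLLite = {...}; followed by other JavaScript. It relies on json.JSONDecoder.raw_decode, which parses one JSON value from the start of a string and ignores whatever follows. The extract_assigned_json helper and the sample_script string are made up for illustration; the real page's data may be shaped differently.

```python
import json


def extract_assigned_json(script_text, variable_name):
    """Return the JSON object assigned to variable_name in script_text."""
    # Find the variable name, then the first "{" that follows it.
    start = script_text.index(variable_name)
    brace = script_text.index("{", start)
    # raw_decode returns (value, end_index); trailing JavaScript such as
    # ";" or further statements is simply left unparsed.
    value, _end = json.JSONDecoder().raw_decode(script_text[brace:])
    return value


# Hypothetical sample, NOT copied from the real page.
sample_script = (
    'window.lanebryantDLLite = {"page": {"pageName": "Clothing : Tops", '
    '"url": "https://example.com/?a=1;b=2"}}; var other = 1;'
)

data = extract_assigned_json(sample_script, "window.lanebryantDLLite")
category = data["page"]["pageName"].split(":")[1].strip()
print(category)
```

With Selenium, the string passed as script_text would be the innerHTML value read from the script element.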
Hope this helps.
Upvotes: 1
Reputation: 436
One problem with your current approach is that the XPath matches every script on the page, and find_element_by_xpath returns only the first match. You need to narrow the selector down.
This finds the correct script and then collects the category with the help of regex:
from lxml import html
import requests
import re
# Regex that captures the text between '"category": "' and '", "CategoryID"'
# (non-greedy, so it stops at the first closing marker)
category_regex = re.compile(r'(?<="category": ").*?(?=", "CategoryID")')
page = requests.get('https://www.lanebryant.com/perfect-sleeve-swing-tunic-top/prd-356831#color/0000009320')
tree = html.fromstring(page.content)
# Select only the script whose text contains the page data
information = tree.xpath("//script[contains(text(), '\"page\": { \"pageName\": \"Clothing :')]/text()")
print(category_regex.findall(str(information)))
Output: ['Tops']
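To see what the lookbehind and lookahead do without hitting the site, here is a small sketch that runs the same kind of pattern against a sample string. The sample fragment and its "CategoryID" value are invented for illustration and only mimic the shape the XPath above assumes.

```python
import re

# The lookbehind requires '"category": "' right before the match and the
# lookahead requires '", "CategoryID"' right after it; neither is captured.
category_regex = re.compile(r'(?<="category": ").*?(?=", "CategoryID")')

# Hypothetical fragment of the script text, NOT copied from the real page.
sample = (
    '"page": { "pageName": "Clothing : Tops", '
    '"category": "Tops", "CategoryID": "cat123" }'
)

matches = category_regex.findall(sample)
print(matches)  # ['Tops']
```

Because the pattern has no capturing groups, findall returns the full matched text, which is why only the category value comes back.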
Upvotes: 0
Reputation: 1766
Try something like this:
type = d.find_element_by_xpath('//script[@type="text/javascript"]').text
Also count the script tags in the page source.
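A rough sketch of counting script tags, using only the standard library's html.parser. The ScriptCounter class and the sample_html string are made up for illustration; with Selenium you would feed it d.page_source instead.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Counts the <script> start tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lowercase by HTMLParser.
        if tag == "script":
            self.count += 1


# Hypothetical page source, NOT copied from the real site.
sample_html = """
<html><head>
<script type="text/javascript">var a = 1;</script>
<script type="text/javascript">window.lanebryantDLLite = {"page": {}};</script>
</head><body>
<script src="app.js"></script>
</body></html>
"""

counter = ScriptCounter()
counter.feed(sample_html)
print(counter.count)
```

If the count is greater than one, a bare //script XPath is ambiguous and the selector needs an extra condition to pick the right tag.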
Upvotes: 0