DamonW

Reputation: 11

Python Selenium (Scrape Product Info from CJDropship)

I cannot figure out how to scrape this page. The info seems to be hidden by ng-show, and after many attempts, nothing I have found works.

Website: https://cjdropshipping.com/product/silicone-grip-device-finger-exercise-stretcher-finger-gripper-strength-trainer-strengthen-rehabilitation-training-p-1614453269613522944.html?from=HTP

I want to scrape the product description and the shipping time.

This is my current code:

from selenium import webdriver
from selenium.webdriver.common.by import By


# Set up the Chrome driver
driver = webdriver.Chrome()

# Navigate to the website
driver.get("https://cjdropshipping.com/product/silicone-grip-device-finger-exercise-stretcher-finger-gripper-strength-trainer-strengthen-rehabilitation-training-p-1614453269613522944.html?from=HTP")

# Find the element that contains the product details and read its text content
# (get_attribute returns a plain string, so there is no .text to call on it)
title = driver.find_element(By.CSS_SELECTOR, 'div > div > div > div > div > div > pro-detail > div').get_attribute("textContent")

# Print the extracted text
print(title)

# Close the driver
driver.quit()

Upvotes: 0

Views: 103

Answers (2)

Ajeet Verma

Reputation: 3031

You need to wait a few seconds for the target elements and the page contents to load before you can find them.

[Update] You also need to scroll down to the description section so that the description content loads.

Here is the updated solution:

from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

driver.get("https://cjdropshipping.com/product/silicone-grip-device-finger-exercise-stretcher-finger-gripper-strength-trainer-strengthen-rehabilitation-training-p-1614453269613522944.html?from=HTP")
WebDriverWait(driver, 2).until(EC.presence_of_element_located((By.ID, "pd-merchName")))

# Scroll down once by 1000 pixels so that the description section loads
driver.execute_script("window.scrollBy(0, 1000);")
sleep(2)

soup = BeautifulSoup(driver.page_source, 'lxml')
title_element = soup.find('div', attrs={"id": "pd-merchName"}).text.strip()
print(title_element)

description1 = soup.find('div', attrs={"class": "pd-new-desc info-box"}).text.strip()
description2 = [i.text for i in soup.find('div', attrs={"id": "pd-description"}).find_all('p')]

print(description1)
print(description2)
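
If a single 1000-pixel scroll is not enough on a longer page, the scroll can be repeated in steps until the target shows up. The following is only a minimal sketch of that idea, not part of the tested solution above: `scroll_until_found` is a hypothetical helper name, the step size, step count, and pause are arbitrary defaults, and it assumes the element only gets non-empty text once the page has been scrolled far enough. It takes an existing `driver` and passes the string "css selector", which is the value of `By.CSS_SELECTOR`.

```python
import time


def scroll_until_found(driver, css_selector, step_px=1000, max_steps=10, pause_s=0.5):
    """Scroll down in fixed steps until css_selector matches an element with text.

    Returns the first matching element with non-empty text, or None if
    nothing suitable appears within max_steps scrolls.
    """
    def first_with_text():
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        for element in driver.find_elements("css selector", css_selector):
            if element.text.strip():
                return element
        return None

    for _ in range(max_steps):
        found = first_with_text()
        if found is not None:
            return found
        # Scroll the window down by step_px pixels and give the page time to load
        driver.execute_script("window.scrollBy(0, arguments[0]);", step_px)
        time.sleep(pause_s)

    # One last check after the final scroll
    return first_with_text()
```

With the driver from the code above, a call could look like `scroll_until_found(driver, "#pd-description")`, after which `driver.page_source` can be handed to BeautifulSoup as before.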

Upvotes: 0

undetected Selenium

Reputation: 193048

To extract the product info, you ideally need to use WebDriverWait with visibility_of_element_located(). You can use either of the following locator strategies:

  • Using CSS_SELECTOR and text attribute:

    driver.get('https://cjdropshipping.com/product/silicone-grip-device-finger-exercise-stretcher-finger-gripper-strength-trainer-strengthen-rehabilitation-training-p-1614453269613522944.html?from=HTP')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div#subscribe-box > img"))).click()
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div#pd-merchName > div"))).text)
    
  • Using XPATH and get_attribute("innerHTML"):

    driver.get('https://cjdropshipping.com/product/silicone-grip-device-finger-exercise-stretcher-finger-gripper-strength-trainer-strengthen-rehabilitation-training-p-1614453269613522944.html?from=HTP')
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div#subscribe-box > img"))).click()
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='pd-merchName']/div"))).get_attribute("innerHTML").strip())
    
  • Console Output:

    Silicone Grip Device Finger Exercise Stretcher Finger Gripper Strength Trainer Strengthen Rehabilitation Training
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in "How to retrieve the text of a WebElement using Selenium - Python".
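
On the difference between the two approaches: `.text` returns only the text Selenium considers visible, so it comes back empty for an element that is currently hidden (for example by ng-show), while `get_attribute("textContent")` returns the text regardless of visibility. A small sketch of a fallback between the two follows. `element_text` is a hypothetical helper name, not a Selenium API, and whether the hidden text on this particular page is what you want is an assumption you would need to check.

```python
def element_text(element):
    """Return the visible text of a WebElement, falling back to textContent.

    .text is empty for hidden elements, so in that case the raw
    textContent is used instead, with runs of whitespace collapsed.
    """
    visible = element.text.strip()
    if visible:
        return visible
    # textContent may be None if the attribute cannot be read
    raw = element.get_attribute("textContent") or ""
    return " ".join(raw.split())
```

For example, `element_text(driver.find_element(By.ID, "pd-merchName"))` would return the product name once that element is present.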


Upvotes: 0
