A.Hamza

Reputation: 229

Unable to scrape AJAX-loaded elements on a webpage with Python

I need to scrape a webpage (its URL is in the code below). The page has a Cross Reference section that I want to scrape, but when I use Python requests to fetch the page with the following code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
# Send a browser-like User-Agent with the request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

The resulting content does not include the Cross Reference part, maybe because it has not been loaded yet. I can scrape the rest of the HTML content, but not the Cross Reference part. When I did the same thing with Selenium it worked fine, which means Selenium is able to find this element after it has loaded. Can someone guide me on how to get this done using Python requests and BeautifulSoup instead of Selenium?

Upvotes: 0

Views: 416

Answers (2)

undetected Selenium

Reputation: 193088

Selenium alone is enough to scrape the Cross References section. Induce WebDriverWait for visibility_of_all_elements_located() and use either of the following Locator Strategies:

  • Using CSS_SELECTOR:

      print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.WideSidebarProductList-list h4")))])
    
  • Using XPATH:

      print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='WideSidebarProductList-list']//h4")))])
    
  • Note : You have to add the following imports :

      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC
    
  • Console Output:

      ['CGB3B1X5R1A475M055AC', 'CL10A475MP8NNNC', 'GRM185R61A475ME11D', 'C0603C475M8PACTU']
    

Upvotes: 0

Andrej Kesely

Reputation: 195418

The data is loaded through JavaScript, but it is also embedded in the page source as escaped JSON inside the #arrow-state element, so you can extract it with requests, BeautifulSoup and the json module:

import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

t = soup.select_one('#arrow-state').text
t = t.replace('&q;', '"').replace('&g;', ">").replace('&l;', "<").replace('&a;', "&")
data = json.loads( t )

d = None
for item in data['jss']['sitecore']['route']['placeholders']['arrow-main']:
    if item['componentName'] == 'PdpWrapper':
        d = item
        break

if d:
    # The 'crossReverenceProductTiles' key is spelled this way in the page data; do not "correct" it.
    cross_reference_product_tiles = d['placeholders']['product-details'][0]['fields']['crossReferenceProductTilesCollection']['crossReverenceProductTiles']['productTiles']
    print(json.dumps(cross_reference_product_tiles, indent=4))

Prints:

[
    {
        "partId": "16571604",
        "partNumber": "CGB3B1X5R1A475M055AC",
        "productDetailUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
        "productDetailShareUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
        "productImage": "https://static5.arrow.com/pdfs/2017/4/18/7/26/14/813/tdk_/manual/010101_lowprofile_pi0402.jpg",
        "manufacturerName": "TDK",
        "productLineTitle": "Capacitor Ceramic Multilayer",
        "productDescription": "Cap Ceramic 4.7uF 10V X5R 20% Pad SMD 0603 85\u00b0C T/R",
        "datasheetUrl": "",
        "lowestPrice": 0.0645,
        "lowestPriceFormatted": "$0.0645",
        "highestPrice": 0.3133,
        "highestPriceFormatted": "$0.3133",
        "stockFormatted": "1,875",
        "stock": 1875,
        "attributes": [],
        "buyingOptionType": "AddToCart",
        "numberOfAttributesToShow": 1,
        "rrClickTrackingUrl": null,
        "pricingDataPopulated": true,
        "sourcePartId": "V72:2272_06586404",
        "sourceCode": "ACNA",
        "packagingType": "Cut Strip",
        "unitOfMeasure": "",
        "isDiscontinued": false,
        "productTileHint": null,
        "tileSize": 1,
        "tileType": "1x1",
        "suplementaryClasses": "u-height"
    },

...and so on.
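
If you only need the part numbers, the decoding step and the final extraction can be tried offline. This is just a rough sketch: the `sample` string is a made-up, heavily trimmed stand-in for the real `#arrow-state` text, and `decode_state` and `part_numbers` are hypothetical helper names, not anything the site or a library provides.

```python
import json

# Replacement pairs for the escaping seen in the '#arrow-state' text.
# '&a;' is decoded last so that an escaped ampersand is not mistaken
# for the start of another escape sequence.
_ESCAPES = (('&q;', '"'), ('&g;', '>'), ('&l;', '<'), ('&a;', '&'))


def decode_state(raw):
    """Undo the escaping and parse the result as JSON (hypothetical helper)."""
    for escaped, plain in _ESCAPES:
        raw = raw.replace(escaped, plain)
    return json.loads(raw)


def part_numbers(tiles):
    """Return the 'partNumber' of each tile, skipping tiles without one."""
    return [tile['partNumber'] for tile in tiles if 'partNumber' in tile]


# Assumed sample data, NOT the real page contents.
sample = (
    '{&q;productTiles&q;:['
    '{&q;partNumber&q;:&q;CGB3B1X5R1A475M055AC&q;,&q;manufacturerName&q;:&q;TDK&q;},'
    '{&q;partNumber&q;:&q;CL10A475MP8NNNC&q;}'
    ']}'
)

state = decode_state(sample)
print(part_numbers(state['productTiles']))
# ['CGB3B1X5R1A475M055AC', 'CL10A475MP8NNNC']
```

With the real page you would pass the list found under `productTiles` in the code above to `part_numbers` instead of the sample. The nested key path depends on the site's current markup, so it may need adjusting if the page changes.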

Upvotes: 1
