Unable to scrape AJAX-loaded elements on a webpage with Python

Question

I need to scrape a webpage (the URL is in the code below). The page has a Cross Reference section that I want to scrape, but it is missing when I fetch the page content with Python requests using this code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

The resulting content does not include the Cross Reference part, probably because it has not been loaded yet. I can scrape the rest of the HTML, but not that section. When I did the same thing with Selenium it worked fine, which suggests Selenium can find the element once it has loaded. Can someone guide me on how to get this done with Python requests and BeautifulSoup instead of Selenium?

Andrej Kesely · Accepted Answer

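The data is rendered through JavaScript, but it is also embedded in the page source as escaped JSON (in the element with id arrow-state), so you can extract it with requests, BeautifulSoup and the json module: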

import json
import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}

soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

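# The page state is stored as escaped JSON in the element with id "arrow-state"
t = soup.select_one('#arrow-state').text
# Undo the escaping; '&a;' is replaced last so restored ampersands are not read as new escape sequences
t = t.replace('&q;', '"').replace('&g;', ">").replace('&l;', "<").replace('&a;', "&")
data = json.loads(t)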

d = None
for item in data['jss']['sitecore']['route']['placeholders']['arrow-main']:
    if item['componentName'] == 'PdpWrapper':
        d = item
        break

if d:
    # 'crossReverenceProductTiles' is spelled this way in the site's own data, so the key is kept as-is
    cross_reference_product_tiles = d['placeholders']['product-details'][0]['fields']['crossReferenceProductTilesCollection']['crossReverenceProductTiles']['productTiles']
    print(json.dumps(cross_reference_product_tiles, indent=4))

Prints:

[
    {
        "partId": "16571604",
        "partNumber": "CGB3B1X5R1A475M055AC",
        "productDetailUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
        "productDetailShareUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
        "productImage": "https://static5.arrow.com/pdfs/2017/4/18/7/26/14/813/tdk_/manual/010101_lowprofile_pi0402.jpg",
        "manufacturerName": "TDK",
        "productLineTitle": "Capacitor Ceramic Multilayer",
        "productDescription": "Cap Ceramic 4.7uF 10V X5R 20% Pad SMD 0603 85\u00b0C T/R",
        "datasheetUrl": "",
        "lowestPrice": 0.0645,
        "lowestPriceFormatted": "$0.0645",
        "highestPrice": 0.3133,
        "highestPriceFormatted": "$0.3133",
        "stockFormatted": "1,875",
        "stock": 1875,
        "attributes": [],
        "buyingOptionType": "AddToCart",
        "numberOfAttributesToShow": 1,
        "rrClickTrackingUrl": null,
        "pricingDataPopulated": true,
        "sourcePartId": "V72:2272_06586404",
        "sourceCode": "ACNA",
        "packagingType": "Cut Strip",
        "unitOfMeasure": "",
        "isDiscontinued": false,
        "productTileHint": null,
        "tileSize": 1,
        "tileType": "1x1",
        "suplementaryClasses": "u-height"
    },

...and so on.
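
If you want to check the unescape-and-parse step without hitting the site, here is a minimal offline sketch. The `sample` string is a made-up, heavily shortened stand-in for the text of `#arrow-state`, and the helper names `unescape_state` and `find_component` are my own, not part of any library. It only assumes the four escape sequences used in the code above.

```python
import json

# Escape sequences taken from the answer above. "&a;" is handled last so that
# a restored ampersand cannot combine with following text into another sequence.
REPLACEMENTS = (("&q;", '"'), ("&g;", ">"), ("&l;", "<"), ("&a;", "&"))


def unescape_state(raw):
    """Undo the escaping and parse the result as JSON (hypothetical helper)."""
    for escaped, plain in REPLACEMENTS:
        raw = raw.replace(escaped, plain)
    return json.loads(raw)


def find_component(components, name):
    """Return the first dict whose componentName matches, or None (hypothetical helper)."""
    return next((c for c in components if c.get("componentName") == name), None)


# Made-up sample that mimics the nesting used in the answer, not real site data.
sample = (
    "{&q;jss&q;:{&q;sitecore&q;:{&q;route&q;:{&q;placeholders&q;:{&q;arrow-main&q;:["
    "{&q;componentName&q;:&q;Header&q;},"
    "{&q;componentName&q;:&q;PdpWrapper&q;,&q;note&q;:&q;R&a;D &l;b&g;&q;}"
    "]}}}}}"
)

data = unescape_state(sample)
components = data["jss"]["sitecore"]["route"]["placeholders"]["arrow-main"]
wrapper = find_component(components, "PdpWrapper")
print(wrapper["note"])  # R&D <b>
```

Returning None from `find_component` when nothing matches mirrors the `if d:` check above, so a change in the page layout shows up as a missing component rather than a KeyError deep in the lookup chain.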
