Reputation: 229
I need to scrape a webpage (its URL is in the code below). The page has a Cross Reference section that I want to scrape, but when I use Python requests to collect the page content with the code below:
import requests
from bs4 import BeautifulSoup

url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
The resulting content does not include the Cross Reference section, probably because it has not been loaded when the HTML is returned. I can scrape the rest of the HTML, but not that section. When I did the same thing with Selenium it worked fine, which suggests Selenium can find the element once it has loaded. Can someone guide me on how to get this done using Python requests and BeautifulSoup instead of Selenium?
Upvotes: 0
Views: 416
Reputation: 193088
Selenium alone is enough to scrape the Cross References section. You need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.WideSidebarProductList-list h4")))])
Using XPATH:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='WideSidebarProductList-list']//h4")))])
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
['CGB3B1X5R1A475M055AC', 'CL10A475MP8NNNC', 'GRM185R61A475ME11D', 'C0603C475M8PACTU']
Upvotes: 0
Reputation: 195418
The data is loaded through JavaScript, but you can extract it with the requests, BeautifulSoup and json modules:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.arrow.com/en/products/lmk107bbj475mklt/taiyo-yuden'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')
t = soup.select_one('#arrow-state').text
t = t.replace('&q;', '"').replace('&g;', ">").replace('&l;', "<").replace('&a;', "&")
data = json.loads( t )
d = None
for item in data['jss']['sitecore']['route']['placeholders']['arrow-main']:
    if item['componentName'] == 'PdpWrapper':
        d = item
        break
if d:
    cross_reverence_product_tiles = d['placeholders']['product-details'][0]['fields']['crossReferenceProductTilesCollection']['crossReverenceProductTiles']['productTiles']
    print(json.dumps(cross_reverence_product_tiles, indent=4))
Prints:
[
    {
        "partId": "16571604",
        "partNumber": "CGB3B1X5R1A475M055AC",
        "productDetailUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
        "productDetailShareUrl": "/en/products/cgb3b1x5r1a475m055ac/tdk",
        "productImage": "https://static5.arrow.com/pdfs/2017/4/18/7/26/14/813/tdk_/manual/010101_lowprofile_pi0402.jpg",
        "manufacturerName": "TDK",
        "productLineTitle": "Capacitor Ceramic Multilayer",
        "productDescription": "Cap Ceramic 4.7uF 10V X5R 20% Pad SMD 0603 85\u00b0C T/R",
        "datasheetUrl": "",
        "lowestPrice": 0.0645,
        "lowestPriceFormatted": "$0.0645",
        "highestPrice": 0.3133,
        "highestPriceFormatted": "$0.3133",
        "stockFormatted": "1,875",
        "stock": 1875,
        "attributes": [],
        "buyingOptionType": "AddToCart",
        "numberOfAttributesToShow": 1,
        "rrClickTrackingUrl": null,
        "pricingDataPopulated": true,
        "sourcePartId": "V72:2272_06586404",
        "sourceCode": "ACNA",
        "packagingType": "Cut Strip",
        "unitOfMeasure": "",
        "isDiscontinued": false,
        "productTileHint": null,
        "tileSize": 1,
        "tileType": "1x1",
        "suplementaryClasses": "u-height"
    },
...and so on.
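To make the decoding step easier to follow, here is a minimal sketch that separates it into small helpers and runs them on a tiny hand-made string instead of the live page. The helper names (decode_state, find_component, part_numbers) and the sample data with its "items" and "tiles" keys are hypothetical and only for illustration; the escape sequences are the ones replaced in the code above, and the real page state may use a different layout.

```python
import json

# Escape sequences handled in the answer above. "&a;" is decoded last so that
# an escaped ampersand cannot combine with the text after it into a new sequence.
ESCAPES = (("&q;", '"'), ("&g;", ">"), ("&l;", "<"), ("&a;", "&"))


def decode_state(raw):
    """Undo the custom escaping, then parse the embedded JSON."""
    for escaped, plain in ESCAPES:
        raw = raw.replace(escaped, plain)
    return json.loads(raw)


def find_component(items, name):
    """Return the first item whose componentName equals name, or None."""
    return next((item for item in items if item.get("componentName") == name), None)


def part_numbers(tiles):
    """Collect the partNumber field of every product tile."""
    return [tile["partNumber"] for tile in tiles]


# Hypothetical sample standing in for the text of the #arrow-state element.
sample = (
    "{&q;items&q;: [{&q;componentName&q;: &q;PdpWrapper&q;, "
    "&q;tiles&q;: [{&q;partNumber&q;: &q;A&a;B&q;}]}]}"
)

state = decode_state(sample)
wrapper = find_component(state["items"], "PdpWrapper")
if wrapper is not None:
    print(part_numbers(wrapper["tiles"]))
```

With the real page you would pass the text of the #arrow-state element to decode_state and walk the same key path as in the answer. Returning None from find_component when nothing matches keeps the "if d:" style check from the original code.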
Upvotes: 1