Reputation: 1
Sorry if this post seems like a duplicate, but I can't find a working way to do this.
import requests
from bs4 import BeautifulSoup
from lxml import etree as et
import time
import random
import csv
header = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8'
}
bucket_list = ['https://www.amazon.co.uk/Military-Analogue-Waterproof-Tactical-Minimalist/dp/B0B6C7RMQD/']
def get_product_name(dom):
    try:
        # xpath returns a list of text nodes; strip whitespace and keep the first
        names = [name.strip() for name in dom.xpath('//span[@id="productTitle"]/text()')]
        return names[0]
    except Exception:
        return 'Not Available'
with open('master_data.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['product name', 'url'])
    for url in bucket_list:
        response = requests.get(url, headers=header)
        soup = BeautifulSoup(response.content, 'html.parser')
        amazon_dom = et.HTML(str(soup))
        product_name = get_product_name(amazon_dom)
        time.sleep(random.randint(2, 5))
        writer.writerow([product_name, url])
        print(product_name, url)
I have this code that opens the link, looks for the product's name and writes it into a CSV file, but it writes nothing. How can I fix this?
Upvotes: 0
Views: 831
Reputation: 623
Amazon is a heavily dynamic website, meaning much of the page is rendered programmatically (using JS). Simply using requests is usually not enough to scrape Amazon, so the reason you don't get any result is most likely that your response doesn't actually contain the `//span[@id="productTitle"]` node your `dom.xpath()` call is looking for.
If you want to scrape Amazon, consider using Selenium.
First things first, in order to render JavaScript, you need to use an actual browser. Since your script is in Python, I recommend installing Selenium and, if needed, combining it with an HTML parser (like BeautifulSoup) to extract your data. Here is an implementation example:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

BUCKET_LIST = ['https://www.amazon.co.uk/Military-Analogue-Waterproof-Tactical-Minimalist/dp/B0B6C7RMQD/']

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)  # timeout is in seconds

titles = []
for url in BUCKET_LIST:
    driver.get(url)
    # wait until the product title has actually been rendered before reading it
    title = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '#productTitle')))
    titles.append(title.text)

driver.quit()
print(titles)
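To match your original goal of writing the results to master_data.csv, you can then dump the collected titles with the csv module. A minimal sketch; the titles list here is stand-in data in place of the Selenium loop's output:

```python
import csv

# stand-in data; in practice these come from the Selenium loop above
titles = ['Example Tactical Watch']
BUCKET_LIST = ['https://www.amazon.co.uk/Military-Analogue-Waterproof-Tactical-Minimalist/dp/B0B6C7RMQD/']

# newline='' prevents blank rows on Windows when writing with the csv module
with open('master_data.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['product name', 'url'])
    for title, url in zip(titles, BUCKET_LIST):
        writer.writerow([title, url])
```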
But then you also have to take into account that Amazon takes a lot of measures to prevent scraping, so even a real browser can get blocked or served a CAPTCHA.
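One simple mitigation you already had in your original script is pacing your requests: a randomized delay between page loads makes the traffic look less robotic. A minimal sketch (nothing Selenium-specific, the helper name is my own):

```python
import random
import time

def polite_delay(min_s=2.0, max_s=5.0):
    """Sleep for a random duration between min_s and max_s seconds; return the delay used."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# call this between driver.get(url) calls in the scraping loop
```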
Upvotes: 3