Reputation: 175
I'm trying to scrape data from https://www.finishline.com using either Selenium or BeautifulSoup 4. So far I have been unsuccessful, so I've turned to Stack Overflow for help, hoping that someone knows a way around their scraping protection.
I tried both libraries. Below are some simple examples.
General imports used in my main program:
import requests
import csv
import io
import os
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from datetime import datetime
from bs4 import BeautifulSoup
BeautifulSoup 4 code:
data2 = requests.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004")
soup2 = BeautifulSoup(data2.text, 'html.parser')
x = soup2.find('h1', attrs={'id': 'title'}).text.strip()
print(x)
Selenium code:
options = Options()
options.headless = True
options.add_argument('log-level=3')
driver = webdriver.Chrome(options=options)
driver.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004")
x = driver.find_element_by_xpath("//h1[1]")
print(x)
driver.close()
Both of those snippets are attempts at getting the product title from the product page.
The BeautifulSoup 4 snippet sometimes just hangs and does nothing, and other times it raises
requests.exceptions.ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')"))
The Selenium snippet returns
<selenium.webdriver.remote.webelement.WebElement (session="b3707fb7d7b201e2fa30dabbedec32c5", element="0.10646785765405364-1")>
which means it did find the element, but when I try to convert it to text by changing
x = driver.find_element_by_xpath("//h1[1]")
to
x = driver.find_element_by_xpath("//h1[1]").text
it returns Access Denied, which is also what the site itself sometimes returns in the browser. In the browser, it can be bypassed by clearing cookies.
Does anyone know of a way to scrape data from this website? Thanks in advance.
Upvotes: 5
Views: 557
Reputation: 1620
The server rejects the request because of its User-Agent (by default, requests does not identify itself as a browser). I added a browser User-Agent header to the request:
import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent instead of the requests default
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
data2 = requests.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004", headers=headers)
soup2 = BeautifulSoup(data2.text, 'html.parser')
x = soup2.find('h1', attrs={'id': 'title'}).text.strip()
print(x)
Output:
Men's Nike Air Max 95 SE Casual Shoes
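For what it's worth, here is a slightly more defensive sketch of the same idea. It passes a timeout so the call cannot wait indefinitely (the "gets stuck" symptom from the question), raises on HTTP error statuses, and returns None instead of crashing when the h1 is missing. The helper names `extract_title` and `fetch_title` and the 10-second timeout are my own assumptions, and the User-Agent may stop working if the site changes its checks.

```python
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent taken from the snippet above.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
    )
}


def extract_title(html):
    """Return the stripped text of <h1 id="title">, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1", attrs={"id": "title"})
    return heading.text.strip() if heading is not None else None


def fetch_title(url, timeout=10):
    """Download the page and return its title (hypothetical helper)."""
    # Without a timeout, requests can wait indefinitely for the server.
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Turn 4xx/5xx responses (e.g. a 403) into requests.HTTPError.
    response.raise_for_status()
    return extract_title(response.text)
```

Calling `fetch_title(url)` with the product URL from the question returns the title string, or None if the page has no matching h1. Wrapping the call in `except requests.exceptions.RequestException` covers timeouts, connection errors, and HTTP error statuses.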
Upvotes: 3
Reputation: 1439
Try this. It works for me and returns MEN'S NIKE AIR MAX 95 SE CASUAL SHOES
from selenium import webdriver
from selenium.webdriver.common.by import By

# Start Chrome with default options (visible window, not headless)
driver = webdriver.Chrome()
driver.get("https://www.finishline.com/store/product/mens-nike-air-max-95-se-casual-shoes/prod2783292?styleId=AJ2018&colorId=004")
# find_element(By.XPATH, ...) replaces find_element_by_xpath, which Selenium 4.3 removed
x = driver.find_element(By.XPATH, '//*[@id="title"]')
print(x.text)
driver.quit()
Upvotes: 1