AnotherUser31

Reputation: 109

Scrape a url in Python

I'm trying to get the Adidas shoe link from a search page, but I can't figure out what I'm doing wrong.

I tried tags = soup.find("section", {"class": "productList"}).findAll("a"), but it doesn't work :(

I also tried printing every href on the page, and the desired link is not among them :(

I'm expecting it to print this:

https://www.tennisexpress.com/adidas-mens-adizero-ubersonic-50-yrs-ltd-tennis-shoes-off-white-and-signal-blue-62138


from bs4 import BeautifulSoup
import requests

url = "https://www.tennisexpress.com/search.cfm?searchKeyword=BB6892"

# Getting the webpage, creating a Response object.
response = requests.get(url)

# Extracting the source code of the page.
data = response.text

# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')

# Extracting all the <a> tags into a list.
tags = soup.find("section", {"class": "productList"}).findAll("a")

# Extracting URLs from the attribute href in the <a> tags.
for tag in tags:
    print(tag.get('href'))

Here's the HTML for that link:

<section class="productList">
  <article class="productListing">
    <a class="product" href="//www.tennisexpress.com/adidas-mens-adizero-ubersonic-50-yrs-ltd-tennis-shoes-off-white-and-signal-blue-62138" title="Men`s Adizero Ubersonic 50 Yrs LTD Tennis Shoes Off White and Signal Blue" onmousedown="return nxt_repo.product_x('38698770','1');">
      <span class="sale">SALE</span>
      <span class="image">
        <img src="//www.tennisexpress.com/prodimages/78091-DEFAULT-m.jpg" alt="Men`s Adizero Ubersonic 50 Yrs LTD Tennis Shoes Off White and Signal Blue">
      </span>
      <span class="brand"> Adidas </span>
      <span class="name"> Men`s Adizero Ubersonic 50 Yrs LTD Tennis Shoes Off White and Signal Blue </span>
      <span class="pricing">
        <strong class="listPrice">$140.00</strong>
        <strong class="percentOff">0% OFF</strong>
        <strong class="salePrice">$139.95</strong>
      </span>
      <br>
    </a>
  </article>
</section>

Upvotes: 2

Views: 3433

Answers (3)

radzak

Reputation: 3118

If you inspect the Network tab in Chrome DevTools, you can see that the products you search for are fetched by a separate request to https://tennisexpress-com.ecomm-nav.com/search.js. The response from that endpoint is a mess, so I wouldn't follow this approach.

In your code, you couldn't see the products because that request is made by JavaScript (running in your browser) after the initial page load. Neither urllib nor requests can render that content on its own. However, you can do that with Requests-HTML, which has JavaScript support (it uses Chromium behind the scenes).
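
To check this for yourself, a minimal sketch along these lines counts the a.product tags in a piece of HTML. The function name count_product_links and both HTML strings are made up for illustration: the strings only stand in for what requests might receive versus what the browser shows after JavaScript has run, and they are not the site's actual responses.

```python
from bs4 import BeautifulSoup


def count_product_links(html):
    """Return how many <a class="product"> tags appear in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select("a.product"))


# Hypothetical stand-in for the HTML that requests receives (no products yet).
static_html = '<section class="productList"></section>'

# Hypothetical stand-in for the DOM after JavaScript has filled in the products.
rendered_html = (
    '<section class="productList"><article class="productListing">'
    '<a class="product" href="//www.tennisexpress.com/example-product">x</a>'
    '</article></section>'
)

print(count_product_links(static_html))
print(count_product_links(rendered_html))
```

If the same function returns 0 for requests.get(url).text on the search page, that would be consistent with the products being loaded by JavaScript.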

Code:

from itertools import chain
from requests_html import HTMLSession

session = HTMLSession()
url = 'https://www.tennisexpress.com/search.cfm?searchKeyword=adidas+boost'
r = session.get(url)
r.html.render()

links = list(chain(*[prod.absolute_links for prod in r.html.find('.product')]))

Each absolute_links is a set, so I used chain to join all those sets together and built a list from the result.

>>> links
['https://www.tennisexpress.com/adidas-mens-barricade-2018-boost-tennis-shoes-black-and-night-metallic-62110',
 'https://www.tennisexpress.com/adidas-mens-barricade-2018-boost-tennis-shoes-white-and-matte-silver-62109',
 ...
 'https://www.tennisexpress.com/adidas-mens-supernova-glide-7-running-shoes-black-and-white-41636',
 'https://www.tennisexpress.com/adidas-womens-adizero-boston-6-running-shoes-solar-yellow-and-midnight-gray-45268']

Don't forget to install Requests-HTML with pip install requests-html.
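
As a side note on the chain call in the code above, here is a small standalone sketch of what it does, using only the standard library. The example.com URLs and the variable names are placeholders, not data from the site. chain.from_iterable(x) is equivalent to chain(*x), and wrapping the result in set() is one way to drop duplicates if the same link shows up under more than one product.

```python
from itertools import chain

# Placeholder data: one set of links per matched element,
# similar in shape to what absolute_links gives for each product.
link_sets = [
    {"https://example.com/a"},
    {"https://example.com/b"},
    {"https://example.com/a"},
]

# Flatten the sets into one list; same result as list(chain(*link_sets)).
links = list(chain.from_iterable(link_sets))

# Optional: remove duplicates and get a stable order.
unique_links = sorted(set(links))

print(links)
print(unique_links)
```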

Upvotes: 2

Paula Thomas

Reputation: 1190

Right, here's the solution:

import requests
from bs4 import BeautifulSoup as bs  # BeautifulSoup is a class, so it must be imported with from ... import

url = "https://www.tennisexpress.com/mens-adidas-tennis-shoes"
req = requests.get(url)
soup = bs(req.text, 'lxml')  # lxml because page is more xml than html
# Collect every <a class="product"> tag on the page.
arts = soup.findAll("a", class_="product")

That gives you a list of the <a> tags for all the Adidas tennis shoes; each tag's href attribute holds the link. I'm sure you can manage from there.
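
For the "from there" part, one possible sketch is below. The hrefs in the question's HTML start with // (protocol-relative), so urljoin is used to turn them into full URLs. The function name product_urls is hypothetical, and sample_html is a trimmed copy of the snippet from the question rather than a live response.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def product_urls(html, base_url):
    """Return absolute URLs for every <a class="product"> tag in html."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.find_all("a", class_="product")
    # urljoin resolves "//host/path" against the scheme of base_url.
    return [urljoin(base_url, a["href"]) for a in anchors if a.get("href")]


# Trimmed copy of the HTML shown in the question.
sample_html = (
    '<section class="productList"><article class="productListing">'
    '<a class="product" href="//www.tennisexpress.com/adidas-mens-adizero-'
    'ubersonic-50-yrs-ltd-tennis-shoes-off-white-and-signal-blue-62138">'
    '<span class="brand"> Adidas </span></a></article></section>'
)

for link in product_urls(sample_html, "https://www.tennisexpress.com/search.cfm"):
    print(link)
```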

Upvotes: 0

harshvardhan

Reputation: 805

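soup = BeautifulSoup(data, "html.parser")
# find() returns a single tag; find_all() returns a list, which has no get_text().
markup = soup.find("section", class_="productList")
markupContent = markup.get_text()

So your code would look like this:

from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen

from bs4 import BeautifulSoup

url = "https://www.tennisexpress.com/search.cfm?searchKeyword=BB6892"

r = urlopen(url).read()
soup = BeautifulSoup(r, "html.parser")
# find() returns None when the section is missing, so guard before calling get_text().
productMarkup = soup.find("section", class_="productList")
product = productMarkup.get_text() if productMarkup else ""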

Upvotes: 1
