Luca

Reputation: 35

Web Scraper not pulling text

I'm trying to make a web scraper that grabs data from the Milwaukee tools website. I can make a request and download the page, but I can't seem to get the title text. All I get is <div class="result-title" v-html="result.Title"></div>, which is not the data I need. What I want it to return is <div class="result-title">M18 FUEL™ HAMMERVAC™ 1-1/8” Dedicated Dust Extractor</div>, which is the first entry on the website. This is my code:

from bs4 import BeautifulSoup
import requests

html_text = requests.get('https://www.milwaukeetool.com/Products/Power-Tools/Drilling').text
soup = BeautifulSoup(html_text, 'html5lib')
tools = soup.find('div', class_ = 'product-listing__result')
name = tools.find('div', class_ = 'result-title')
print(name)
sku = tools.find('div', class_="result-sku")

Any help is appreciated.

Upvotes: 0

Views: 107

Answers (2)

Allan

Reputation: 36

If you want the list of result titles (in no particular order), you can request them from the site's product-listing API:

import requests
import json

params = {
    'Availability': False,
    'Categories': 'BA16CBC0-793E-407A-AED6-DDBB1359AA10',
    'FacetList': '501a4ad5-8e79-40bf-b125-d1cdc48d49ea|d77896af-2dc6-4e90-8d40-3a7c211bb04e|135e489f-c19c-46cd-834f-93092fe8da25|a438883b-2015-4f97-91c8-9a1f1fc5de40',
    'Fuel': False,
    'Language': 'en',
    'NumberFacetValues': 8,
    'OneKey': False
}

page = requests.post('https://www.milwaukeetool.com/api/sitecore/products/GetProductsByProductListingQuery', params=params)
data = json.loads(page.content)

for item in data['Results']:
    print(item['Title'])

    # To inspect the whole item and its properties, uncomment this:
    # print(item)

    # Build and print the product URL from the item's categories and SKU
    print("https://www.milwaukeetool.com/Products/" + item['PrimaryCategory'].replace(" ", "-") + "/" + item['SecondaryCategory'].replace(" ", "-") + "/" + item['Sku'])
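The URL construction above can also be pulled into a small helper, which makes it easier to check without hitting the site. This is only a sketch: the names product_url, BASE_URL and sample_item are made up for illustration, and the sample field values are an assumption about what the API returns, based on the keys used above.

```python
BASE_URL = "https://www.milwaukeetool.com/Products/"


def product_url(item):
    """Build a product page URL from one entry of data['Results'].

    Mirrors the string concatenation above: spaces in the category
    names are replaced with hyphens and the SKU is appended last.
    """
    primary = item['PrimaryCategory'].replace(" ", "-")
    secondary = item['SecondaryCategory'].replace(" ", "-")
    return BASE_URL + primary + "/" + secondary + "/" + item['Sku']


# Hypothetical item; real entries contain more fields than these three
sample_item = {
    'PrimaryCategory': 'Power Tools',
    'SecondaryCategory': 'Drilling',
    'Sku': '2915-DE',
}
print(product_url(sample_item))
```

Inside the loop you would then call print(product_url(item)) instead of building the string inline.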

Upvotes: 1

amd

Reputation: 364

I think you should scrape the title from each product's own page. When you request the HTML of https://www.milwaukeetool.com/Products/Power-Tools/Drilling, some tags have not been filled in yet, so the data you want is not in the response.
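To check whether a downloaded page still contains unfilled template tags like the one in the question, you could look for v-html attributes in the raw HTML. The sketch below uses only the standard library; VueBindingFinder and find_vue_bindings are hypothetical names, and the sample string is the tag quoted in the question, not a live download.

```python
from html.parser import HTMLParser


class VueBindingFinder(HTMLParser):
    """Collect the value of every v-html attribute in a document."""

    def __init__(self):
        super().__init__()
        self.bindings = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        for name, value in attrs:
            if name == "v-html":
                self.bindings.append(value)


def find_vue_bindings(html):
    finder = VueBindingFinder()
    finder.feed(html)
    return finder.bindings


# The tag the question's scraper got back
sample = '<div class="result-title" v-html="result.Title"></div>'
print(find_vue_bindings(sample))
```

A non-empty list suggests the text is filled in later by JavaScript in the browser, so requests alone will not see it.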

Fetching the product pages directly may get you the data you want:

import requests
from bs4 import BeautifulSoup
  

# Add every product URL you want a title from to this list
urls = ['https://www.milwaukeetool.com/Products/Power-Tools/Drilling/2915-DE',
        'https://www.milwaukeetool.com/Products/Power-Tools/Drilling/2706-20']

for url in urls:
    reqs = requests.get(url)
    # Parse the product page with BeautifulSoup
    soup = BeautifulSoup(reqs.text, 'html.parser')
    # Find the title heading and print its text
    title = soup.find('h1', class_='product-info__title')
    print(title.text)

EDIT

Each product URL ends with the product's Sku, so you can take the Sku of each result returned by the API and append it to get_urls. This approach is slow because you have to scrape each product page separately, but it can be useful if you need other data that the API does not return.

import requests
from bs4 import BeautifulSoup
import json


get_urls = 'https://www.milwaukeetool.com/Products/Power-Tools/Drilling/'
params = {
    'Availability': False,
    'Categories': 'BA16CBC0-793E-407A-AED6-DDBB1359AA10',
    'FacetList': '501a4ad5-8e79-40bf-b125-d1cdc48d49ea|d77896af-2dc6-4e90-8d40-3a7c211bb04e|135e489f-c19c-46cd-834f-93092fe8da25|a438883b-2015-4f97-91c8-9a1f1fc5de40',
    'Fuel': False,
    'Language': 'en',
    'NumberFacetValues': 8,
    'OneKey': False
}

page = requests.post('https://www.milwaukeetool.com/api/sitecore/products/GetProductsByProductListingQuery', params=params)
data = json.loads(page.content)


urls = [get_urls + item['Sku'] for item in data['Results']]

for url in urls:
    reqs = requests.get(url)
    # Parse the product page with BeautifulSoup
    soup = BeautifulSoup(reqs.text, 'html.parser')
    # Find the title heading and print its text
    title = soup.find('h1', class_='product-info__title')
    print(title.text)
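Since fetching the product pages one after another is what makes this slow, one possible speed-up is to run the requests in a thread pool. The following is a rough sketch that has not been tried against the site: fetch_all, fake_fetch and demo_urls are hypothetical names, and fake_fetch stands in for a real download so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL using a pool of worker threads.

    pool.map yields results in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real download (hypothetical): returns the SKU part of the URL
    return url.rsplit('/', 1)[-1]


demo_urls = [
    'https://www.milwaukeetool.com/Products/Power-Tools/Drilling/2915-DE',
    'https://www.milwaukeetool.com/Products/Power-Tools/Drilling/2706-20',
]
print(fetch_all(demo_urls, fake_fetch))
```

With the real site you would pass something like lambda url: requests.get(url).text as fetch and then parse each result with BeautifulSoup as above. Keeping max_workers small avoids sending too many requests to the server at once.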

Upvotes: 1
