Reputation: 35
I'm trying to make a web scraper that grabs data from the Milwaukee Tool website. I can make a request and download the page, but I can't seem to get the title text.
All I get is
<div class="result-title" v-html="result.Title"></div>
which is not the data I need. What I want it to return is <div class="result-title">M18 FUEL™ HAMMERVAC™ 1-1/8” Dedicated Dust Extractor</div>
which is the first entry on the website.
This is my code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.milwaukeetool.com/Products/Power-Tools/Drilling').text
soup = BeautifulSoup(html_text, 'html5lib')
tools = soup.find('div', class_ = 'product-listing__result')
name = tools.find('div', class_ = 'result-title')
print(name)
sku = tools.find('div', class_="result-sku")
Any help is appreciated.
Upvotes: 0
Views: 107
Reputation: 36
In case you want the list of the results (Titles) in no particular order you can do this:
import requests
import json
params = {
'Availability': False,
'Categories': 'BA16CBC0-793E-407A-AED6-DDBB1359AA10',
'FacetList': '501a4ad5-8e79-40bf-b125-d1cdc48d49ea|d77896af-2dc6-4e90-8d40-3a7c211bb04e|135e489f-c19c-46cd-834f-93092fe8da25|a438883b-2015-4f97-91c8-9a1f1fc5de40',
'Fuel': False,
'Language': 'en',
'NumberFacetValues': 8,
'OneKey': False
}
page = requests.post('https://www.milwaukeetool.com/api/sitecore/products/GetProductsByProductListingQuery', params=params )
data = json.loads(page.content)
for item in data['Results']:
    print(item['Title'])
    # If you want to print/check the whole item and its properties, uncomment this
    # print(item)
    # If you want the URL of the product
    print("https://www.milwaukeetool.com/Products/" + item['PrimaryCategory'].replace(" ", "-") + "/" + item['SecondaryCategory'].replace(" ", "-") + "/" + item['Sku'])
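If you build these links in more than one place, the concatenation can be pulled into a small helper. This is only a sketch: it assumes every item carries the `PrimaryCategory`, `SecondaryCategory` and `Sku` keys used above, and the `sample` dict is a made-up stand-in for one entry of `data['Results']`, not real API output.

```python
BASE = "https://www.milwaukeetool.com/Products"

def product_url(item):
    """Build a product page URL from one result item (field names assumed from the snippet above)."""
    # Category names contain spaces, which the site replaces with hyphens in its URLs.
    categories = "/".join(
        item[key].replace(" ", "-")
        for key in ("PrimaryCategory", "SecondaryCategory")
    )
    return f"{BASE}/{categories}/{item['Sku']}"

# Hypothetical item, shaped like the entries in data['Results'].
sample = {
    "PrimaryCategory": "Power Tools",
    "SecondaryCategory": "Drilling",
    "Sku": "2915-DE",
}
print(product_url(sample))
```

Inside the loop you would then call `print(product_url(item))` instead of concatenating the strings by hand.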
Upvotes: 1
Reputation: 364
I think you should scrape the title from each product's own page, because when you fetch the HTML of this link: https://www.milwaukeetool.com/Products/Power-Tools/Drilling
some tags have not been filled in yet, so the data you want is not in the response.
Maybe this approach can help you get the data you want.
import requests
from bs4 import BeautifulSoup
# Add to this list all the URLs you want to get the title from
urls = ['https://www.milwaukeetool.com/Products/Power-Tools/Drilling/2915-DE',
        'https://www.milwaukeetool.com/Products/Power-Tools/Drilling/2706-20']
for url in urls:
    reqs = requests.get(url)
    # Parse the page with the BeautifulSoup module
    soup = BeautifulSoup(reqs.text, 'html.parser')
    # Find and display the title
    title = soup.find('h1', class_='product-info__title')
    print(title.text)
EDIT
Each product URL ends with the Sku,
so you need to take each Sku and
append it to get_urls.
This approach is slow, though, because you
have to scrape every product page. It can be useful if you need other data
that you cannot find in this API.
import requests
from bs4 import BeautifulSoup
import json
get_urls = 'https://www.milwaukeetool.com/Products/Power-Tools/Drilling/'
params = {
'Availability': False,
'Categories': 'BA16CBC0-793E-407A-AED6-DDBB1359AA10',
'FacetList': '501a4ad5-8e79-40bf-b125-d1cdc48d49ea|d77896af-2dc6-4e90-8d40-3a7c211bb04e|135e489f-c19c-46cd-834f-93092fe8da25|a438883b-2015-4f97-91c8-9a1f1fc5de40',
'Fuel': False,
'Language': 'en',
'NumberFacetValues': 8,
'OneKey': False
}
page = requests.post('https://www.milwaukeetool.com/api/sitecore/products/GetProductsByProductListingQuery', params=params)
data = json.loads(page.content)
urls = [get_urls + item['Sku'] for item in data['Results']]
for url in urls:
    reqs = requests.get(url)
    # Parse the page with the BeautifulSoup module
    soup = BeautifulSoup(reqs.text, 'html.parser')
    # Find and display the title
    title = soup.find('h1', class_='product-info__title')
    print(title.text)
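Since every URL costs one extra request, it may be worth cleaning the list before looping over it. The sketch below is not part of the original approach: `build_product_urls` is a hypothetical helper that skips results without a `Sku` and drops repeated Skus, and the `results` list is invented sample data rather than a real API response.

```python
def build_product_urls(base, results):
    """Return one URL per distinct Sku, keeping the original order."""
    seen = set()
    urls = []
    for item in results:
        sku = item.get("Sku")
        # Skip entries with no Sku, and Skus we have already queued.
        if not sku or sku in seen:
            continue
        seen.add(sku)
        urls.append(base + sku)
    return urls

get_urls = 'https://www.milwaukeetool.com/Products/Power-Tools/Drilling/'

# Made-up results: one duplicate Sku and one entry without a Sku.
results = [
    {"Sku": "2915-DE"},
    {"Sku": "2706-20"},
    {"Sku": "2915-DE"},
    {"Title": "no sku here"},
]
urls = build_product_urls(get_urls, results)
print(urls)
```

With the real response you would pass `data['Results']` in place of `results`.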
Upvotes: 1