Reputation: 21
I am trying to web-scrape the URLs that lead to product pages.
The page I scrape is this one: https://groceries.morrisons.com/browse/fruit-veg-176738
However, my code only picks up part of the information I want (the URLs of each product page). I would like to fix the code so it picks up all of the product page URLs. Can you help me solve this issue?
Here is my code:
# import required libraries
from bs4 import BeautifulSoup
import requests
# obtain page urls in meat and fish category
url='https://groceries.morrisons.com/browse/fruit-veg-176738'
# get source code from website using 'requests' library
source=requests.get(url).text
# create a BeautifulSoup object
soup=BeautifulSoup(source, 'lxml')
# find the source code of item list.
list_source=soup.find('div', class_='main-column')
# identify the location of urls of each item page
url_source=list_source.find('div', class_='fop-contentWrapper')
# get the urls
url_tail=url_source.a.attrs['href']
# full website address
url='https://groceries.morrisons.com/'+url_tail
url_list=[]
# grab all the urls using for loop
for url_source in list_source.find_all('div', class_='fop-contentWrapper'):
    url_tail=url_source.a.attrs['href']
    url='https://groceries.morrisons.com/'+url_tail
    url_list.append(url)
However, the code above only grabs 67 URLs.
len(url_list)
67
However, the expected result is to grab 439 URLs.
len(url_list)
439
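As a quick sanity check (my own sketch, not part of the original question), counting the product wrappers in the raw HTML that requests returns shows that only the first batch of tiles is present in the static source:
# Sketch: count 'fop-contentWrapper' divs in the static HTML fetched by requests
import requests
from bs4 import BeautifulSoup
url = 'https://groceries.morrisons.com/browse/fruit-veg-176738'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
print(len(soup.find_all('div', class_='fop-contentWrapper')))  # 67 in my case, not the 439 the page lists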
Upvotes: 0
Views: 226
Reputation: 11
This is a simple scraper I wrote a couple of months ago; it might help you.
import requests
from bs4 import BeautifulSoup as bs
# ask the user for the website to start with
url = requests.get(input("URL?: "))  # the Response object is stored in url
# configure BeautifulSoup
content = url.content
soup = bs(content, 'html.parser')
# list to store the links found
links = []
# loop over <a> tags that have an href and visible text
for a in soup.find_all('a', href=True):
    if a.text:
        links.append(a['href'])
# informational prints
print("Status: ", url.status_code)
print("\nHeader: ", url.headers)
print("\nSpiders: ", links)
Upvotes: 0
Reputation: 21
This is the final code that fixed my problem. It was my first time using Selenium and webdriver, so it took me some time to get webdriver working, mostly because the chromedriver installation didn't go smoothly.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import requests
import pandas as pd
import lxml.html as lh
# If the default Chrome location doesn't match the Chrome location on your machine,
# the code raises 'unknown error: cannot find chrome binary'.
options = Options()
# Path to the Google Chrome binary on my machine, saved as binary_location.
options.binary_location = "/Applications/Chrome 2.app/Contents/MacOS/Google Chrome"
# Path to the chromedriver executable on my machine; I added it because the error
# "'chromedriver' executable needs to be in PATH" kept appearing.
driver = webdriver.Chrome(options=options, executable_path='/Users/MyName/Downloads/chromedriver')
driver.get('https://groceries.morrisons.com/browse/fruit-veg-176738?display=500')
# Scroll down the page in steps, waiting 2 seconds each time, until the end of the page is reached
last_height = driver.execute_script("return document.body.scrollHeight")
h = 0
while h < last_height:
    h += 450
    time.sleep(2)
    driver.execute_script(f"window.scrollTo(0, {h});")
    print('\r', "Wait... Parsing", int(h/last_height*100), "%", end='')
html = driver.page_source
soup=BeautifulSoup(html, 'lxml')
# find the source code of item list.
list_source=soup.find('div', class_='main-column')
# grab all the urls using for loop
url_list=[]
for url_source in list_source.find_all('div', class_='fop-contentWrapper'):
    url_tail=url_source.a.attrs['href']
    url='https://groceries.morrisons.com/'+url_tail
    url_list.append(url)
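As an optional final step (a sketch of my own, since pandas is imported above but not used in the answer, and the file name here is just an example), the list can be checked, saved to a CSV, and the browser closed:
# Sketch: verify the count, save the URLs, and close the browser
print(len(url_list))  # should now match the ~439 products the page lists
pd.DataFrame({'url': url_list}).to_csv('morrisons_fruit_veg_urls.csv', index=False)
driver.quit()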
Upvotes: 0
Reputation: 11
The problem may be in the loading of the page. As I understand it, requests only gets the "visible" part of the page, since more items load as you scroll down. To scrape the full page with requests you would have to monitor the page's network activity (F12 - Network in Chrome) to see what requests the page makes when you scroll down. Below is my Selenium solution for scraping the full page: it scrolls down the page every 2 seconds until the end of the page is reached, so that every item loads, and then parses the result with bs4.
In [69]: from bs4 import BeautifulSoup
...: from selenium import webdriver
...: import time
...: driver = webdriver.Chrome()
...: driver.get('https://groceries.morrisons.com/browse/fruit-veg-176738?display=500')
...:
...: #Scrolling the page every 2 second to the end of the page
...: last_height = driver.execute_script("return document.body.scrollHeight")
...: h=0
...: while h<last_height:
...:     h += 450
...:     time.sleep(2)
...:     driver.execute_script(f"window.scrollTo(0, {h});")
...:
...:     print('\r', "Wait... Parsing", int(h/last_height*100), "%" , end='')
...:
...: html = driver.page_source
...: soup=BeautifulSoup(html, 'lxml')
...: list_source=soup.find('div', class_='main-column')
...: url_source=list_source.find('div', class_='fop-contentWrapper')
...: url_tail=url_source.a.attrs['href']
...: url='https://groceries.morrisons.com/'+url_tail
...: url_list=[]
...: len(list_source.find_all('div', class_='fop-contentWrapper'))
Wait... Parsing 100 %
This is the result:
Out[69]: 439
Correct me, please, if I am wrong.
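One possible refinement to the scrolling loop (my own sketch, not part of the answer above): because more products keep loading as you scroll, document.body.scrollHeight can grow during the loop, so re-reading it on each step is a bit more robust than measuring it once up front.
# Sketch: re-read the page height on every step and stop once the bottom is reached
h = 0
while True:
    h += 450
    time.sleep(2)
    driver.execute_script(f"window.scrollTo(0, {h});")
    last_height = driver.execute_script("return document.body.scrollHeight")
    if h >= last_height:
        break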
Upvotes: 1
Reputation: 386
@CODPUD is correct: the entire web page is not loading. A requests.get does not render JavaScript.
To test this (in Chrome), visit your target web page, open Chrome DevTools (F12), click "Console", and press Ctrl+Shift+P to pull up the command window.
Next, type "Disable JavaScript" and select that option when it shows up. Now press Ctrl+R to refresh the page; this is the "view" that your web scraper gets. Notice that as you scroll down, the product elements are blank.
CODPUD's solution is correct, though you can also just scroll to the bottom of the webpage and time.sleep for 2 seconds to ensure everything loads.
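A minimal version of that idea (my own sketch, reusing the driver and time imports from the answers above):
# Sketch: jump straight to the bottom of the page and wait for the content to load
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)
html = driver.page_source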
Upvotes: 0