smerllo

Reputation: 3375

web-scrape: get H4 attributes & href

I am trying to web-scrape a website, but I can't get access to the attributes of some fields.

Here is the code I used:

    import urllib3
    from bs4 import BeautifulSoup
    import pandas as pd

    scrap_list = pd.DataFrame()


    for path in range(10): # scroll over the categories
        for path in range(10): # scroll over the pages
            url = 'https://www.samehgroup.com/index.php?route=product/category'+str(page)+'&'+'path='+ str(path)
            req = urllib3.PoolManager()
            res = req.request('GET', URL)
            soup = BeautifulSoup(res.data, 'html.parser')
            soup.findAll('h4', {'class': 'caption'})

            # extract names
            scrap_name = [i.text.strip() for i in soup.findAll('h2', {'class': 'caption'})]
            scrap_list['product_name']=pd.DataFrame(scrap_name,columns =['Item_name'])

            # extract prices
            scrap_list['product_price'] = [i.text.strip() for i in soup.findAll('div', {'class': 'price'})]
            product_price=pd.DataFrame(scrap_price,columns =['Item_price'])

I want an output that provides me with each product and its price. I still can't get that right.

Any help would be very much appreciated.

Upvotes: 0

Views: 246

Answers (1)

Phijiwiji

Reputation: 36

I think the problem here was looping through the website's pages. I got the code below working by first building a list of URLs with numbered 'path' values corresponding to the pages on the website, and then looping through that list while applying a page number to each URL. If you only want the products from a certain page, you can select its URL from urlist by index (see the short example after the code).

    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
    import time

    urlist = []    # create a list of usable URLs to iterate through
    for i in range(1, 10):    # 9 paths, equal to the pages on the website
        urlist.append('https://www.samehgroup.com/index.php?route=product/category&path=' + str(i))

    namelist = []
    newprice = []

    for urlunf in urlist:    # first loop to get 'path'
        for n in range(100):    # second loop to get 'pages'; set at 100 to cover the website's max page of 93
            try:    # try catches when pages containing products run out
                url = urlunf + '&page=' + str(n)
                page = requests.get(url).text
                soup = BeautifulSoup(page, 'html.parser')
                products = soup.find_all('div', class_='caption')

                for prod in products:    # loop over the returned list of products for names and prices
                    name = prod.find('h4').text
                    newp = prod.find('p', class_='price').find('span', class_='price-new').text
                    namelist.append(name)    # append data to the lists outside of the loop
                    newprice.append(newp)
                time.sleep(2)
            except AttributeError:    # if there are no more products it will move to the next page
                pass

    df = pd.DataFrame()    # create df and add scraped data
    df['name'] = namelist
    df['price'] = newprice
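
If you only need the products from a single page, you can run the same inner loop against one entry of urlist instead of all of them. Here is a minimal sketch of that idea, reusing the imports and the urlist from the code above (the index 2 and the names url_single, names, prices and df_single are only illustrative):

    url_single = urlist[2]    # pick one URL from urlist by index (2 is just an example)
    names, prices = [], []

    for n in range(100):    # same paging loop as above
        try:
            page = requests.get(url_single + '&page=' + str(n)).text
            soup = BeautifulSoup(page, 'html.parser')
            for prod in soup.find_all('div', class_='caption'):
                name = prod.find('h4').text
                newp = prod.find('p', class_='price').find('span', class_='price-new').text
                names.append(name)
                prices.append(newp)
            time.sleep(2)
        except AttributeError:    # page with no products: skip it
            pass

    df_single = pd.DataFrame({'name': names, 'price': prices})
    print(df_single.head())    # each product on that page next to its price

Either way, the final DataFrame pairs every scraped product name with its price, which is the output asked for in the question.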

Upvotes: 1
