Lindsay Veazey

Reputation: 135

Looping issue: BeautifulSoup only collecting some elements per page

I'm crawling through multiple pages to collect some HTML, but BeautifulSoup seems to collect only a random subset of the information on each page. I'm also using Selenium with geckodriver on Ubuntu 16.04 to click through to the next page.

# import libraries
import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import certifi
import urllib3
import pandas as pd 
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import requests

# This URL is ok according to eBay's robots.txt:
urlpage = 'https://www.ebay.com/sch/i.html?_nkw=lululemon&_sacat=15724&rt=nc&LH_Sold=1&LH_Complete=1&_pgn=6'

http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
r = http.request('GET', urlpage)
page = urllib.request.urlopen(urlpage).read()
soup = BeautifulSoup(page, 'html.parser')

# Specify containers
item_containers = soup.find_all('div', {'class': 's-item__info clearfix'})
print(len(item_containers)) # should be about 4 dozen

driver = webdriver.Firefox()

# get web page
driver.get(urlpage)

# Loop through the items
for container in item_containers:
    # If the item has a summary, then extract...:
    if container.find('h3', class_='s-item__title s-item__title--has-tags') is not None:
        # The summary
        summary = container.find('h3', class_='s-item__title s-item__title--has-tags').text
        summaries.append(summary)
        # The color
        #color = container.find('span', {'class': 's-item__dynamic s-item__dynamicAttributes2'})
        #colors.append(color)
        # The price
        price = container.find('span', attrs={'class': 'POSITIVE'}).text
        prices.append(price)

        button = driver.find_elements_by_class_name('x-pagination__control')[1]
        button.click()

        driver.refresh()
        time.sleep(20)

# driver.quit()

There are ~4 dozen elements to collect for each tag I specify per page, but after several pages I'll only have maybe a dozen. The loop logic is off. Please advise; I'm trying to improve my Python!

Upvotes: 1

Views: 193

Answers (2)

KunduK

Reputation: 33384

You can do this without Selenium. Use requests with BeautifulSoup.

from bs4 import BeautifulSoup
import requests
url="https://www.ebay.com/sch/i.html?_nkw=lululemon&_sacat=15724&rt=nc&LH_Sold=1&LH_Complete=1&_pgn=6"
html=requests.get(url).text
soup=BeautifulSoup(html,'html.parser')
summery=[]
price=[]
for item in soup.select('div.s-item__info.clearfix'):
    if item.select_one("h3.s-item__title"):
        summery.append(item.select_one("h3.s-item__title").text)
    if item.select_one("span.s-item__price"):
        price.append(item.select_one("span.s-item__price").text)

print(summery)
print(price)

For pagination you can use a while loop and set the page number to however many pages you want. For example, I have gone up to 10 pages here.

page_num=1
baseurl="https://www.ebay.com/sch/i.html?_nkw=lululemon&_sacat=15724&rt=nc&LH_Sold=1&LH_Complete=1&_pgn={}"

summery = []
price = []
while page_num<=10:
    html = requests.get(baseurl.format(page_num)).text
    soup = BeautifulSoup(html, 'html.parser')

    for item in soup.select('div.s-item__info.clearfix'):
        if item.select_one("h3.s-item__title"):
            summery.append(item.select_one("h3.s-item__title").text)
        if item.select_one("span.s-item__price"):
            price.append(item.select_one("span.s-item__price").text)

    page_num=page_num+1

print(summery)
print(price)

Upvotes: 1

Jay_jen

Reputation: 44

Your code is picking up the advertisements:

item_containers = soup.find_all('div', {'class': 's-item__info clearfix'})

The div tag "s-item__info clearfix" is also used for the advertisements shown in the left pane.
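One way to exclude the ad containers is to scope the selector to the organic results list instead of searching the whole page. This is a minimal sketch, assuming the organic results live under a `ul.srp-results` list (that selector is an assumption; verify it against the live page). It is demonstrated on a small inline HTML stand-in rather than a live request:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for an eBay results page: one organic result inside
# ul.srp-results and one advertisement outside it (structure is assumed).
html = """
<ul class="srp-results">
  <li class="s-item"><div class="s-item__info clearfix">
    <h3 class="s-item__title">Lululemon tank</h3>
  </div></li>
</ul>
<div class="s-item__info clearfix">
  <h3 class="s-item__title">Sponsored ad</h3>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Unscoped search: picks up the ad container as well.
print(len(soup.select('div.s-item__info.clearfix')))  # 2

# Scoped to the results list: containers outside it are excluded.
organic = soup.select('ul.srp-results div.s-item__info.clearfix')
print(len(organic))  # 1
```

The same scoping idea works with `find_all` by first grabbing the `ul` and then searching within it.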

Upvotes: 0
