henktenk

Reputation: 280

BeautifulSoup find_all does not find all

I am currently working on a web crawler. I want my code to grab the text from all of the URLs I crawled. The function getLinks() finds the links I want to grab data from and puts them into a list. The list currently holds 12 links like this one: 'http://www.computerstore.nl/product/142504/category-100852/wd-green-wd30ezrx-3-tb.html'

Here is the code of my function that loops over the list of URLs I got from getLinks() and grabs data from each one. The problem I ran into is that it sometimes prints the text 6 times, sometimes 8 or 10, but not 12 times as it should.

def getSpecs(): 
    i = 0 
    while (i < len(clinks)):
        r = (requests.get(clinks[i]))
        s = (BeautifulSoup(r.content))
        for item in s.find_all("div", {"class" :"productSpecs roundedcorners"}):
            print item.find('h3')
        i = i + 1 

getLinks()
getSpecs()

How do I fix this? Please help.

Thanks in advance!

Upvotes: 2

Views: 1390

Answers (1)

alecxe

Reputation: 473763

Here is the improved code with multiple fixes:

  • use a requests.Session maintained throughout the script life cycle
  • use urlparse.urljoin() to join URL parts
  • use CSS selectors instead of find_all()
  • improved the way products are being found on the page
  • transformed index-based loops into pythonic loops over list items

The code:

from urlparse import urljoin  # Python 2 module; in Python 3 use urllib.parse

from bs4 import BeautifulSoup
import requests

base_url = 'http://www.computerstore.nl'
# category pages to crawl
curl = ["http://www.computerstore.nl/category/100852/interne-harde-schijven.html?6437=19598"]

# one session, reused for every request
session = requests.Session()
for url in curl:
    soup = BeautifulSoup(session.get(url).content)
    # collect product links with a CSS selector and make them absolute
    links = [urljoin(base_url, item['href']) for item in soup.select("div.product-list a.product-list-item--image-link")]

    for link in links:
        # fetch each product page and print its title
        soup = BeautifulSoup(session.get(link).content)
        print soup.find('span', itemprop='name').get_text(strip=True)

It grabs every product link, follows it and prints out the product title (12 products):

WD Red WD20EFRX 2 TB
WD Red WD40EFRX 4 TB
WD Red WD30EFRX 3 TB
Seagate Barracuda ST1000DM003 1 TB
WD Red WD10EFRX 1 TB
Seagate Barracuda ST2000DM001 2 TB
Seagate Barracuda ST3000DM001 3 TB
WD Green WD20EZRX 2 TB
WD Red WD60EFRX 6 TB
WD Green WD40EZRX 4 TB
Seagate NAS HDD ST3000VN000 3 TB
WD Green WD30EZRX 3 TB
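
As a side note on the urljoin() point above, here is a minimal sketch of what it does with the two kinds of href a listing page might contain. The relative href below is a hypothetical value for illustration (I am assuming the site could emit links in that form); the absolute URL is the one from the question. The try/except import is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urlparse import urljoin  # Python 2, as used in the answer
except ImportError:
    from urllib.parse import urljoin  # Python 3 location of the same function

base_url = 'http://www.computerstore.nl'

# Hypothetical relative href, assumed for illustration only.
relative_href = '/product/142504/category-100852/wd-green-wd30ezrx-3-tb.html'
# Absolute URL taken from the question.
absolute_href = 'http://www.computerstore.nl/product/142504/category-100852/wd-green-wd30ezrx-3-tb.html'

# A relative href is resolved against the base URL.
from_relative = urljoin(base_url, relative_href)
# An already absolute href is returned unchanged.
from_absolute = urljoin(base_url, absolute_href)

print(from_relative)
print(from_absolute)
```

Both calls give the same absolute URL, so the crawling loop does not need to care which form the page uses.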

Upvotes: 2
