Reputation: 280
I am currently working on a web crawler. I want my code to grab the text from all of the URLs I crawled. The function getLinks() finds the links I want to grab data from and puts them into an array. The array currently holds 12 links like this one: 'http://www.computerstore.nl/product/142504/category-100852/wd-green-wd30ezrx-3-tb.html'
Here is the code of my function that loops over the array of URLs I got from getLinks() and grabs data from each one. The problem I ran into is that it sometimes returns the text 6 times, sometimes 8 or 10, but not 12 times as it should.
def getSpecs():
    i = 0
    while (i < len(clinks)):
        r = (requests.get(clinks[i]))
        s = (BeautifulSoup(r.content))
        for item in s.find_all("div", {"class" :"productSpecs roundedcorners"}):
            print item.find('h3')
        i = i + 1

getLinks()
getSpecs()
How do I fix this? Please help.
Thanks in advance!
Upvotes: 2
Views: 1390
Reputation: 473763
Here is the improved code with multiple fixes:
- requests.Session maintained throughout the script life cycle
- urlparse.urljoin() to join URL parts
- CSS selectors instead of find_all()
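As a small illustration of the urljoin() point, here is a minimal sketch of how a relative href gets resolved against the site root. It is written for Python 3, where the function lives in urllib.parse (the Python 2 import is `from urlparse import urljoin`). The product path used here is a made-up placeholder, not a real page on the site.

```python
# Python 3 location of urljoin; Python 2 uses `from urlparse import urljoin`.
from urllib.parse import urljoin

base_url = "http://www.computerstore.nl"

# A relative href (hypothetical path) is resolved against the base URL.
relative = urljoin(base_url, "/product/142504/example.html")

# An href that is already absolute is returned unchanged.
absolute = urljoin(base_url, "http://www.computerstore.nl/product/142504/example.html")

print(relative)
print(absolute)
```

Both calls give the same absolute URL, so the crawler can handle either kind of href the page happens to contain.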
The code:
from urlparse import urljoin

from bs4 import BeautifulSoup
import requests

base_url = 'http://www.computerstore.nl'
curl = ["http://www.computerstore.nl/category/100852/interne-harde-schijven.html?6437=19598"]

session = requests.Session()
for url in curl:
    soup = BeautifulSoup(session.get(url).content)
    links = [urljoin(base_url, item['href']) for item in soup.select("div.product-list a.product-list-item--image-link")]

    for link in links:
        soup = BeautifulSoup(session.get(link).content)
        print soup.find('span', itemprop='name').get_text(strip=True)
It grabs every product link, follows it and prints out the product title (12 products):
WD Red WD20EFRX 2 TB
WD Red WD40EFRX 4 TB
WD Red WD30EFRX 3 TB
Seagate Barracuda ST1000DM003 1 TB
WD Red WD10EFRX 1 TB
Seagate Barracuda ST2000DM001 2 TB
Seagate Barracuda ST3000DM001 3 TB
WD Green WD20EZRX 2 TB
WD Red WD60EFRX 6 TB
WD Green WD40EZRX 4 TB
Seagate NAS HDD ST3000VN000 3 TB
WD Green WD30EZRX 3 TB
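If you still get fewer than 12 results, it may help to find out which pages produce nothing before changing the parsing. The thread does not establish the cause, so the following is only a diagnostic sketch (Python 3). The names `find_failed_pages`, `FakeResponse` and the `marker` byte string are my own assumptions for illustration. `fetch` can be any callable that returns an object with `status_code` and `content`, such as the `get` method of a `requests.Session`.

```python
from collections import namedtuple

# Hypothetical stand-in so the sketch runs without network access; a real
# requests.Response exposes the same two attributes used here.
FakeResponse = namedtuple("FakeResponse", ["status_code", "content"])


def find_failed_pages(links, fetch, marker=b'itemprop="name"'):
    """Return (url, reason) pairs for pages that would print nothing.

    `fetch` is any callable taking a URL and returning an object with
    `status_code` and `content` (for example `requests.Session().get`).
    `marker` is an assumed byte string that a good product page contains.
    """
    failed = []
    for url in links:
        response = fetch(url)
        if response.status_code != 200:
            # The server refused or failed the request.
            failed.append((url, "HTTP %d" % response.status_code))
        elif marker not in response.content:
            # The request succeeded but the expected markup is missing.
            failed.append((url, "marker not found"))
    return failed


if __name__ == "__main__":
    # Canned pages (placeholder URLs) standing in for real responses.
    pages = {
        "http://example.com/a": FakeResponse(200, b'<span itemprop="name">A</span>'),
        "http://example.com/b": FakeResponse(503, b""),
        "http://example.com/c": FakeResponse(200, b"<html>blocked</html>"),
    }
    for url, reason in find_failed_pages(sorted(pages), pages.__getitem__):
        print(url, reason)
```

With a real session you would call `find_failed_pages(links, session.get)`. A non-200 status would point at the server, while a missing marker would point at the page markup differing from what the selector expects.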
Upvotes: 2