Reputation: 77
I'm new to programming and I'm having trouble scraping all pages with Python and BeautifulSoup. I figured out how to scrape the first page, but I'm lost on how to do all of the pages.
Here is the code:
#!/usr/bin/python
# -*- encoding: utf-8 -*-
from urllib2 import urlopen
import json
from BeautifulSoup import BeautifulSoup

defaultPage = 1
items = []
url = "https://www.nepremicnine.net/oglasi-prodaja/ljubljana-mesto/stanovanje/%d/"

def getWebsiteContent(page=defaultPage):
    return urlopen(url % (page)).read()

def writeToFile(content):
    file = open("nepremicnine1.json", "w+")
    json.dump(content, file)
    # file.write(content)
    file.close()

def main():
    content = getWebsiteContent(page=defaultPage)
    soup = BeautifulSoup(content)
    posesti = soup.findAll("div", {"itemprop": "itemListElement"})
    for stanovanja in posesti:
        item = {}
        item["Naslov"] = stanovanja.find("span", attrs={"class": "title"}).string
        item["Velikost"] = stanovanja.find("span", attrs={"class": "velikost"}).string
        item["Cena"] = stanovanja.find("span", attrs={"class": "cena"}).string
        item["Slika"] = stanovanja.find("img", src = True)["src"]
        items.append(item)
    writeToFile(items)

main()
I want to loop so that the %d in the URL increases by 1 each time, because the pages are numbered 1, 2, 3, and so on.
All help is highly appreciated.
Upvotes: 1
Views: 801
Reputation: 779
You're not incrementing your defaultPage variable.
Your approach is correct. You just have to increment the defaultPage variable every time you finish scraping a page:
numPages = 10  # Hypothetical value; set it to the real number of result pages.

def main():
    global defaultPage  # Required because the function rebinds this module-level name.
    while defaultPage <= numPages:  # Loop through all pages. You also need to define the value of numPages.
        content = getWebsiteContent(page=defaultPage)
        soup = BeautifulSoup(content)
        posesti = soup.findAll("div", {"itemprop": "itemListElement"})
        for stanovanja in posesti:
            item = {}
            item["Naslov"] = stanovanja.find("span", attrs={"class": "title"}).string
            item["Velikost"] = stanovanja.find("span", attrs={"class": "velikost"}).string
            item["Cena"] = stanovanja.find("span", attrs={"class": "cena"}).string
            item["Slika"] = stanovanja.find("img", src = True)["src"]
            items.append(item)
        defaultPage += 1
    writeToFile(items)  # Write once, after every page has been collected.
I think this should work.
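If you don't know numPages in advance, one option is to keep requesting pages until one comes back with no listings. Below is a minimal sketch of that idea. It assumes the site returns a page with no matching elements once you go past the last page, which I haven't verified (it might return an error or repeat the last page instead). The names scrape_all_pages, fetch_page, parse_items and max_pages are made up for illustration, and the fake_* functions only stand in for the real download and BeautifulSoup parsing so the example runs without network access.

```python
def scrape_all_pages(fetch_page, parse_items, first_page=1, max_pages=1000):
    """Collect items page by page until a page yields no items.

    fetch_page(page) returns the raw content of one page.
    parse_items(content) returns a list of items found in that content.
    max_pages is a safety cap so the loop cannot run forever.
    """
    all_items = []
    page = first_page
    while page < first_page + max_pages:
        found = parse_items(fetch_page(page))
        if not found:
            # Assumption: an empty result means we are past the last page.
            break
        all_items.extend(found)
        page += 1
    return all_items


# Stand-in data and functions (hypothetical), used instead of real HTTP requests.
fake_site = {1: "a,b", 2: "c", 3: ""}

def fake_fetch(page):
    return fake_site.get(page, "")

def fake_parse(content):
    return [part for part in content.split(",") if part]

print(scrape_all_pages(fake_fetch, fake_parse))  # ['a', 'b', 'c']
```

In your script you would pass getWebsiteContent as fetch_page and a function wrapping the BeautifulSoup code from the inner for loop as parse_items, then call writeToFile once on the returned list.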
Upvotes: 1