Jerry

Reputation: 77

How to scrape a site with pagination using Python BeautifulSoup

I'm new to programming and I'm having trouble scraping all pages with Python and BeautifulSoup. I figured out how to scrape the first page, but I'm lost on how to do all of them.

Here is the code:
#!/usr/bin/python
# -*- encoding: utf-8 -*-
from urllib2 import urlopen
import json
from BeautifulSoup import BeautifulSoup

defaultPage = 1
items = []
url = "https://www.nepremicnine.net/oglasi-prodaja/ljubljana-mesto/stanovanje/%d/"

def getWebsiteContent(page=defaultPage):
    return urlopen(url % (page)).read()

def writeToFile(content):
    file = open("nepremicnine1.json", "w+")
    json.dump(content, file)
    # file.write(content)
    file.close()

def main():

    content = getWebsiteContent(page=defaultPage)
    soup = BeautifulSoup(content)
    posesti = soup.findAll("div", {"itemprop": "itemListElement"})

    for stanovanja in posesti:
        item = {}
        item["Naslov"] = stanovanja.find("span", attrs={"class": "title"}).string
        item["Velikost"] = stanovanja.find("span", attrs={"class": "velikost"}).string
        item["Cena"] = stanovanja.find("span", attrs={"class": "cena"}).string
        item["Slika"] = stanovanja.find("img", src = True)["src"]

        items.append(item)

        writeToFile(items)

main()

I want to loop through the pages so that the %d in the URL increases by 1 each time, since the following pages are numbered 2, 3, and so on.

All help is highly appreciated.

Upvotes: 1

Views: 801

Answers (1)

Gabriel Belini

Reputation: 779

You're not incrementing your defaultPage variable.

Your approach is correct. You just have to increment the defaultPage variable every time you finish scraping a page:

def main():
    global defaultPage  # Needed because defaultPage is reassigned inside this function.
    numPages = 10  # Placeholder value: set this to the real number of result pages.

    while defaultPage <= numPages:  # Loop through all pages.
        content = getWebsiteContent(page=defaultPage)
        soup = BeautifulSoup(content)
        posesti = soup.findAll("div", {"itemprop": "itemListElement"})

        for stanovanja in posesti:
            item = {}
            item["Naslov"] = stanovanja.find("span", attrs={"class": "title"}).string
            item["Velikost"] = stanovanja.find("span", attrs={"class": "velikost"}).string
            item["Cena"] = stanovanja.find("span", attrs={"class": "cena"}).string
            item["Slika"] = stanovanja.find("img", src=True)["src"]

            items.append(item)

        defaultPage += 1  # Move on to the next page.

    writeToFile(items)  # Write once, after every page has been scraped.

I think this should work.
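
If you don't know the number of pages in advance, one option is to keep requesting pages until one comes back with no listings. This is only a sketch (written for Python 3) under an assumption I haven't verified for this site: that a page number past the last one returns a page without any itemListElement entries. The names scrape_all_pages, fetch_page, parse_items and max_pages are hypothetical; fetch_page stands in for getWebsiteContent and parse_items for the BeautifulSoup loop in main().

```python
def scrape_all_pages(fetch_page, parse_items, first_page=1, max_pages=500):
    """Collect items page by page until a page yields no items.

    fetch_page(page) returns the raw HTML for that page number
    (hypothetical stand-in for getWebsiteContent).
    parse_items(html) returns a list of dicts
    (hypothetical stand-in for the BeautifulSoup loop in main()).
    max_pages is a safety cap so that a site which never returns an
    empty page cannot cause an endless loop.
    """
    all_items = []
    page = first_page
    while page < first_page + max_pages:
        page_items = parse_items(fetch_page(page))
        if not page_items:
            # Assumption: an empty result means we ran past the last page.
            break
        all_items.extend(page_items)
        page += 1
    return all_items


# Usage with fake data instead of real HTTP requests.
fake_site = {1: "a,b", 2: "c", 3: ""}


def fake_fetch(page):
    # Pretend every unknown page is empty.
    return fake_site.get(page, "")


def fake_parse(html):
    # Turn "a,b" into [{"Naslov": "a"}, {"Naslov": "b"}]; "" gives [].
    return [{"Naslov": name} for name in html.split(",") if name]


result = scrape_all_pages(fake_fetch, fake_parse)
print(result)
```

Passing the fetch and parse steps in as functions keeps the loop itself easy to try out without hitting the real site; in the actual script you would pass getWebsiteContent and a function wrapping the parsing code, then call writeToFile on the returned list.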

Upvotes: 1
