Tarzan

Reputation: 124

Python Web Scraping using BS

I have a web-scraping program that fetches multiple pages, but right now I have to hard-code a limit into the while loop. I want a condition that stops the loop once it reaches the last page, or once it recognizes there are no more items to scrape. Assume I don't know how many pages exist. How do I change the while-loop condition so it stops without an arbitrary number?

import requests
from bs4 import BeautifulSoup
import csv

filename="output.csv"
f=open(filename, 'w', newline="",encoding='utf-8')
headers="Date, Location, Title, Price\n"
f.write(headers)

i=0
while i<5000:
    if i==0:
        page_link="https://portland.craigslist.org/search/sss?query=xbox&sort=date"
    else:
        page_link="https://portland.craigslist.org/search/sss?s={}&query=xbox&sort=date".format(i)
    res=requests.get(page_link)
    soup=BeautifulSoup(res.text,'html.parser')
    for container in soup.select('.result-info'):
        date=container.select('.result-date')[0].text
        try:
            location=container.select('.result-hood')[0].text
        except:
            try:
                location=container.select('.nearby')[0].text 
            except:
                location='NULL'
        title=container.select('.result-title')[0].text
        try:
            price=container.select('.result-price')[0].text
        except:
            price="NULL"
        print(date,location,title,price)
        f.write(date+','+location.replace(","," ")+','+title.replace(","," ")+','+price+'\n')
    i+=120
f.close()

Upvotes: 0

Views: 163

Answers (1)

furas

Reputation: 142651

Use while True to run an endless loop, and break to exit when there is no more data:

    data = soup.select('.result-info')
    if not data:
        print('END: no data:')
        break

I use the csv module to save the data, so I don't have to use replace(","," ").
It automatically wraps a field in double quotes when the field contains a comma.
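For example, a quick demonstration of that quoting, writing to an in-memory buffer instead of a file:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# A field containing a comma is automatically wrapped in double quotes
writer.writerow(["Dec 1", "Portland, OR", "Xbox One", "$150"])
print(buf.getvalue())  # Dec 1,"Portland, OR",Xbox One,$150
```

A csv.reader will strip the quotes again on the way back in, so the column structure survives the round trip.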

s={} can appear anywhere after the ?, so I put it at the end to make the code more readable.
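If you would rather not think about parameter order at all, you can build the query string from a dict; a small stdlib sketch using urllib.parse.urlencode:

```python
from urllib.parse import urlencode

base = 'https://portland.craigslist.org/search/sss'
# Build the query string from a dict so parameter order never matters
params = {'query': 'xbox', 'sort': 'date', 's': 120}
url = '{}?{}'.format(base, urlencode(params))
print(url)  # https://portland.craigslist.org/search/sss?query=xbox&sort=date&s=120
```

(requests can do the same thing if you pass params= to requests.get, which also handles URL-escaping for you.)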

The portal returns the first page even when you send s=0, so there is no need to treat i == 0 as a special case
(BTW: in my code the counter has the more readable name offset).

Full code:

import requests
from bs4 import BeautifulSoup
import csv

filename = "output.csv"

f = open(filename, 'w', newline="", encoding='utf-8')

csvwriter = csv.writer(f)

csvwriter.writerow( ["Date", "Location", "Title", "Price"] )

offset = 0

while True:
    print('offset:', offset)

    url = "https://portland.craigslist.org/search/sss?query=xbox&sort=date&s={}".format(offset)

    response = requests.get(url)
    if response.status_code != 200:
        print('END: request status:', response.status_code)
        break

    soup = BeautifulSoup(response.text, 'html.parser')

    data = soup.select('.result-info')
    if not data:
        print('END: no data:')
        break

    for container in data:
        date = container.select('.result-date')[0].text

        try:
            location = container.select('.result-hood')[0].text
        except IndexError:
            try:
                location = container.select('.nearby')[0].text
            except IndexError:
                location = 'NULL'
        #location = location.replace(","," ") # don't need it with `csvwriter`

        title = container.select('.result-title')[0].text

        try:
            price = container.select('.result-price')[0].text
        except IndexError:
            price = "NULL"
        #title.replace(",", " ") # don't need it with `csvwriter`

        print(date, location, title, price)

        csvwriter.writerow( [date, location, title, price] )

    offset += 120

f.close()

Upvotes: 1
