PostMagne

Reputation: 25

Web scraping Craigslist apartment prices in Python not showing the highest-cost apartment

My script shows the max price for an apartment as $4700, but browsing the listings I can see prices over a million. Why isn't it finding that? What am I doing wrong?

import requests
import re

# fetch the first page of apartment listings
r = requests.get("http://orlando.craigslist.org/search/apa")
r.raise_for_status()

html = r.text

# pull every price out of the raw html with a regex
matches = re.findall(r'<span class="price">\$(\d+)</span>', html)
prices = map(int, matches)

print "Highest price: ${}".format(max(prices))
print "Lowest price: ${}".format(min(prices))
print "Average price: ${}".format(sum(prices)/len(prices))

Upvotes: 1

Views: 134

Answers (1)

Padraic Cunningham

Reputation: 180481

Use an html parser; bs4 is very easy to use. You can order by price by adding ?sort=pricedsc to the url, so the first match will be the highest and the last will be the lowest (for that page):

import requests
from bs4 import BeautifulSoup

r = requests.get("http://orlando.craigslist.org/search/apa?sort=pricedsc")

html = r.content

soup = BeautifulSoup(html, "html.parser")
# listings are sorted by price descending, so the first is the max
prices = [int(pr.text.strip("$")) for pr in soup.select("span.price")]
print "Highest price: ${}".format(prices[0])
print "Lowest price: ${}".format(prices[-1])
print "Average price: ${}".format(sum(prices, 0.0)/len(prices))

If you wanted the lowest price you would need to order ascending:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://orlando.craigslist.org/search/apa?sort=priceasc")

html = r.content

soup = BeautifulSoup(html, "html.parser")
# ascending sort: first listing is the cheapest, last is the most expensive
prices = [int(pr.text.strip("$")) for pr in soup.select("span.price")]
print "Highest price: ${}".format(prices[-1])
print "Lowest price: ${}".format(prices[0])
print "Average price: ${}".format(sum(prices, 0.0)/len(prices))

Now the output is very different:

Highest price: $70
Lowest price: $1
Average price: $34.89

If you want the average across all listings, you need to add more logic. By default you are only seeing 100 of 2500 results, but we can follow the pagination links:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://orlando.craigslist.org/search/apa")

html = r.content

soup = BeautifulSoup(html, "html.parser")
prices = [int(pr.text.strip("$")) for pr in soup.select("span.price")]

# link to next 100 results
nxt = soup.select_one("a.button.next")["href"]

# keep looping until we find a page with no next button
while nxt:
    url = "http://orlando.craigslist.org{}".format(nxt)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # add this page's prices to our list
    prices.extend([int(pr.text.strip("$")) for pr in soup.select("span.price")])
    nxt = soup.select_one("a.button.next")
    if nxt:
        nxt = nxt["href"]

This will give you every listing, from 1 to 2500.
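
The loop above only collects the prices, so as a minimal sketch (assuming the prices list built by that loop, and keeping the Python 2 print style used above) you could print the same statistics over the full set of listings:

print "Highest price: ${}".format(max(prices))
print "Lowest price: ${}".format(min(prices))
print "Average price: ${}".format(sum(prices, 0.0)/len(prices))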

Upvotes: 1
