Reputation: 159
Below I have code that pulls records off Craigslist. Everything works great, but I need to be able to go to the next set of records and repeat the same process, and being new to programming I am stuck. From looking at the page source, it looks like I should be clicking the arrow button shown here until it no longer contains an href:
<a href="/search/syp?s=120" class="button next" title="next page">next > </a>
I was thinking this might be a loop within a loop, but I suppose it could be a try/except situation too. Does that sound right? How would you implement it?
import requests
from urllib.request import urlopen
import pandas as pd

response = requests.get("https://nh.craigslist.org/d/computer-parts/search/syp")
soup = BeautifulSoup(response.text, "lxml")
listings = soup.find_all('li', class_="result-row")
base_url = 'https://nh.craigslist.org/d/computer-parts/search/'
next_url = soup.find_all('a', class_="button next")

dates = []
titles = []
prices = []
hoods = []

while base_url !=
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)
        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)
        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')
        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')

# write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles': titles, 'Price': prices, 'Location': hoods})
# write to a file
listings_df.to_csv("craigslist_listings.csv")
Upvotes: 1
Views: 1774
Reputation: 168
For each page you crawl, you can find the next URL to crawl and add it to a list.
This is how I would do it without changing your code too much. I added some comments so you can follow what's happening; leave me a comment if you need any extra explanation:
import requests
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup

base_url = 'https://nh.craigslist.org/d/computer-parts/search/syp'
base_search_url = 'https://nh.craigslist.org'

urls = []
urls.append(base_url)

dates = []
titles = []
prices = []
hoods = []

while len(urls) > 0:  # while we have urls to crawl
    print(urls)
    url = urls.pop(0)  # removes the first element from the list of urls
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")

    next_url = soup.find('a', class_="button next")  # finds the next url to crawl
    if next_url:  # if a next button was found (find returns None otherwise)
        urls.append(base_search_url + next_url['href'])  # adds next url to the list of urls to crawl

    listings = soup.find_all('li', class_="result-row")  # get all current url listings

    # this is your code unchanged
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)
        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)
        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')
        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')

# write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles': titles, 'Price': prices, 'Location': hoods})
# write to a file
listings_df.to_csv("craigslist_listings.csv")
Edit: You are also forgetting to import BeautifulSoup in your code; I added the import in my version.
Edit2: You only need to find the first instance of the next button, since the page can (and in this case does) have more than one next button.
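A quick illustration of the difference between the two calls:

next_url = soup.find('a', class_="button next")      # first match only, or None
all_next = soup.find_all('a', class_="button next")  # a list of every match on the page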
Edit3: For this to crawl computer parts, base_url should be changed to the one used in the code above.
Upvotes: 2
Reputation: 25
This is not a direct answer to how to access the "next" button, but it may be a solution to your problem. When I've web scraped in the past, I have used the URLs of each page to loop through search results. On Craigslist, when you click "next page" the URL changes, and there's usually a pattern to this change you can take advantage of. I didn't take too long a look, but it appears that the second page of results is https://nh.craigslist.org/search/syp?s=120 and the third is https://nh.craigslist.org/search/syp?s=240, so the final part of the URL increases by 120 each time. You could create a list of multiples of 120, build a for loop that appends each value to the end of the URL, and then nest your current for loop inside it, as sketched below.
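A minimal sketch of that idea, assuming 120 results per page; max_offset is a hypothetical cutoff for illustration, and in practice you would stop as soon as a page returns no result rows:

import requests
from bs4 import BeautifulSoup

base_url = "https://nh.craigslist.org/search/syp"
max_offset = 480  # hypothetical upper bound, for illustration only

for offset in range(0, max_offset + 1, 120):  # 0, 120, 240, ...
    # requests builds the query string for us, e.g. .../search/syp?s=120
    response = requests.get(base_url, params={"s": offset})
    soup = BeautifulSoup(response.text, "lxml")
    listings = soup.find_all('li', class_="result-row")
    if not listings:  # ran past the last page of results
        break
    for listing in listings:
        # process each listing exactly as in the question's inner loop
        ...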
Upvotes: 1