Reputation: 155
So I am scraping listings off of Craigslist, and my lists of titles, prices, and dates are being overwritten every time the web driver goes to the next page. In the end, the only data in my .csv file and MongoDB collection is the set of listings from the last page.
I tried moving where the lists are instantiated, but they still get overwritten.
The function that extracts listing information from a page:
def extract_post_information(self):
    all_posts = self.driver.find_elements_by_class_name("result-row")
    dates = []
    titles = []
    prices = []
    for post in all_posts:
        # Split on "$" to separate the price from the rest of the listing text
        title = post.text.split("$")
        if title[0] == '':
            title = title[1]
        else:
            title = title[0]
        title = title.split("\n")
        price = title[0]
        title = title[-1]
        title = title.split(" ")
        month = title[0]
        day = title[1]
        title = ' '.join(title[2:])
        date = month + " " + day
        if not price[:1].isdigit():
            price = "0"
        price = int(price)  # convert the price string to an integer
        titles.append(title)
        prices.append(price)
        dates.append(date)
    return titles, prices, dates
The function that loads the URL and keeps going to the next page until there is no next page:
def load_craigslist_url(self):
    self.driver.get(self.url)
    while True:
        try:
            wait = WebDriverWait(self.driver, self.delay)
            wait.until(EC.presence_of_element_located((By.ID, "searchform")))
            print("Page is loaded")
            self.extract_post_information()
            WebDriverWait(self.driver, 2).until(
                EC.element_to_be_clickable((By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
        except:
            print("Last page")
            break
My main:
if __name__ == "__main__":
    filepath = '/home/diego/git_workspace/PyScrape/data.csv'  # Filepath of the written csv file
    location = "philadelphia"  # Location Craigslist searches
    postal_code = "19132"  # Postal code Craigslist uses as a base for 'MILES FROM ZIP'
    max_price = "700"  # Max price Craigslist limits the items to
    query = "graphics+card"  # Type of item you are looking for
    radius = "400"  # Radius from the postal code Craigslist limits the search to
    # s = 0
    scraper = CraigslistScraper(location, postal_code, max_price, query, radius)
    scraper.load_craigslist_url()
    titles, prices, dates = scraper.extract_post_information()
    d = [titles, prices, dates]
    export_data = zip_longest(*d, fillvalue='')
    with open('data.csv', 'w', encoding="utf8", newline='') as my_file:
        wr = csv.writer(my_file)
        wr.writerow(("Titles", "Prices", "Dates"))
        wr.writerows(export_data)
    # scraper.kill()
    scraper.upload_to_mongodb(filepath)
What I expect it to do is get all the info from one page, go to the next page, and append that page's info to the three lists (titles, prices, and dates) in the extract_post_information function. Once there are no more next pages, create a list d out of those three lists (seen in my main function).
Should I put the extract_post_information function inside the load_craigslist_url function? Or do I have to tweak where I instantiate the three lists in the extract_post_information function?
Upvotes: 0
Views: 53
Reputation: 33345
In the load_craigslist_url() function, you're calling self.extract_post_information() without saving the returned information, so each page's lists are built and then thrown away. The later call in your main only scrapes whatever page the driver ended on, which is why you only ever see the last page's listings.
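One way to fix it, sketched below assuming the rest of your CraigslistScraper class stays as posted: accumulate the three lists across pages inside load_craigslist_url() and return them, so main doesn't need a second extract_post_information() call.

def load_craigslist_url(self):
    self.driver.get(self.url)
    all_titles, all_prices, all_dates = [], [], []
    while True:
        try:
            wait = WebDriverWait(self.driver, self.delay)
            wait.until(EC.presence_of_element_located((By.ID, "searchform")))
            print("Page is loaded")
            # Save the returned lists instead of discarding them
            titles, prices, dates = self.extract_post_information()
            all_titles.extend(titles)
            all_prices.extend(prices)
            all_dates.extend(dates)
            WebDriverWait(self.driver, 2).until(
                EC.element_to_be_clickable((By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
        except:
            print("Last page")
            break
    return all_titles, all_prices, all_dates

Then in your main, replace the two separate calls with a single one:

titles, prices, dates = scraper.load_craigslist_url()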
Upvotes: 1