Diego Delgado

Reputation: 155

List is being overwritten

So I am scraping listings off of Craigslist, and my lists of titles, prices, and dates are being overwritten every time the web driver goes to the next page. In the end, the only data in my .csv file and MongoDB collection is the listings from the last page.

I tried moving where the three lists are instantiated, but they still get overwritten.

The function that extracts listing information from a page:

    def extract_post_information(self):
        all_posts = self.driver.find_elements_by_class_name("result-row")

        dates = []
        titles = []
        prices = []

        for post in all_posts:
            # Splitting on "$" separates the price from the rest of the row text
            title = post.text.split("$")

            if title[0] == '':
                title = title[1]
            else:
                title = title[0]

            title = title.split("\n")
            price = title[0]

            title = title[-1]
            title = title.split(" ")
            month = title[0]
            day = title[1]
            title = ' '.join(title[2:])
            date = month + " " + day

            if not price[:1].isdigit():
                price = "0"
            price = int(price)

            titles.append(title)
            prices.append(price)
            dates.append(date)

        return titles, prices, dates

The function that goes to the URL and keeps going to the next page until there is no next page:

    def load_craigslist_url(self):
        self.driver.get(self.url)
        while True:
            try:
                wait = WebDriverWait(self.driver, self.delay)
                wait.until(EC.presence_of_element_located((By.ID, "searchform")))
                print("Page is loaded")
                self.extract_post_information()
                WebDriverWait(self.driver, 2).until(
                    EC.element_to_be_clickable((By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
            except:
                print("Last page")
                break

My main:

    if __name__ == "__main__":
        filepath = '/home/diego/git_workspace/PyScrape/data.csv'  # Filepath of written csv file
        location = "philadelphia"  # Location Craigslist searches
        postal_code = "19132"  # Postal code Craigslist uses as a base for 'MILES FROM ZIP'
        max_price = "700"  # Max price Craigslist limits the items to
        query = "graphics+card"  # Type of item you are looking for
        radius = "400"  # Radius from postal code Craigslist limits the search to
        # s = 0

        scraper = CraigslistScraper(location, postal_code, max_price, query, radius)

        scraper.load_craigslist_url()

        titles, prices, dates = scraper.extract_post_information()

        d = [titles, prices, dates]

        export_data = zip_longest(*d, fillvalue='')
        with open('data.csv', 'w', encoding="utf8", newline='') as my_file:
            wr = csv.writer(my_file)
            wr.writerow(("Titles", "Prices", "Dates"))
            wr.writerows(export_data)
            # the with block closes the file automatically
            # scraper.kill()
        scraper.upload_to_mongodb(filepath)

What I expect it to do is get all the info from one page, go to the next page, get all of that page's info, and append it to the three lists (titles, prices, and dates) in the extract_post_information function. Once there are no more next pages, I create a list d out of those three lists (seen in my main function).

Should I put the extract_post_information function in the load_craigslist_url function? Or do I have to tweak where I instantiate the three lists in the extract_post_information function?

Upvotes: 0

Views: 53

Answers (1)

John Gordon

Reputation: 33345

In the load_craigslist_url() function, you're calling self.extract_post_information() without saving the returned information. Each call builds fresh lists for just the page currently loaded, so the separate extract_post_information() call in your main only ever sees the last page.
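A minimal sketch of one way to fix it, keeping the rest of your class as-is: accumulate each page's results inside load_craigslist_url() and return the combined lists. The all_* names are illustrative, and TimeoutException (imported from selenium.common.exceptions) replaces the bare except:

    def load_craigslist_url(self):
        # Assumes: from selenium.common.exceptions import TimeoutException
        all_titles, all_prices, all_dates = [], [], []
        self.driver.get(self.url)
        while True:
            try:
                wait = WebDriverWait(self.driver, self.delay)
                wait.until(EC.presence_of_element_located((By.ID, "searchform")))
                print("Page is loaded")
                # Save the returned lists and add them to the running totals
                titles, prices, dates = self.extract_post_information()
                all_titles.extend(titles)
                all_prices.extend(prices)
                all_dates.extend(dates)
                # Clicking "next" times out on the last page, ending the loop
                WebDriverWait(self.driver, 2).until(
                    EC.element_to_be_clickable((By.XPATH, '//*[@id="searchform"]/div[3]/div[3]/span[2]/a[3]'))).click()
            except TimeoutException:
                print("Last page")
                break
        return all_titles, all_prices, all_dates

Then in your main, take the lists from that call instead of calling extract_post_information() a second time:

    titles, prices, dates = scraper.load_craigslist_url()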

Upvotes: 1
