user6727547

saving the scraped data into a csv file

I am looping over the addresses in my csv file: for each one, Python goes to a website and searches for that address there. I want to save the results the website returns for each individual address into a csv file.

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import csv


driver = webdriver.Chrome("C:\Python27\Scripts\chromedriver.exe")
chrome = driver.get('https://etrakit.friscotexas.gov/Search/permit.aspx')
with open('C:/Users/thefirstcolumnedited.csv','r') as f:
    addresses = f.readlines()

    for address in addresses:
        driver.find_element_by_css_selector('#cplMain_txtSearchString').clear()       
        driver.find_element_by_css_selector('#cplMain_txtSearchString').send_keys(address)
        driver.find_element_by_css_selector('#cplMain_btnSearch').click()
        time.sleep(5)

    soup = BeautifulSoup(chrome, 'html.parser')

    writer = csv.writer(open('thematchingresults.csv', 'w'))
    writer.writerow(soup)

For example:

 6579 Mountain Sky Rd

The address above retrieves five rows of data from the website. How can I tell Beautiful Soup to save the results for each address in the csv file?

Upvotes: 2

Views: 939

Answers (1)

alecxe

Reputation: 473803

The idea is to write to the CSV file(s) inside the loop (if you want to produce a single CSV file for all the input addresses, use append mode). As for extracting the results, I'd explicitly wait for the results table element (the element with id="ctl00_cplMain_rgSearchRslts_ctl00") instead of calling time.sleep(), which is unreliable and usually slower than it needs to be. Then use pandas.read_html() to read the table into a dataframe, which we can conveniently dump into a CSV file via .to_csv():

from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# ...

wait = WebDriverWait(driver, 10)

for address in addresses:
    # readlines() keeps the trailing newline, so strip it before typing
    search_box = driver.find_element(By.CSS_SELECTOR, '#cplMain_txtSearchString')
    search_box.clear()
    search_box.send_keys(address.strip())
    driver.find_element(By.CSS_SELECTOR, '#cplMain_btnSearch').click()

    # wait for the results table
    table = wait.until(EC.visibility_of_element_located((By.ID, "ctl00_cplMain_rgSearchRslts_ctl00")))

    # make a dataframe and dump the results
    # (newer pandas versions expect a file-like object rather than a raw HTML string)
    df = pd.read_html(StringIO(table.get_attribute("outerHTML")))[0]
    with open('thematchingresults.csv', 'a') as f:
        df.to_csv(f)
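
As a side note on why an explicit wait tends to beat a fixed time.sleep(): it polls a condition and returns as soon as the condition holds, only giving up after a timeout. Here is a rough stdlib-only sketch of that idea. The wait_until name and its parameters are made up for illustration; this is not Selenium's actual implementation, which among other things also ignores "element not found" errors while polling.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Hypothetical helper for illustration only. Raises TimeoutError if the
    condition is still falsy once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            # Return immediately instead of sleeping for a fixed period
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)
```

If the table shows up after one second, a loop like this moves on after roughly one second, whereas time.sleep(5) always costs five.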

For a single "6579 Mountain Sky Rd" address, the contents of thematchingresults.csv after running the Selenium script above will be:

,Permit Number,Address,Street Name,Applicant Name,Contractor Name,SITE_SUBDIVISION,RECORDID
0,B13-2809,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,SHADDOCK HOMES LTD,SHADDOCK HOMES LTD,PCR - SHERIDAN,MAC:1308050328358768
1,B13-4096,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,MIRAGE CUSTOM POOLS,MIRAGE CUSTOM POOLS,PCR - SHERIDAN,MAC:1312030307087756
2,L14-1640,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,TDS IRRIGATION,TDS IRRIGATION,SHERIDAN,ECON:140506012624706
3,P14-0018,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,MIRAGE CUSTOM POOLS,,SHERIDAN,LCR:1401130949212891
4,ROW14-3205,6579 MOUNTAIN SKY RD,MOUNTAIN SKY RD,Housley Group,Housley Group,,TLW:1406190424422330
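
One thing to keep in mind with append mode: df.to_csv(f) writes the header row on every call, so with several addresses the header line will repeat inside the file. Below is a minimal sketch of one way to write the header only once, using just the csv module. The append_rows helper is a hypothetical name, not part of pandas or csv, and the code assumes Python 3.

```python
import csv
import os


def append_rows(path, header, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty.

    Hypothetical helper for illustration; assumes every call uses the same columns.
    """
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline='' lets the csv module control line endings (avoids blank lines on Windows)
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerows(rows)
```

Inside the loop this could be called as append_rows('thematchingresults.csv', list(df.columns), df.values.tolist()). With pandas alone, a similar effect should be possible by passing header=False to .to_csv() for every address after the first.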

Hope this is a good starting point for you.

Upvotes: 1
