RobK

Reputation: 123

Parsing 1 table from HTML that has multiple tables

I've built a crawler for a page that has only one table and is already set up with columns and such; pretty straightforward. This website has three different tables, broken out in random cells throughout the page, and I only need info from the first table. I've created a list of the info I need, but I'm not sure how to organize it and get it to run by pulling URLs from a CSV file.

If I break it down to just one URL, I can print the info from the license, but I can't get it to work for multiple URLs. I feel like I'm totally overcomplicating things.

Here are some examples of the URLs I'm trying to run:

http://search.ccb.state.or.us/search/business_details.aspx?id=221851
http://search.ccb.state.or.us/search/business_details.aspx?id=221852
http://search.ccb.state.or.us/search/business_details.aspx?id=221853

The code is all jacked up, but here's what I've got:

I appreciate any and all help.

import csv
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup as BS
from email import encoders
import time
import os
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase

def get_page():
    contents = []
    with open('OR_urls.csv','r') as csvf:
    urls = 'csv.reader(csvf)'
    r = requests.get(url)


data = {}

data['biz_info_object'] = soup(id='MainContent_contractornamelabel')[0].text.strip()
data['lic_number_object'] = soup(id='MainContent_licenselabel')[0].text.strip()
data['lic_date_object']  = soup(id='MainContent_datefirstlabel')[0].text.strip()
data['lic_status_object']  = soup(id='MainContent_licensestatuslabel')[0].text.strip()
data['lic_exp_object']  = soup(id='MainContent_licenseexpirelabel')[0].text.strip()
data['biz_address_object']  = soup(id='MainContent_addresslabel')[0].text.strip()
data['biz_phone_object']  = soup(id='MainContent_phonelabel')[0].text.strip()
data['biz_address_object']  = soup(id='MainContent_endorsementlabel')[0].text.strip()


with open('OR_urls.csv','r') as csvf: # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        page = ('get_page')
        df1 = pd.read_html(page)

Upvotes: 0

Views: 90

Answers (1)

Martin Evans

Reputation: 46759

As you say, you appear to have combined a few different scripts. Hopefully the following will help you better understand the required structure.

I assume that your OR_urls.csv file contains your URLs in the first column. The script below reads one row at a time from the CSV file, uses requests.get() to fetch the webpage, and parses it with BeautifulSoup. Your various elements are then extracted from the page into a dictionary, which is displayed along with the URL.

from bs4 import BeautifulSoup
import requests
import csv

with open('OR_urls.csv') as f_input:
    csv_input = csv.reader(f_input)

    for url in csv_input:
        r = requests.get(url[0])        # Assume the URL is in the first column
        soup = BeautifulSoup(r.text, "html.parser")

        data = {}

        data['biz_info_object'] = soup.find(id='MainContent_contractornamelabel').get_text(strip=True)
        data['lic_number_object'] = soup.find(id='MainContent_licenselabel').get_text(strip=True)
        data['lic_date_object'] = soup.find(id='MainContent_datefirstlabel').get_text(strip=True)
        data['lic_status_object'] = soup.find(id='MainContent_licensestatuslabel').get_text(strip=True)
        data['lic_exp_object'] = soup.find(id='MainContent_licenseexpirelabel').get_text(strip=True)
        data['biz_address_object'] = soup.find(id='MainContent_addresslabel').get_text(strip=True)
        data['biz_phone_object'] = soup.find(id='MainContent_phonelabel').get_text(strip=True)
        # Use a distinct key here; reusing 'biz_address_object' would overwrite the address above
        data['biz_endorsement_object'] = soup.find(id='MainContent_endorsementlabel').get_text(strip=True)

        print(url[0], data)

Giving you the following output:

http://search.ccb.state.or.us/search/business_details.aspx?id=221851 {'biz_info_object': 'ANDREW LLOYD PARRY', 'lic_number_object': '221851', 'lic_date_object': '7/17/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/17/2020', 'biz_endorsement_object': 'Residential General Contractor', 'biz_phone_object': '(802) 779-7180'}
http://search.ccb.state.or.us/search/business_details.aspx?id=221852 {'biz_info_object': 'SHANE MICHAEL DALLMAN', 'lic_number_object': '221852', 'lic_date_object': '7/17/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/17/2020', 'biz_endorsement_object': 'Residential General Contractor', 'biz_phone_object': '(503) 933-5406'}
http://search.ccb.state.or.us/search/business_details.aspx?id=221853 {'biz_info_object': 'INTEGRITY HOMES NW INC', 'lic_number_object': '221853', 'lic_date_object': '7/24/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/24/2020', 'biz_endorsement_object': 'Residential General Contractor', 'biz_phone_object': '(503) 522-6055'}

You can further improve this by creating a sequence of (name, ID) pairs for all the elements you want and using a dictionary comprehension to build the data. A csv.DictWriter() can then be used to write the rows to a CSV file:

from bs4 import BeautifulSoup
import requests
import csv

objects = (
    ('biz_info_object', 'MainContent_contractornamelabel'),
    ('lic_number_object', 'MainContent_licenselabel'),
    ('lic_date_object', 'MainContent_datefirstlabel'),
    ('lic_status_object', 'MainContent_licensestatuslabel'),
    ('lic_exp_object', 'MainContent_licenseexpirelabel'),
    ('biz_address_object', 'MainContent_addresslabel'),
    ('biz_phone_object', 'MainContent_phonelabel'),
    ('biz_endorsement_object', 'MainContent_endorsementlabel'),  # distinct key, so the address column isn't duplicated
)

with open('OR_urls.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.DictWriter(f_output, fieldnames=[name for name, id in objects])
    csv_output.writeheader()

    for url in csv_input:
        r = requests.get(url[0])        # Assume the URL is in the first column
        soup = BeautifulSoup(r.text, "html.parser")
        data = {name: soup.find(id=id).get_text(strip=True) for name, id in objects}
        csv_output.writerow(data)
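
One caveat with both versions: if a page is missing one of these labels, soup.find() returns None and the chained .get_text() call raises an AttributeError. A minimal guard, sketched as a small helper that slots into the loop in place of the dictionary comprehension (the empty-string fallback is an assumption, not part of the original answer):

def safe_text(soup, element_id):
    # soup.find() returns None when the id is absent; fall back to '' (assumed default)
    tag = soup.find(id=element_id)
    return tag.get_text(strip=True) if tag is not None else ''

# Inside the for loop, replacing the comprehension above:
data = {name: safe_text(soup, id) for name, id in objects}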

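And since your original script already imports pandas: pd.read_html() parses every <table> on a page and returns a list of DataFrames, so keeping only the first table is just a matter of indexing. A minimal sketch for a single URL (it assumes the license details sit in the page's first table, and pd.read_html() needs lxml or html5lib installed):

import pandas as pd
import requests

url = 'http://search.ccb.state.or.us/search/business_details.aspx?id=221851'
r = requests.get(url)

tables = pd.read_html(r.text)   # one DataFrame per <table> found on the page
first_table = tables[0]         # the question only needs the first table
print(first_table)
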
Upvotes: 1
