Reputation: 123
I've built a crawler for a page that has only one table and is already set up with columns and such, so that part is pretty straightforward. This website has three different tables, broken out in random cells throughout the page, and I only need info from the first table. I've created a list of the info I need, but I'm not sure how to organize it and get it to run by pulling URLs from a CSV file.
If I break it down to just one URL, I can print the info from the license, but I can't get it to work for multiple URLs. I feel like I'm totally overcomplicating things.
Here are some examples of the URLs I'm trying to run:
http://search.ccb.state.or.us/search/business_details.aspx?id=221851
http://search.ccb.state.or.us/search/business_details.aspx?id=221852
http://search.ccb.state.or.us/search/business_details.aspx?id=221853
The code is all jacked up, but here's what I've got. I appreciate any and all help.
import csv
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup as BS
from email import encoders
import time
import os
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
def get_page():
    contents = []
    with open('OR_urls.csv','r') as csvf:
        urls = 'csv.reader(csvf)'
        r = requests.get(url)
    data = {}
    data['biz_info_object'] = soup(id='MainContent_contractornamelabel')[0].text.strip()
    data['lic_number_object'] = soup(id='MainContent_licenselabel')[0].text.strip()
    data['lic_date_object'] = soup(id='MainContent_datefirstlabel')[0].text.strip()
    data['lic_status_object'] = soup(id='MainContent_licensestatuslabel')[0].text.strip()
    data['lic_exp_object'] = soup(id='MainContent_licenseexpirelabel')[0].text.strip()
    data['biz_address_object'] = soup(id='MainContent_addresslabel')[0].text.strip()
    data['biz_phone_object'] = soup(id='MainContent_phonelabel')[0].text.strip()
    data['biz_address_object'] = soup(id='MainContent_endorsementlabel')[0].text.strip()

with open('OR_urls.csv','r') as csvf: # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        page = ('get_page')
        df1 = pd.read_html(page)
Upvotes: 0
Views: 90
Reputation: 46759
As you say, you appear to have combined a few different scripts. Hopefully the following will help you better understand the required structure.
I assume that your OR_urls.csv file contains your URLs in the first column. The script reads the file one row at a time and uses requests.get() to fetch each webpage. The response is then parsed with BeautifulSoup, and your various elements are extracted from the page into a dictionary, which is then displayed along with the URL.
from bs4 import BeautifulSoup
import requests
import csv
with open('OR_urls.csv') as f_input:
    csv_input = csv.reader(f_input)

    for url in csv_input:
        r = requests.get(url[0])    # Assume the URL is in the first column
        soup = BeautifulSoup(r.text, "html.parser")

        data = {}
        data['biz_info_object'] = soup.find(id='MainContent_contractornamelabel').get_text(strip=True)
        data['lic_number_object'] = soup.find(id='MainContent_licenselabel').get_text(strip=True)
        data['lic_date_object'] = soup.find(id='MainContent_datefirstlabel').get_text(strip=True)
        data['lic_status_object'] = soup.find(id='MainContent_licensestatuslabel').get_text(strip=True)
        data['lic_exp_object'] = soup.find(id='MainContent_licenseexpirelabel').get_text(strip=True)
        data['biz_address_object'] = soup.find(id='MainContent_addresslabel').get_text(strip=True)
        data['biz_phone_object'] = soup.find(id='MainContent_endorsementlabel').get_text(strip=True)  # note: this reuses the 'biz_address_object' key, so the endorsement text overwrites the address
        print(url[0], data)
Giving you the following output:
http://search.ccb.state.or.us/search/business_details.aspx?id=221851 {'biz_info_object': 'ANDREW LLOYD PARRY', 'lic_number_object': '221851', 'lic_date_object': '7/17/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/17/2020', 'biz_address_object': 'Residential General Contractor', 'biz_phone_object': '(802) 779-7180'}
http://search.ccb.state.or.us/search/business_details.aspx?id=221852 {'biz_info_object': 'SHANE MICHAEL DALLMAN', 'lic_number_object': '221852', 'lic_date_object': '7/17/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/17/2020', 'biz_address_object': 'Residential General Contractor', 'biz_phone_object': '(503) 933-5406'}
http://search.ccb.state.or.us/search/business_details.aspx?id=221853 {'biz_info_object': 'INTEGRITY HOMES NW INC', 'lic_number_object': '221853', 'lic_date_object': '7/24/2018', 'lic_status_object': 'Active', 'lic_exp_object': '7/24/2020', 'biz_address_object': 'Residential General Contractor', 'biz_phone_object': '(503) 522-6055'}
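One thing to be aware of: soup.find() returns None when a page doesn't contain the given ID, so the chained .get_text() call would raise an AttributeError on any URL whose page lacks one of these labels. A minimal guard, using a hypothetical 'N/A' placeholder for missing fields, might look like:
element = soup.find(id='MainContent_contractornamelabel')
data['biz_info_object'] = element.get_text(strip=True) if element else 'N/A'   # fall back when the label is absent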
You can further improve this by creating a list of all the IDs you want and using a dictionary comprehension to build the data dictionary. A csv.DictWriter() could then be used to write the data to a CSV file:
from bs4 import BeautifulSoup
import requests
import csv
objects = (
    ('biz_info_object', 'MainContent_contractornamelabel'),
    ('lic_number_object', 'MainContent_licenselabel'),
    ('lic_date_object', 'MainContent_datefirstlabel'),
    ('lic_status_object', 'MainContent_licensestatuslabel'),
    ('lic_exp_object', 'MainContent_licenseexpirelabel'),
    ('biz_address_object', 'MainContent_addresslabel'),
    ('biz_phone_object', 'MainContent_phonelabel'),
    ('biz_endorsement_object', 'MainContent_endorsementlabel'),   # separate key so the endorsement doesn't overwrite the address
)

with open('OR_urls.csv') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.DictWriter(f_output, fieldnames=[name for name, id in objects])
    csv_output.writeheader()

    for url in csv_input:
        r = requests.get(url[0])    # Assume the URL is in the first column
        soup = BeautifulSoup(r.text, "html.parser")
        data = {name: soup.find(id=id).get_text(strip=True) for name, id in objects}
        csv_output.writerow(data)
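Finally, since your original attempt used pd.read_html(), note that the same approach could grab just the first table from each page, assuming the license details really are in the first <table> element (worth verifying against the page source). A minimal sketch:
import pandas as pd
import requests

r = requests.get('http://search.ccb.state.or.us/search/business_details.aspx?id=221851')
tables = pd.read_html(r.text)   # one DataFrame per <table> on the page
first_table = tables[0]         # keep only the first table
print(first_table)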
Upvotes: 1