Reputation: 11
I am using BeautifulSoup to scrape a table off of every URL page for each state namecensus.com/zipcodes/{state_name}. I was able to scrape the data and now am wondering how I can best get the information into a pandas dataFrame.
I was able to successfully populate all parsed table elements into a pandas dataFrame using a for loop but I am stuck on how to create a new field that corresponds to the H1 of the title of each page. I need that information to indicate which state the zip codes I am extracting correspond to.
Please note that there will also be only 51 H1s (US States + DC) and 1000s of cities so I need to ensure the H1 value repeats itself in each row associated with that state and avoid value errors.
The table should look like the below. My code has all fields populating from the table but I need to incorporate the h1 state value.
State | Zip Code | City | County | Land Area(Sq. Meters) | Land Area(Sq. Miles) | Land Area (Sq. K |
---|---|---|---|---|---|---|
H1 value | tr value | tr value | tr value | tr value | tr value | tr value |
Current Code
# import libraries
import requests
import numpy as np
from bs4 import BeautifulSoup
import pandas as pd```
# create an URL object
urls = ['https://namecensus.com/zip-codes/alabama', 'https://namecensus.com/zip-codes/alaska','https://namecensus.com/zip-codes/arizona',
'https://namecensus.com/zip-codes/arizona','https://namecensus.com/zip-codes/arkansas' ]
# scrape elements
for url in urls:
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
# print titles only
h1 = soup.title
print(h1.get_text())
# print table values only
table1 = soup.find("table")
print(table1.get_text)
# Obtain every title of columns with tag <th>
headers = []
state = []
for i in table1.find_all('th'):
title = i.text
headers.append(title)
for j in soup.title:
ref = j.text
state.append(ref)
# Obtain every title of columns with tag <h1> and <th>
headers = []
state = []
for i in table1.find_all('th'):
title = i.text
headers.append(title)
for j in soup.title:
ref = j.text
state.append(ref)
# Create a dataframe
mydata = pd.DataFrame(columns = headers)
# Create a for loop to fill mydata
for j in table1.find_all('tr')[1:]:
row_data = j.find_all('td')
row = [i.text for i in row_data]
length = len(mydata)
mydata.loc[length] = row
Upvotes: 0
Views: 34
Reputation: 28630
You're doing an awful lot there to parse a table. Pandas
will parse that for you (and actually uses bs4 under the hood).
import pandas as pd
import requests
# create an URL object
urls = ['https://namecensus.com/zip-codes/alabama', 'https://namecensus.com/zip-codes/alaska','https://namecensus.com/zip-codes/arizona',
'https://namecensus.com/zip-codes/arizona','https://namecensus.com/zip-codes/arkansas', 'https://namecensus.com/zip-codes/new-york' ]
dfs_list = []
# scrape elements
for url in urls:
response = requests.get(url)
state = ' '.join(url.split('/')[-1].split('-')).title()
df = pd.read_html(response.text)[0]
df['State'] = state
dfs_list.append(df)
print(f'Collected: {url.split("/")[-1]}')
df = pd.concat(dfs_list)
Output: first 10 rows:
print(df.head(10).to_markdown())
| | ZIP Code | City | County | Land Area (Sq. Meters) | Land Area (Sq. Miles) | Land Area (Sq. Kilometers) | State |
|---:|-----------:|:---------------------------------------------------------------------------|:------------------------------------------------------|-------------------------:|------------------------:|-----------------------------:|:--------|
| 0 | 30165 | Rome | Chattooga County, Cherokee County, Floyd County | 4205500 | 2 | 4 | Alabama |
| 1 | 31905 | Columbus, Cusseta-Chattahoochee County unified government | Chattahoochee County, Muscogee County, Russell County | 4389310 | 2 | 4 | Alabama |
| 2 | 35004 | Leeds, Margaret, Moody, Trussville | St. Clair County | 46802100 | 18 | 47 | Alabama |
| 3 | 35005 | Adamsville, Birmingham, Forestdale, Graysville, May, Mulga, Sylvan Springs | Jefferson County | 89378400 | 35 | 89 | Alabama |
| 4 | 35006 | Huey, North Johns | Jefferson County, Tuscaloosa County, Walker County | 258505000 | 100 | 259 | Alabama |
| 5 | 35007 | Alabaster, Calera, Helena, Pelham | Shelby County | 96812100 | 37 | 97 | Alabama |
| 6 | 35010 | Alexander City, Hackneyville, Hissop, New Site, Our Town, Ray | Coosa County, Elmore County, Tallapoosa County | 563568000 | 218 | 564 | Alabama |
| 7 | 35013 | Allgood | Blount County | 74951 | 0 | 0 | Alabama |
| 8 | 35014 | Childersburg, Sylacauga, Talladega, Vincent | Talladega County | 247319000 | 95 | 247 | Alabama |
| 9 | 35016 | Arab, Joppa | Blount County, Cullman County, Marshall County | 192558000 | 74 | 193 | Alabama |
Upvotes: 1