Creating a DataFrame from multiple URL pages + Beautiful Soup

Question

I am using BeautifulSoup to scrape a table from the page for each state at namecensus.com/zip-codes/{state_name}. I was able to scrape the data and am now wondering how best to get it into a pandas DataFrame.

I was able to populate all parsed table elements into a pandas DataFrame using a for loop, but I am stuck on how to create a new field that holds the H1 title of each page. I need that information to indicate which state the extracted zip codes belong to.

Please note that there will be only 51 H1s (US states + DC) but thousands of cities, so I need the H1 value to repeat in every row associated with that state, without raising value errors.

The table should look like the one below. My code populates all fields from the table, but I still need to incorporate the H1 state value.

State	Zip Code	City	County	Land Area(Sq. Meters)	Land Area(Sq. Miles)	Land Area (Sq. K
H1 value	tr value	tr value	tr value	tr value	tr value	tr value
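
The requirement that the state value repeat on every row can be met with plain pandas assignment: a single string assigned to a column is broadcast to every row of that frame, so no list of matching length is needed. A minimal sketch, using made-up placeholder rows (the zip codes, city names, and state name below are illustrative, not scraped data):

```python
import pandas as pd

# Placeholder rows standing in for one state's parsed table (illustrative values only)
page_df = pd.DataFrame(
    {
        "Zip Code": ["00001", "00002", "00003"],
        "City": ["City A", "City B", "City C"],
    }
)

# A scalar is broadcast to every row, so its length never has to match the table.
# insert(0, ...) places the new column first, matching the desired layout.
page_df.insert(0, "State", "Example State")

print(page_df)
```

Because the value is a scalar rather than a list, this should not trigger the length-mismatch errors that arise when a 51-element list is assigned to a frame with thousands of rows.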

Current Code

# import libraries
import requests
import numpy as np
from bs4 import BeautifulSoup
import pandas as pd

# list of state pages to scrape
urls = ['https://namecensus.com/zip-codes/alabama', 'https://namecensus.com/zip-codes/alaska',
        'https://namecensus.com/zip-codes/arizona', 'https://namecensus.com/zip-codes/arkansas']

# scrape elements
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # print the page title (note: soup.title is the <title> tag, not the <h1>)
    h1 = soup.title
    print(h1.get_text())

    # print table values only
    table1 = soup.find("table")
    print(table1.get_text())

# Obtain every column title from the <th> tags
headers = []
state = []

for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
for j in soup.title:
    ref = j.text
    state.append(ref)

# Create a dataframe
mydata = pd.DataFrame(columns = headers)

# Create a for loop to fill mydata
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row
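
One possible way to attach the state to every row is to build one DataFrame per page inside the loop, add the state column there, and concatenate at the end, rather than reading `table1` after the loop (where it only refers to the last page fetched). The sketch below assumes each page has an `<h1>` containing the state name and a single table whose header cells are `<th>` tags; `table_to_frame` and `scrape_states` are hypothetical helper names, not part of the original code.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup


def table_to_frame(html):
    """Parse the first <table> in `html` into a DataFrame with a leading State column."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: the page's <h1> text identifies the state
    state = soup.find("h1").get_text(strip=True)

    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]

    # One list of cell texts per <tr>; the header row has no <td> cells and is dropped
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
    ]
    rows = [row for row in rows if row]

    frame = pd.DataFrame(rows, columns=headers)
    frame.insert(0, "State", state)  # scalar repeats on every row of this page
    return frame


def scrape_states(urls):
    """Fetch each URL and stack the per-page frames into one DataFrame."""
    frames = [table_to_frame(requests.get(url).text) for url in urls]
    return pd.concat(frames, ignore_index=True)
```

Calling `scrape_states(urls)` with the list defined earlier would then return a single combined frame. Building the rows as a list and constructing each frame once also avoids growing `mydata` one `.loc` assignment at a time, which tends to be slow for thousands of rows.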
