Alison810
Alison810

Reputation: 11

Creating a dataFrame from multiple URL pages + Beautiful Soup

I am using BeautifulSoup to scrape a table off of every URL page for each state namecensus.com/zipcodes/{state_name}. I was able to scrape the data and now am wondering how I can best get the information into a pandas dataFrame.

I was able to successfully populate all parsed table elements into a pandas dataFrame using a for loop but I am stuck on how to create a new field that corresponds to the H1 of the title of each page. I need that information to indicate which state the zip codes I am extracting correspond to.

Please note that there will also be only 51 H1s (US States + DC) and 1000s of cities so I need to ensure the H1 value repeats itself in each row associated with that state and avoid value errors.

The table should look like the below. My code has all fields populating from the table but I need to incorporate the h1 state value.

State Zip Code City County Land Area(Sq. Meters) Land Area(Sq. Miles) Land Area (Sq. K
H1 value tr value tr value tr value tr value tr value tr value

Current Code

# import libraries
import requests
import numpy as np
from bs4 import BeautifulSoup
import pandas as pd```

# create an URL object
urls = ['https://namecensus.com/zip-codes/alabama', 'https://namecensus.com/zip-codes/alaska','https://namecensus.com/zip-codes/arizona',
       'https://namecensus.com/zip-codes/arizona','https://namecensus.com/zip-codes/arkansas' ]

# scrape elements
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # print titles only
    h1 = soup.title
    print(h1.get_text())
    
    # print table values only 
    table1 = soup.find("table")  
    print(table1.get_text)

# Obtain every title of columns with tag <th>
headers = []
state = []

for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
for j in soup.title:
    ref = j.text
    state.append(ref)

# Obtain every title of columns with tag <h1> and <th> 
headers = []
state = []

for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
    
for j in soup.title:
    ref = j.text
    state.append(ref)

# Create a dataframe
mydata = pd.DataFrame(columns = headers)

# Create a for loop to fill mydata
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row

Upvotes: 0

Views: 34

Answers (1)

chitown88
chitown88

Reputation: 28630

You're doing an awful lot there to parse a table. Pandas will parse that for you (and actually uses bs4 under the hood).

import pandas as pd
import requests

# create an URL object
urls = ['https://namecensus.com/zip-codes/alabama', 'https://namecensus.com/zip-codes/alaska','https://namecensus.com/zip-codes/arizona',
       'https://namecensus.com/zip-codes/arizona','https://namecensus.com/zip-codes/arkansas', 'https://namecensus.com/zip-codes/new-york' ]

dfs_list = []
# scrape elements
for url in urls:
    response = requests.get(url)
    state = ' '.join(url.split('/')[-1].split('-')).title()
    
    df = pd.read_html(response.text)[0]
    df['State'] = state
    
    dfs_list.append(df)
    print(f'Collected: {url.split("/")[-1]}')
    
df = pd.concat(dfs_list)

Output: first 10 rows:

print(df.head(10).to_markdown())
|    |   ZIP Code | City                                                                       | County                                                |   Land Area (Sq. Meters) |   Land Area (Sq. Miles) |   Land Area (Sq. Kilometers) | State   |
|---:|-----------:|:---------------------------------------------------------------------------|:------------------------------------------------------|-------------------------:|------------------------:|-----------------------------:|:--------|
|  0 |      30165 | Rome                                                                       | Chattooga County, Cherokee County, Floyd County       |                  4205500 |                       2 |                            4 | Alabama |
|  1 |      31905 | Columbus, Cusseta-Chattahoochee County unified government                  | Chattahoochee County, Muscogee County, Russell County |                  4389310 |                       2 |                            4 | Alabama |
|  2 |      35004 | Leeds, Margaret, Moody, Trussville                                         | St. Clair County                                      |                 46802100 |                      18 |                           47 | Alabama |
|  3 |      35005 | Adamsville, Birmingham, Forestdale, Graysville, May, Mulga, Sylvan Springs | Jefferson County                                      |                 89378400 |                      35 |                           89 | Alabama |
|  4 |      35006 | Huey, North Johns                                                          | Jefferson County, Tuscaloosa County, Walker County    |                258505000 |                     100 |                          259 | Alabama |
|  5 |      35007 | Alabaster, Calera, Helena, Pelham                                          | Shelby County                                         |                 96812100 |                      37 |                           97 | Alabama |
|  6 |      35010 | Alexander City, Hackneyville, Hissop, New Site, Our Town, Ray              | Coosa County, Elmore County, Tallapoosa County        |                563568000 |                     218 |                          564 | Alabama |
|  7 |      35013 | Allgood                                                                    | Blount County                                         |                    74951 |                       0 |                            0 | Alabama |
|  8 |      35014 | Childersburg, Sylacauga, Talladega, Vincent                                | Talladega County                                      |                247319000 |                      95 |                          247 | Alabama |
|  9 |      35016 | Arab, Joppa                                                                | Blount County, Cullman County, Marshall County        |                192558000 |                      74 |                          193 | Alabama |

Upvotes: 1

Related Questions