user16443752

Reputation: 19

Issue web scraping a linked header with Beautiful Soup

I am running into an issue pulling in the human-readable header names from a table in an HTML document. I can pull in the id attribute of each header cell, but my trouble comes when trying to pull in the header text that sits between the tags. I am not sure what I need to do in this instance. Below is my code; it all runs except for the last for loop.

# Import libraries 

import requests
from bs4 import BeautifulSoup
from pprint import pprint
import pandas as pd
import numpy as np

# Pull the HTML link into a local file or buffer
#   and then parse with the BeautifulSoup library
# ------------------------------------------------

url = 'https://web.dsa.missouri.edu/static/mirror_sites/factfinder.census.gov/bkmk/table/1.0/en/GEP/2014/00A4/0100000US.html'
r = requests.get(url)

#print('Status: ' + str(r.status_code))
#print(requests.status_codes._codes[200])

soup = BeautifulSoup(r.content, "html")
table = soup.find(id='data')  

#print(table)
# Convert the data into a list of dictionaries
#   or some other structure you can convert into
#   pandas Data Frame
# ------------------------------------------------
trs = table.find_all('tr')
#print(trs)

header_row = trs[0]
#print(header_row)

names = [] 

for column in header_row.find_all('th'):
    names.append(column.attrs['id'])
#print(names)

db_names = []
for column in header_row.find_all('a'):
    db_names.append(column.attrs['data-vo-id']) # ISSUE ARISES HERE!!!

print(db_names)

Upvotes: 1

Views: 42

Answers (2)

QHarr

Reputation: 84465

Let pandas read_html do the work for you, and simply specify the table id to find:

from pandas import read_html as rh

url = 'https://web.dsa.missouri.edu/static/mirror_sites/factfinder.census.gov/bkmk/table/1.0/en/GEP/2014/00A4/0100000US.html'
table = rh(url, attrs={'id': 'data'})[0]
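The human-readable column names the question is after should then be sitting in table.columns. A quick sketch, continuing from the table above; whether you get a flat Index or a MultiIndex depends on how this particular header row parses, which I have not checked:

import pandas as pd

# the column labels of the parsed table are the human-readable headers
headers = table.columns.tolist()

# if the header spans several rows, pandas may build a MultiIndex; flatten it if so
if isinstance(table.columns, pd.MultiIndex):
    headers = [' '.join(str(part) for part in col).strip() for col in table.columns]

print(headers)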

Upvotes: 1

techwreck

Reputation: 53

You can try something like this:

soup = BeautifulSoup(r.content, "html")
tables = soup.find_all('table', {'id': 'data'})

trs = tables[0].find_all('tr')
#print(trs)
names = []
for row in trs[:1]:
    # the header cells are <th> elements; take the visible text of each one
    cells = row.find_all('th')
    row_text = [cell.get_text(strip=True) for cell in cells]
    for column in row_text:
        names.append(column)
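If you specifically want the text of the linked headers (the part the last loop in the question was reaching for), pull the text of the <a> tags and use .get() for attributes, so a missing data-vo-id returns None instead of raising a KeyError. A small sketch, reusing trs from above; I have not checked which header anchors actually carry a data-vo-id:

header_row = trs[0]

db_names = []
for link in header_row.find_all('a'):
    # .get() returns None instead of raising KeyError when the attribute is missing
    vo_id = link.get('data-vo-id')
    text = link.get_text(strip=True)  # human-readable header text
    db_names.append((vo_id, text))

print(db_names)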

Upvotes: 0
