Brandon Jacobson

Reputation: 159

How to extract a table from a website using BeautifulSoup?

I want to extract the FIPS code for each county in Louisiana from this website using BeautifulSoup and create a pandas DataFrame: https://www.nrcs.usda.gov/wps/portal/nrcs/detail/la/technical/cp/?cid=nrcs143_013697

The columns would be FIPS, Name, and State. I've tried finding by tr, td, and table when I inspect the element, but I don't know how to single out just the main data and then put it into a pandas DataFrame. Once I find the specific table, it should be easy to do something like:

if state == 'LA':
     # put data into a dataframe

import requests
from bs4 import BeautifulSoup

url = "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/la/technical/cp/?cid=nrcs143_013697"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
# print(soup)
for county in soup.find_all('table'):
    print(county.text)

Upvotes: 1

Views: 7242

Answers (2)

CodeMonkey

Reputation: 23738

There is only one table, so you can iterate over the <tr> elements in it.

If you want the DataFrame to include only one particular state, you can either filter the rows before adding them to the DataFrame, or build a DataFrame of all the data and then take a subset of it (a sketch of the second option follows the output below).

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/la/technical/cp/?cid=nrcs143_013697"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
data = []
for tr in soup.find('table', class_='data').find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    # If want to filter out all except LA then can do that here
    if len(row) == 3 and row[2] == 'LA':
        data.append(row)
df = pd.DataFrame(data, columns=['FIPS', 'Name', 'State'])
print(df)

Output:

     FIPS          Name State
0   22001        Acadia    LA
1   22003         Allen    LA
2   22005     Ascension    LA
3   22007    Assumption    LA
4   22009     Avoyelles    LA
..    ...           ...   ...
63  22127          Winn    LA
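If you prefer the second option mentioned above, a minimal sketch (reusing the same soup object and column names) would collect every row first and then take an LA-only subset with a boolean mask:

data = []
for tr in soup.find('table', class_='data').find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    if len(row) == 3:  # skip header and malformed rows
        data.append(row)

df_all = pd.DataFrame(data, columns=['FIPS', 'Name', 'State'])
df_la = df_all[df_all['State'] == 'LA']  # keep only Louisiana counties
print(df_la)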

Upvotes: 2

Andrej Kesely

Reputation: 195408

You can select <table> with class="data" and then use pd.read_html. For example:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/la/technical/cp/?cid=nrcs143_013697"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
df = pd.read_html(str(soup.select_one(".data")))[0]
# filter State == 'LA'
print(df[df.State == "LA"].head())

Prints:

       FIPS        Name State
1109  22001      Acadia    LA
1110  22003       Allen    LA
1111  22005   Ascension    LA
1112  22007  Assumption    LA
1113  22009   Avoyelles    LA
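As a variation on the same idea (an untested sketch), pd.read_html can locate the table itself via its attrs parameter, which lets you skip the explicit BeautifulSoup selection:

import requests
import pandas as pd

url = "https://www.nrcs.usda.gov/wps/portal/nrcs/detail/la/technical/cp/?cid=nrcs143_013697"

# let read_html find the table by its class attribute
df = pd.read_html(requests.get(url).text, attrs={"class": "data"})[0]
print(df[df.State == "LA"].head())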

Upvotes: 2
