Reputation: 3
I'm trying to write a program using bs4 (or another method) that can scrape the data from the title pages of specific counties, like this one: https://data.census.gov/profile/Grant_County,_New_Mexico?g=050XX00US35017#populations-and-people
I want to retrieve the nine statistics shown on the title page without having to go through their source CSVs and recalculate the values. I have only ever scraped data from simple HTML pages such as Wikipedia, and I'm unsure how to do it when the numbers don't appear directly in the page's HTML (maybe I need some JavaScript knowledge here, too?). Thanks for any tips/nudges in the right direction!
import requests
import pandas as pd
from bs4 import BeautifulSoup
Upvotes: 0
Views: 155
Reputation: 6514
Try something like this:
import requests
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.5',
}
params = {
    # Grant County, New Mexico
    'g': '050XX00US35017',
    # Los Angeles County, California
    # 'g': '050XX00US06037',
}
response = requests.get(
    'https://data.census.gov/api/profile/content/highlights',
    params=params,
    headers=headers
)
print(json.dumps(response.json(), indent=2))
The county is specified via the g parameter in the params dictionary. You'll find this identifier in the URL of the profile page. For example, the URL in the question contains ?g=050XX00US35017.
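If you have a list of profile URLs rather than identifiers, the g value can be pulled out programmatically. This is a minimal sketch using only the standard library; the helper name extract_geo_id is made up for illustration and is not part of any Census tooling.

```python
from urllib.parse import urlparse, parse_qs


def extract_geo_id(profile_url: str) -> str:
    """Return the value of the `g` query parameter from a profile URL.

    Hypothetical helper: it only assumes the identifier is carried in `g`,
    as it is in the URL from the question.
    """
    query = parse_qs(urlparse(profile_url).query)
    values = query.get("g")
    if not values:
        raise ValueError(f"no 'g' parameter in {profile_url!r}")
    return values[0]


url = (
    "https://data.census.gov/profile/Grant_County,_New_Mexico"
    "?g=050XX00US35017#populations-and-people"
)
print(extract_geo_id(url))  # 050XX00US35017
```

The returned string can be passed straight into params as the 'g' value.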
This is what the data looks like for Grant County.
{
  "selectedProfile": {
    "label": "Grant County, New Mexico",
    "params": {
      "g": "050XX00US35017"
    }
  },
  "highlights": [
    {
      "format": "number",
      "value": "28185",
      "topic": "Populations and People",
      "label": "Total Population",
      "sourceLink": "https://www.census.gov/programs-surveys/decennial-census/about/rdo.html",
      "source": "2020 Decennial Census",
      "tableId": "DECENNIALPL2020.P1"
    },
    {
      "format": "dollar",
      "value": "44895",
      "topic": "Income and Poverty",
      "label": "Median Household Income",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSST5Y2022.S1901"
    },
    {
      "format": "percent",
      "value": "26.4",
      "topic": "Education",
      "label": "Bachelor's Degree or Higher",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSST5Y2022.S1501"
    },
    {
      "format": "percent",
      "value": "41.9",
      "topic": "Employment",
      "label": "Employment Rate",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSDP5Y2022.DP03"
    },
    {
      "format": "number",
      "value": "14584",
      "topic": "Housing",
      "label": "Total Housing Units",
      "sourceLink": "https://www.census.gov/programs-surveys/decennial-census/about/rdo.html",
      "source": "2020 Decennial Census",
      "tableId": "DECENNIALPL2020.H1"
    },
    {
      "format": "percent",
      "value": "5.1",
      "topic": "Health",
      "label": "Without Health Care Coverage",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSST5Y2022.S2701"
    },
    {
      "format": "number",
      "value": "534",
      "topic": "Business and Economy",
      "label": "Total Employer Establishments",
      "sourceLink": "https://www.census.gov/programs-surveys/cbp.html",
      "source": "2021 Economic Surveys Business Patterns",
      "tableId": "CBP2021.CB2100CBP"
    },
    {
      "format": "number",
      "value": "11292",
      "topic": "Families and Living Arrangements",
      "label": "Total Households",
      "sourceLink": "https://www.census.gov/programs-surveys/acs.html",
      "source": "2022 American Community Survey 5-Year Estimates",
      "tableId": "ACSDP5Y2022.DP02"
    },
    {
      "format": "number",
      "value": "13466",
      "topic": "Race and Ethnicity",
      "label": "Hispanic or Latino (of any race)",
      "sourceLink": "https://www.census.gov/programs-surveys/decennial-census/about/rdo.html",
      "source": "2020 Decennial Census",
      "tableId": "DECENNIALDHC2020.P9"
    }
  ]
}
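Note that every "value" comes back as a string. Here is a rough sketch of turning the highlights list into a label-to-number mapping. The function names to_number and parse_highlights are hypothetical, the sample payload is just a trimmed copy of the output above, and the sketch assumes every value is a plain numeric string, which may not hold for every county.

```python
def to_number(raw: str):
    """Convert "28185" to int and "26.4" to float."""
    try:
        return int(raw)
    except ValueError:
        return float(raw)


def parse_highlights(payload: dict) -> dict:
    """Map each highlight's label to its numeric value."""
    return {item["label"]: to_number(item["value"]) for item in payload["highlights"]}


# Trimmed copy of the response shown above, used here instead of a live request.
sample_payload = {
    "selectedProfile": {"label": "Grant County, New Mexico"},
    "highlights": [
        {"format": "number", "value": "28185", "label": "Total Population"},
        {"format": "percent", "value": "26.4", "label": "Bachelor's Degree or Higher"},
    ],
}

stats = parse_highlights(sample_payload)
print(stats["Total Population"])             # 28185
print(stats["Bachelor's Degree or Higher"])  # 26.4
```

With the live request you would call parse_highlights(response.json()) instead. Since the question already imports pandas, pd.DataFrame(response.json()["highlights"]) is another option if you prefer one row per statistic.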
If you want to get the area from the descriptor paragraph (see comment below) then you can add this:
params = {
    'g': '050XX00US35017',
    'includeHighlights': 'false',
}
response = requests.get(
    'https://data.census.gov/api/profile/metadata',
    params=params,
    headers=headers
)
print(response.json()["header"]["description"])
That returns the complete paragraph text, so you'll need to use string operations (or a regular expression) to extract the area.
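As a hedged sketch of that last step: the pattern below assumes the paragraph states the area as a number followed by "square miles". That wording is an assumption, and the sample sentence is invented for illustration, so check the real description text and adjust the pattern accordingly.

```python
import re
from typing import Optional

# Assumed wording: "<number> square miles", e.g. "1,234.5 square miles".
AREA_PATTERN = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s+square miles")


def extract_area_sq_miles(description: str) -> Optional[float]:
    """Return the first "<number> square miles" figure, or None if absent."""
    match = AREA_PATTERN.search(description)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group(1).replace(",", ""))


# Made-up sentence, not real API output.
sample = "Example County has 1,234.5 square miles of land area."
print(extract_area_sq_miles(sample))  # 1234.5
```

In the real script you would pass response.json()["header"]["description"] in place of sample.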
Upvotes: 0