Dan

Reputation: 527

Beautiful Soup parsing HTML containing JSON

I am using Python 3 and trying to parse NWS weather alerts, which appear to contain JSON objects, with Beautiful Soup. This is as far as I got. BS outputs the following (snippet from the top of the output):

>>> soup.body
<body><p>{
    "@context": [
        "https://geojson.org/geojson-ld/geojson-context.jsonld",
        {
            "@version": "1.1",
            "wx": "https://api.weather.gov/ontology#",
            "@vocab": "https://api.weather.gov/ontology#"
        }
    ],
    "type": "FeatureCollection",
    "features": [
        {
            "id": "https://api.weather.gov/alerts/urn:oid:2.49.0.1.840.0.957a95b11de1ec54b622b137ccf43a662d44061f.001.1",
            "type": "Feature",
            "geometry": null,
            "properties": ....(snip)

From what I understand the "@context" tag indicates that the subsequent lines within braces are JSON data; is that correct?

How do I get at the elements inside the square and curly braces?

BS apparently has a JSON parser, but I haven't found any good tutorials on how to use it for someone who's a noob at this.

Pointers would be most welcome.

Upvotes: 0

Views: 63

Answers (1)

HedgeHog

Reputation: 25073

The question would benefit from some additional details. As mentioned in the comments, the response does not look like plain HTML but rather JSON.

  1. The HTML in your soup is wrapping added by the 'lxml' parser; it is not part of the response itself.

  2. You do not need BeautifulSoup for this task, and no, it is not a JSON parser.

  3. Instead, call .json() on your response (see the requests docs).

Example
...
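import requests

# Fetch the URL and decode the JSON response body into Python objects
json_data = requests.get('YOUR URL').json()

# 'features' is a list; each entry is a dict describing one alert
for i in json_data['features']:
    print(i['id'])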

...
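
To address the question about getting at the elements inside the square and curly braces: once the text is decoded, square brackets become Python lists and curly braces become dicts, so ordinary indexing works. The following is a minimal sketch using the standard-library json module on a shortened sample shaped like the output in the question. The name raw_text, the shortened "id", and the "properties" content are placeholders made up for illustration, not real NWS data. With requests, the same string would typically come from response.text.

```python
import json

# Shortened sample shaped like the question's output. The "id" and
# "properties" values are hypothetical placeholders, not real NWS data.
raw_text = """
{
    "@context": [
        "https://geojson.org/geojson-ld/geojson-context.jsonld",
        {"@version": "1.1", "wx": "https://api.weather.gov/ontology#"}
    ],
    "type": "FeatureCollection",
    "features": [
        {
            "id": "https://api.weather.gov/alerts/example-id",
            "type": "Feature",
            "geometry": null,
            "properties": {"event": "Example Event"}
        }
    ]
}
"""

# Decode the JSON text into nested Python lists and dicts
data = json.loads(raw_text)

# Square brackets -> list, so index by position
first_context = data["@context"][0]

# Curly braces -> dict, so index by key
wx_namespace = data["@context"][1]["wx"]

# Walk the list of features and collect one field from each dict
feature_ids = [feature["id"] for feature in data["features"]]

# JSON null is decoded as Python None
geometry = data["features"][0]["geometry"]

print(feature_ids)
```

Note that "@context" is just another key in the object; the whole response body is JSON, not only the part that follows it.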

Upvotes: 1
