Ashok Kumar Jayaraman

Reputation: 3085

How to extract JSON from within an HTML comment using BeautifulSoup?

I want to extract the JSON content that sits inside an HTML comment within a script tag, using BeautifulSoup.

<script data_id ="dfsfre2323" data_key="23424sfsfsfdafd" type="application/json"><!--
{"employee": {"name":"sonoo", "salary":56000, "married":true}}--></script>

The output should be as follows:

Name: sonoo
Salary: 56000
Married: True

I have tried the following:

from bs4 import BeautifulSoup, Comment
import json
soup = BeautifulSoup(webpage, "html.parser")
data = soup.find("script", {"type": "application/json", "data_id": "dfsfre2323", "data_key": "23424sfsfsfdafd"})
comment = soup.find(text=lambda text:isinstance(data, Comment))

But comment comes back empty.

Any help would be appreciated.

Upvotes: 3

Views: 90

Answers (1)

Andrej Kesely

Reputation: 195418

BeautifulSoup doesn't parse the content of a <script> tag as HTML; it keeps it as one plain string, so your .find(text=...) has no Comment node to find. Parse the script's string with BeautifulSoup a second time, then call .find() on that result:

import json
from bs4 import BeautifulSoup, Comment


txt = '''
<script data_id ="dfsfre2323" data_key="23424sfsfsfdafd" type="application/json"><!--
    {"employee": {"name":"sonoo", "salary":56000, "married":true}}
--></script>'''

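soup = BeautifulSoup(txt, "html.parser")
# Locate the <script> tag by its attributes
data = soup.find("script", {"type":"application/json", 'data_id':"dfsfre2323", 'data_key':"23424sfsfsfdafd"})
# Re-parse the script's raw text so the <!-- ... --> becomes a Comment node
comment = BeautifulSoup(data.string, "html.parser").find(text=lambda t: isinstance(t, Comment))

# The comment body is the JSON document
data = json.loads(comment)

print(json.dumps(data, indent=4))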

Prints:

{
    "employee": {
        "name": "sonoo",
        "salary": 56000,
        "married": true
    }
}
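
To get the exact "Name: ..." layout asked for in the question, the parsed dictionary can be walked key by key. The sketch below is one possible way to do that, and it also shows an alternative to the second BeautifulSoup pass: stripping the comment markers by hand. It assumes script_text holds what data.string returns for the script tag above; extract_commented_json and format_employee are hypothetical helper names, not part of BeautifulSoup.

```python
import json

# Assumption: this is the text that data.string returns for the <script> tag above.
script_text = '''<!--
    {"employee": {"name":"sonoo", "salary":56000, "married":true}}
-->'''


def extract_commented_json(text):
    """Strip the HTML comment markers around a JSON payload and decode it."""
    body = text.strip()
    if body.startswith("<!--"):
        body = body[len("<!--"):]
    if body.endswith("-->"):
        body = body[:-len("-->")]
    return json.loads(body)


def format_employee(employee):
    """Render each key/value pair as 'Key: value', one pair per line."""
    return "\n".join(f"{key.capitalize()}: {value}" for key, value in employee.items())


employee = extract_commented_json(script_text)["employee"]
report = format_employee(employee)
print(report)
```

Because json.loads turns the JSON literal true into Python's True, the last line of the report reads "Married: True", matching the requested output.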

Upvotes: 1
