Reputation: 75
I scraped a website for its application/ld+json script, which contains JSON, and I want to convert that string to a Python dictionary, but it doesn't seem to be working. In the terminal I get the error JSONDecodeError("Expecting value", s, err.value) from None. I'm relatively new to working with JSON, so I might have made a dumb mistake, but nothing I found on Stack Overflow worked. Any help would be greatly appreciated, and thank you for taking the time to read my post!
Here is my code:
from flask import Flask, render_template
from bs4 import BeautifulSoup
import requests
import json
source = requests.get('https://www.visionlearning.com/en/library/Chemistry/1/Nuclear-Chemistry/59').text
soup = BeautifulSoup(source, 'html.parser')
jsonString = str(soup.find_all('script', type='application/ld+json')[0])
print(json.loads(jsonString))
Upvotes: 0
Views: 234
Reputation: 75
This is what finally worked. I dropped the str() call and added .contents[0] to the end of jsonString:
source = requests.get('https://www.visionlearning.com/en/library/Chemistry/1/Nuclear-Chemistry/59')
soup = BeautifulSoup(source.content, 'html.parser')
jsonString = soup.find_all('script', type='application/ld+json')[0].contents[0]
print(json.loads(jsonString))
Thank you for all the help though!
Upvotes: 0
Reputation: 15578
Since you only want the first match, you don't need .find_all; .find returns the first match directly. Get its contents as a string with .get_text() or .text, then parse that string with json.loads.
from bs4 import BeautifulSoup
import requests
import json
source = requests.get('https://www.visionlearning.com/en/library/Chemistry/1/Nuclear-Chemistry/59').text
soup = BeautifulSoup(source, 'html.parser')
jsonString = soup.find('script', type='application/ld+json')
print(json.loads(jsonString.get_text(strip=True)))
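If the page might carry more than one ld+json block, or none at all, a slightly more defensive version could look like the sketch below. It is only an illustration: extract_json_ld and SAMPLE_HTML are made-up names, the inline HTML stands in for the downloaded page so the code runs offline, and it uses find_all and .string rather than find and get_text so that every block is collected.

```python
import json

from bs4 import BeautifulSoup

# Made-up page standing in for the downloaded HTML, so the sketch runs offline.
SAMPLE_HTML = """
<html><head>
<script>var tracking = true;</script>
<script type="application/ld+json">
{"@type": "Article", "name": "Nuclear Chemistry"}
</script>
</head><body></body></html>
"""


def extract_json_ld(html):
    """Return every application/ld+json block in html as a parsed Python object."""
    soup = BeautifulSoup(html, "html.parser")
    parsed = []
    for tag in soup.find_all("script", type="application/ld+json"):
        # tag.string is the text between the tags, or None for an empty tag.
        if tag.string and tag.string.strip():
            parsed.append(json.loads(tag.string))
    return parsed


blocks = extract_json_ld(SAMPLE_HTML)
print(blocks[0]["name"])
```

With a real page you would pass requests.get(url).text instead of SAMPLE_HTML. An empty list means no ld+json script was found, which is easier to handle than an IndexError or an AttributeError on None.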
Upvotes: 1
Reputation: 11515
import requests
from bs4 import BeautifulSoup
import json
def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Relies on the first <script> on the page being the ld+json block.
    target = json.loads(soup.find("script").text)
    print(target.keys())
main("https://www.visionlearning.com/en/library/Chemistry/1/Nuclear-Chemistry/59")
Output:
dict_keys(['@context', '@type', 'mainEntityOfPage', 'name', 'headline', 'author', 'datePublished', 'dateModified', 'image', 'publisher', 'description', 'keywords', 'inLanguage', 'copyrightHolder', 'copyrightYear'])
Upvotes: 1
Reputation: 1285
If you print out jsonString, you will see that it includes the <script> tag itself. Pass only the tag's inner content to json.loads:
jsonString = str(soup.find_all('script', type='application/ld+json')[0].text)
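To see why the surrounding tag matters, here is a small standard-library sketch that reproduces the error without any scraping. The two strings are hypothetical stand-ins for what str(tag) and tag.text would hold; they are not taken from the actual page.

```python
import json

# Hypothetical stand-in for str(tag): the JSON still wrapped in its <script> element.
tag_as_string = '<script type="application/ld+json">{"name": "Nuclear Chemistry"}</script>'

failure = None
try:
    json.loads(tag_as_string)
except json.JSONDecodeError as err:
    # Parsing fails at the very first character, "<", which cannot start a JSON value.
    failure = (err.msg, err.pos)

# Hypothetical stand-in for tag.text: only what sits between the tags.
inner_text = '{"name": "Nuclear Chemistry"}'
data = json.loads(inner_text)

print(failure)
print(data["name"])
```

The decoder reports "Expecting value" at position 0 because the string starts with markup rather than JSON, which matches the error in the question.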
Upvotes: 2