Reputation: 1620
I'm trying to snip a embedded json from a webpage and then passing the json object to json.loads()
. First url is okay but when loading the second url it's return error
ValueError: Unterminated string starting at: line 1 column 2078 (char 2077)
here is the code
import requests,json
from bs4 import BeautifulSoup
urls = ['https://www.autotrader.co.uk/dealers/greater-manchester/manchester/williams-landrover-9994',
'https://www.autotrader.co.uk/dealers/warwickshire/stratford-upon-avon/guy-salmon-land-rover-stratford-upon-avon-9965'
]
for url in urls:
r = requests.get(url)
soup = BeautifulSoup(r.content,'lxml')
scripts = soup.find_all('script')[0]
data = scripts.text.split("window['AT_APOLLO_STATE'] = ")[1].split(';')[0]
jdata = json.loads(data)
print(jdata)
Upvotes: 0
Views: 98
Reputation: 84465
Reason has been given. You could also regex out appropriate string
import requests,json
urls = ['https://www.autotrader.co.uk/dealers/greater-manchester/manchester/williams-landrover-9994',
'https://www.autotrader.co.uk/dealers/warwickshire/stratford-upon-avon/guy-salmon-land-rover-stratford-upon-avon-9965'
]
p = re.compile(r"window\['AT_APOLLO_STATE'\] =(.*?});", re.DOTALL)
for url in urls:
r = requests.get(url)
jdata = json.loads(p.findall(r.text)[0])
print(jdata)
Missed a }
in the original post.
Upvotes: 1
Reputation: 3328
If you print out scripts.text.split("window['AT_APOLLO_STATE'] = ")[1]
, you will see the follows that includes a ;
right after and enthusiastic
. So you get an invalid json string from scripts.text.split("window['AT_APOLLO_STATE'] = ")[1].split(';')[0]
. And the data ends with and enthusiastic
that is not a valid json
string.
"strapline":"In our state-of-the-art dealerships across the U.K, Sytner Group
represents the world’s most prestigious car manufacturers.
All of our staff are knowledgeable and enthusiastic; making every interaction
special by going the extra mile.",
Upvotes: 2