Reputation: 143
I am having a hard time extracting the data. I need to extract the post's title and its posted date; here's the url.
In the page source there is a script containing JSON with the data I need.
It looks something like this (I cropped the rest of the text to save space):
<script>
window.__RELAY_STORE__ = {"public_at":"2019-05-22T11:02:43-04:00","updated_at":"2019-05-22T15:25:20-04:00","thumbnail_attribution":null,"body":null,"title":"Safety Concerns Over Tesla's Autopilot from Consumer Reports as Wall Street Turns Bearish"
</script>
I only need to get the "public_at" and "title" values.
What I have tried is this:
# Locate the script
data = response.xpath("//script[contains(., 'window.__RELAY_STORE__')]/text()")
# Extract the script text
datatxt = data.extract_first()
# Find the start and end of the data
start = datatxt.find('client:') - 2
end = datatxt.find('window.__REDUX_STATE__')
json_string = datatxt[start:end]
But when I load it to convert it to a Python dictionary:
data = json.loads(json_string)
I get an error like this:
Extra data: line 1 column 27284 (char 27283)
Any idea how I can get this data?
Upvotes: 2
Views: 284
Reputation: 3717
Try getting the data this way:
txt = response.xpath("//script[contains(., 'window.__RELAY_STORE__')]/text()").re_first('window.__RELAY_STORE__ = (.*);')
This strips the name of the JS variable and the trailing semicolon (;). After that, calling json.loads(txt) gives me valid JSON.
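For context, json.loads raises "Extra data" when anything other than whitespace follows the first complete JSON value, which is what happens if the slice still includes the trailing ; or the start of the next statement. Here is a minimal sketch of the idea outside Scrapy, using plain re and json. The script_text below is a made-up, shortened stand-in modelled on the snippet in the question (the real payload is larger and may nest these keys differently), and the raw_decode variant is just an alternative I'm suggesting, not something the page requires:

```python
import json
import re

# Hypothetical script text, modelled on the cropped snippet in the question.
script_text = (
    'window.__RELAY_STORE__ = {"public_at":"2019-05-22T11:02:43-04:00",'
    '"title":"Safety Concerns Over Tesla\'s Autopilot"};\n'
    'window.__REDUX_STATE__ = {"other":1};'
)

# Option 1: capture what sits between the variable name and the last ';'
# on the same line ('.' does not match newlines by default).
match = re.search(r'window\.__RELAY_STORE__ = (.*);', script_text)
data = json.loads(match.group(1))
print(data["public_at"], data["title"])

# Option 2: let the JSON decoder stop by itself at the end of the first
# object, ignoring whatever follows it in the script.
start = script_text.index('{', script_text.index('window.__RELAY_STORE__'))
obj, end = json.JSONDecoder().raw_decode(script_text, start)
print(obj == data)
```

Option 2 does not depend on the assignment ending at a line break, so it may be the safer choice if the script turns out to be minified onto a single line.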
Upvotes: 2