Reputation: 117
I am attempting to extract campaign_hearts and postal_code from the code in the script tag here (the entire code is too long to post):
<script>
...
"campaign_hearts":4817,"social_share_total":11242,"social_share_last_update":"2020-01-17T10:51:22-06:00","location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"},"is_partner":false,"partner":{},"is_team":true,"team":{"name":"Team STEVENS NATION","team_pic_url":"https://d2g8igdw686xgo.cloudfront.net
...
I can identify the script I need with the following code:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests
import re
import json
page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.content, 'html.parser')
all_scripts = soup.find_all('script')
all_scripts[0]
However, I'm at a loss for how to extract the values I want. (I'm very new to Python.) This thread recommended the following solution for a similar problem (edited to reflect the html I'm working with).
data = json.loads(all_scripts[0].get_text()[27:])
However, running this produces an error: JSONDecodeError: Expecting value: line 1 column 1 (char 0).
What can I do to extract the values I need now that I have the correct script identified? I have also tried the solutions listed here, but had trouble importing Parser.
Upvotes: 9
Views: 3422
Reputation: 2702
This should be fine for now; I might write a pure lxml version later, or at least improve how the element is located.
This solution uses a regex to capture only the JSON data, without the window.initialState = prefix and the trailing semicolon.
import json
import re
import requests
from bs4 import BeautifulSoup
url_1 = "https://www.gofundme.com/f/eric-stevens-care-trust"
req = requests.get(url_1)
soup = BeautifulSoup(req.content, 'lxml')
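# Assumes the state object lives in the first <script> tag on the page.
script_tag = soup.find('script')
# fullmatch only succeeds if the whole tag text is a single line of the form "window.initialState = ...;"
raw_json = re.fullmatch(r"window\.initialState = (.+);", script_tag.text).group(1)
json_content = json.loads(raw_json)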
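If the regex turns out to be too strict (for instance, when the tag text has extra whitespace after the semicolon), one alternative is to let the json module find the end of the object itself. This is a minimal sketch using only the standard library; the sample string and the helper name parse_initial_state are made up for illustration, and it assumes the assignment is written exactly as "window.initialState = " followed by the object.

```python
import json

# Hypothetical sample standing in for the real <script> text.
sample = 'window.initialState = {"feed": {"campaign": {"campaign_hearts": 4817, "note": "a};b"}}};\n'

PREFIX = "window.initialState = "  # assumed spelling and spacing of the assignment


def parse_initial_state(script_text):
    """Return the object assigned to window.initialState in script_text."""
    start = script_text.index(PREFIX) + len(PREFIX)
    # raw_decode parses one JSON value beginning at `start` and ignores
    # whatever follows it (here, the trailing semicolon and newline).
    obj, _end = json.JSONDecoder().raw_decode(script_text, start)
    return obj


state = parse_initial_state(sample)
print(state["feed"]["campaign"]["campaign_hearts"])
```

Because the decoder tracks string boundaries, a "};" sequence inside a string value (as in the sample's "note" field) does not cut the object short.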
Upvotes: 1
Reputation: 3231
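Your json.loads call was failing because the sliced string is not valid JSON. The "Expecting value ... (char 0)" message means the fixed offset [27:] does not land on the opening brace of the object, and the final semicolon would break parsing as well. It will work if you use a regex to extract only the object string (excluding the final semicolon).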
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests
import re
import json
page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")
soup = BeautifulSoup(page.content, 'html.parser')
all_scripts = soup.find_all('script')
txt = all_scripts[0].get_text()
data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])
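To see how the slice offset and the semicolon each affect json.loads, here is a small sketch on a made-up string. The sample text and the helper try_parse are assumptions for illustration; the real script is far longer, but the decoder behaves the same way.

```python
import json

# Hypothetical script text; the real one is much longer.
txt = 'window.initialState = {"campaign_hearts": 4817};'


def try_parse(s):
    """Return the parsed value, or the decoder's error message on failure."""
    try:
        return json.loads(s)
    except json.JSONDecodeError as err:
        return err.msg


prefix_len = len("window.initialState = ")

print(try_parse(txt[27:]))            # slice starts inside the object
print(try_parse(txt[prefix_len:]))    # correct start, semicolon still attached
print(try_parse(txt[prefix_len:-1]))  # correct start, semicolon removed
```

The first call reports "Expecting value", the second "Extra data", and only the third returns a dictionary.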
Upvotes: 1
Reputation: 195408
You can parse the content of the <script> tag with the json module and then get your values. For example:
import re
import json
import requests
url = 'https://www.gofundme.com/f/eric-stevens-care-trust'
txt = requests.get(url).text
data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])
# print( json.dumps(data, indent=4) ) # <-- uncomment this to see all data
print('Campaign Hearts =', data['feed']['campaign']['campaign_hearts'])
print('Postal Code =', data['feed']['campaign']['location']['postal_code'])
Prints:
Campaign Hearts = 4817
Postal Code = 90012
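The key path data['feed']['campaign'][...] raises a KeyError if the site ever renames or drops a level. A small helper can make the lookup tolerant of that. This is only a sketch: the function name dig is hypothetical, and the dictionary below is a trimmed-down stand-in for the parsed page state, not the real data.

```python
def dig(obj, *keys, default=None):
    """Follow keys through nested dicts; return default if any key is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


# Hypothetical, trimmed-down stand-in for the parsed page state.
data = {"feed": {"campaign": {"campaign_hearts": 4817,
                              "location": {"postal_code": "90012"}}}}

print(dig(data, "feed", "campaign", "campaign_hearts"))
print(dig(data, "feed", "campaign", "location", "postal_code"))
print(dig(data, "feed", "campaign", "team", "name", default="n/a"))
```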
Upvotes: 4
Reputation: 1606
The more libraries you use, the more inefficient the code becomes! Here is a simpler solution that needs only requests and string splitting:
# Download the page content.
import requests
url = "https://www.gofundme.com/f/eric-stevens-care-trust"
# Use GET, not POST: we are only reading the page. .text gives the decoded string.
html = requests.get(url).text
# Cut each value out of the raw text using the keys that surround it.
campaign_hearts = html.split('campaign_hearts":')[1]
campaign_hearts = campaign_hearts.split(',"social_share_total"')[0]
print(campaign_hearts)
postal_code = html.split('postal_code":"')[1]
postal_code = postal_code.split('"},"is_partner')[0]
print(postal_code)
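Splitting on the neighbouring keys depends on the field order staying the same. If you want to stay with plain text matching, a regex per field avoids that dependency. The sketch below runs on a fragment copied from the question rather than the live page; the helper name field is hypothetical.

```python
import re

# Fragment shaped like the snippet in the question (not fetched from the site).
html = ('"campaign_hearts":4817,"social_share_total":11242,'
        '"location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"}')


def field(text, name, pattern):
    """Return the first captured value for "name":<pattern>, or None if absent."""
    match = re.search(r'"%s":%s' % (re.escape(name), pattern), text)
    return match.group(1) if match else None


hearts = field(html, "campaign_hearts", r'(\d+)')     # unquoted integer
postal = field(html, "postal_code", r'"([^"]*)"')     # quoted string
print(hearts, postal)
```

Both values come back as strings, so convert with int(hearts) if you need a number.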
Upvotes: 2