RJames

Reputation: 117

How to extract content from <script> using Beautiful Soup

I am attempting to extract campaign_hearts and postal_code from the code in the script tag here (the entire code is too long to post):

<script>
...    
"campaign_hearts":4817,"social_share_total":11242,"social_share_last_update":"2020-01-17T10:51:22-06:00","location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"},"is_partner":false,"partner":{},"is_team":true,"team":{"name":"Team STEVENS NATION","team_pic_url":"https://d2g8igdw686xgo.cloudfront.net
...

I can identify the script I need with the following code:

from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from time import sleep
import requests 
import re
import json


page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")

soup = BeautifulSoup(page.content, 'html.parser')

all_scripts = soup.find_all('script')
all_scripts[0]

However, I'm at a loss for how to extract the values I want. (I'm very new to Python.) This thread recommended the following solution for a similar problem (edited to reflect the html I'm working with).

data = json.loads(all_scripts[0].get_text()[27:])

However, running this produces an error: JSONDecodeError: Expecting value: line 1 column 1 (char 0).

What can I do to extract the values I need now that I have the correct script identified? I have also tried the solutions listed here, but had trouble importing Parser.

Upvotes: 9

Views: 3422

Answers (4)

AMC

Reputation: 2702

This should be fine for now. I might try to write a pure lxml version, or at least improve how the element is located.

This solution uses a regex to capture only the JSON data, without the leading window.initialState = assignment and the trailing semicolon.

import json
import re

import requests
from bs4 import BeautifulSoup

url_1 = "https://www.gofundme.com/f/eric-stevens-care-trust"

req = requests.get(url_1)

soup = BeautifulSoup(req.content, 'lxml')

script_tag = soup.find('script')

raw_json = re.fullmatch(r"window\.initialState = (.+);", script_tag.text).group(1)

json_content = json.loads(raw_json)
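
To try the same idea without hitting the network, here is a minimal sketch that runs against a made-up stand-in for the script text. The sample_script string is an assumption (the real script is far longer), and I swapped re.fullmatch for re.search with re.DOTALL so that surrounding whitespace or newlines in the tag do not stop the match.

```python
import json
import re

# Hypothetical stand-in for script_tag.text; the real script is far longer.
sample_script = '''
    window.initialState = {"feed": {"campaign": {"campaign_hearts": 4817, "location": {"postal_code": "90012"}}}};
'''

# re.search tolerates text before and after the assignment;
# re.DOTALL lets "." match newlines inside the object as well.
match = re.search(r"window\.initialState\s*=\s*(\{.*\})\s*;", sample_script, re.DOTALL)
if match is None:
    raise ValueError("window.initialState assignment not found")

json_content = json.loads(match.group(1))
campaign = json_content["feed"]["campaign"]
print(campaign["campaign_hearts"], campaign["location"]["postal_code"])
```

The greedy .* runs to the last "};" in the text, which suits a script holding a single assignment. If the script contains further statements after it, this pattern would need tightening.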

Upvotes: 1

eric.christensen

Reputation: 3231

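Your json.loads was failing because the string you passed in was not pure JSON. An error at char 0 means the text did not start with a valid JSON value, so the [27:] slice probably did not line up with the opening brace, and the final semicolon would have broken parsing as well. It will work if you use a regex to extract only the object string (excluding the assignment and the final semicolon).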

import json
import re

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.gofundme.com/f/eric-stevens-care-trust")

soup = BeautifulSoup(page.content, 'html.parser')

all_scripts = soup.find_all('script')
txt = all_scripts[0].get_text()
# Capture only the {...} object, dropping the assignment and the final semicolon.
data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])
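
As a small illustration of how leftover JavaScript around the object affects json.loads, the sketch below feeds it three variants of a tiny made-up payload. The payload string and the decode_error helper are hypothetical names of mine, not something taken from the page.

```python
import json

# Hypothetical miniature payload standing in for the real object.
payload = '{"campaign_hearts": 4817}'


def decode_error(text):
    """Return the JSONDecodeError message for text, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return exc.msg
    return None


# JavaScript left in front of the object: parsing fails at the first character.
prefix_error = decode_error("window.initialState = " + payload)
# Semicolon left behind the object: the object parses, then extra text is found.
suffix_error = decode_error(payload + ";")
# Nothing but the object: no error.
clean_error = decode_error(payload)

print(prefix_error)
print(suffix_error)
print(clean_error)
```

Checking which of the two messages you get is a quick way to tell whether the start or the end of the string still needs trimming.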

Upvotes: 1

Andrej Kesely

Reputation: 195408

You can parse the content of <script> with the json module and then read your values from the resulting dictionary. For example:

import re
import json
import requests

url = 'https://www.gofundme.com/f/eric-stevens-care-trust'

txt = requests.get(url).text

data = json.loads(re.findall(r'window\.initialState = ({.*?});', txt)[0])

# print( json.dumps(data, indent=4) )  # <-- uncomment this to see all data

print('Campaign Hearts =', data['feed']['campaign']['campaign_hearts'])
print('Postal Code     =', data['feed']['campaign']['location']['postal_code'])

Prints:

Campaign Hearts = 4817
Postal Code     = 90012
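
If the regex ever turns out to be fragile (for instance, if "};" appears inside one of the JSON strings), one possible alternative is json.JSONDecoder().raw_decode, which parses a single JSON value starting at a given index and ignores whatever follows it. This is only a sketch: the txt string below is a hypothetical stand-in for the page source, and it assumes the assignment is written exactly as "window.initialState = ".

```python
import json

# Hypothetical stand-in for the page source; the real response is much longer.
txt = (
    '<script>window.initialState = {"feed": {"campaign": {"campaign_hearts": 4817, '
    '"location": {"city": "Los Angeles, CA", "postal_code": "90012"}}}};</script>'
)

marker = "window.initialState = "
# Index of the opening brace, right after the assignment.
start = txt.index(marker) + len(marker)

# raw_decode returns the parsed value and the index where that value ended.
data, end = json.JSONDecoder().raw_decode(txt, start)

campaign = data["feed"]["campaign"]
print("Campaign Hearts =", campaign["campaign_hearts"])
print("Postal Code     =", campaign["location"]["postal_code"])
```

Note that raw_decode does not skip whitespace at the start index, so the marker has to end exactly where the object begins.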

Upvotes: 4

Amit Ghosh

Reputation: 1606

The more libraries you use, the more inefficient the code becomes. Here is a simpler solution:

# Fetch the page source.

import requests
url = "https://www.gofundme.com/f/eric-stevens-care-trust"
response = requests.get(url)  # fetching a page is a GET request, not a POST
html = response.text

# Cut each value out of the raw text by splitting on the strings around it.

campaign_hearts = html.split('campaign_hearts":')[1]
campaign_hearts = campaign_hearts.split(',"social_share_total"')[0]
print(campaign_hearts)

postal_code = html.split('postal_code":"')[1]
postal_code = postal_code.split('"},"is_partner')[0]
print(postal_code)
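
The splitting can also be wrapped in a small helper so that a missing key gives None instead of an IndexError. This is just a sketch: between is a hypothetical helper name, and sample is the fragment quoted in the question rather than a live response.

```python
def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    value, found_end, _ = rest.partition(end)
    return value if found_end else None


# Fragment of the script as quoted in the question.
sample = (
    '"campaign_hearts":4817,"social_share_total":11242,'
    '"social_share_last_update":"2020-01-17T10:51:22-06:00",'
    '"location":{"city":"Los Angeles, CA","country":"US","postal_code":"90012"},'
    '"is_partner":false'
)

campaign_hearts = between(sample, '"campaign_hearts":', ',')
postal_code = between(sample, '"postal_code":"', '"')
missing = between(sample, '"no_such_key":', ',')

print(campaign_hearts, postal_code, missing)
```

Both values come back as strings, so call int(campaign_hearts) if you need a number. This approach depends on the exact key order and formatting of the page, so it may break if the site changes its output.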

Upvotes: 2
