Reputation: 164
I've written a script in Python using the requests module together with the BeautifulSoup library and the re module to extract a script element that contains nicely formatted JSON. I would like to use re to pull just that portion out of the otherwise messy script.
The script in question is the one in the page source that contains var masterCompanyData =.
This is what the script with the JSON content looks like (you can see it by running the following code):
import re
import requests
from bs4 import BeautifulSoup
url = 'https://conference.iste.org/2019/exhibitors/floorplan.php'
r = requests.get(url)
soup = BeautifulSoup(r.text,"lxml")
script = soup.select_one("script:contains('masterCompanyData')").text
# p = re.compile(r'masterCompanyData = (.*);')
# jsonContent = p.findall(script)
# print(jsonContent)
print(script)
The string manipulation that let me extract it:
items = soup.select_one("script:contains('masterCompanyData = ')").text.split("masterCompanyData = ")[1].split("Holder for the current zoom value")[0].split("/**")[0].replace(";","").strip()
Although I've successfully extracted that portion using string manipulation, I'd rather not do it that way. I would like to extract the JSON content using a regex, but I get an empty list.
How can I get that JSON content using a regex?
Upvotes: 1
Views: 57
Reputation: 84465
Try the following regex. The re.DOTALL flag lets . match newlines, so the capture group can span several lines:
import requests
import re
import json
r = requests.get('https://conference.iste.org/2019/exhibitors/floorplan.php')
p1 = re.compile(r'var masterCompanyData = (.*?);\n\n\n', re.DOTALL)
item = p1.findall(r.text)[0]
data = json.loads(item)
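To illustrate why the flag matters, here is a minimal offline sketch. The SAMPLE_SCRIPT text is a hypothetical stand-in for the page's script, assuming the object spans several lines and is followed by a semicolon and two blank lines; the real page may be laid out differently.

```python
import json
import re

# Hypothetical stand-in for the <script> text; the real page layout is an assumption.
SAMPLE_SCRIPT = """
var otherSetting = 1;
var masterCompanyData = {
    "exhibitors": [
        {"name": "Acme", "booth": "101"},
        {"name": "Globex", "booth": "202"}
    ]
};


/** Holder for the current zoom value */
var zoom = 1;
"""

# Without re.DOTALL, '.' stops at a newline, so a multi-line object never matches.
single_line = re.compile(r'masterCompanyData = (.*);')
no_match = single_line.findall(SAMPLE_SCRIPT)

# With re.DOTALL, '.' also matches newlines; the lazy (.*?) stops at the first
# semicolon that is followed by three newline characters.
multi_line = re.compile(r'var masterCompanyData = (.*?);\n\n\n', re.DOTALL)
matches = multi_line.findall(SAMPLE_SCRIPT)
data = json.loads(matches[0])
```

On this sample, the pattern without the flag finds nothing, while the DOTALL version captures the whole object, which json.loads can then parse.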
Using your idea of selecting the script with BeautifulSoup first:
import requests
import re
import json
from bs4 import BeautifulSoup as bs
r = requests.get('https://conference.iste.org/2019/exhibitors/floorplan.php')
p1 = re.compile(r'var masterCompanyData = (.*?);\n\n\n', re.DOTALL)
soup = bs(r.content, 'lxml')
script = soup.select_one("script:contains('masterCompanyData')").text
string = p1.findall(script)[0]
x = json.loads(string)
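The pattern above depends on the object being followed by a semicolon and three newlines. If that layout changes, one alternative (not part of the answer above, just a sketch) is to use the regex only to locate the assignment and let json.JSONDecoder.raw_decode read the object from that position, ignoring whatever follows it. The helper name extract_json_after and the sample text are hypothetical.

```python
import json
import re

def extract_json_after(text, marker_pattern):
    """Parse the JSON value that starts right after the first match of marker_pattern."""
    match = re.search(marker_pattern, text)
    if match is None:
        raise ValueError("marker not found")
    # raw_decode parses one JSON document starting at the given index and
    # returns (object, end_index); trailing text after the document is ignored.
    obj, _end = json.JSONDecoder().raw_decode(text, match.end())
    return obj

# Hypothetical sample; note the semicolon inside a string value.
SAMPLE_SCRIPT = '''
var masterCompanyData = {"exhibitors": [{"name": "Acme; Inc.", "booth": "101"}]};
var zoom = 1;
'''

# \s* consumes the whitespace around '=', so the index lands on the opening brace.
company_data = extract_json_after(SAMPLE_SCRIPT, r'var masterCompanyData\s*=\s*')
```

Because the JSON parser decides where the object ends, semicolons inside string values and the number of trailing newlines do not affect the result.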
Upvotes: 1
Reputation: 1905
import json
import requests
from bs4 import BeautifulSoup
url = 'https://conference.iste.org/2019/exhibitors/floorplan.php'
r = requests.get(url)
soup = BeautifulSoup(r.text,"lxml")
# p = re.compile(r'masterCompanyData = (.*);')
# jsonContent = p.findall(script)
# print(jsonContent)
for s in soup.find_all('script'):
    # Only the script that defines masterCompanyData is of interest
    if 'var masterCompanyData' in str(s):
        finalstr = ''
        for line in str(s).split('\n'):
            if 'var masterCompanyData' in line:
                # Start collecting from the text after the '=' sign
                finalstr = line.split('=')[-1]
                continue
            if line[-2:] == '};' and finalstr:
                # Last line of the object: drop the trailing ';' and stop
                finalstr += line[:-1]
                break
            if finalstr:
                finalstr += line
        break
finalstr is now a string containing the desired JSON. If you want, you can do this after the loop:
import json
dictWithJSON = json.loads(finalstr)
Upvotes: 0