MITHU

Reputation: 164

Can't dig out nicely formatted json content out of some messy script

I've written a script in Python using the requests module along with the BeautifulSoup library and the re module to scrape a script tag in which nicely formatted JSON content is available. I'd like to use re to pull just that portion out of the otherwise messy script.

The script tag in question is the one in the page source that contains var masterCompanyData =.

Website link

This is what the script containing the JSON content looks like (you can see it by running the following):

import re
import requests
from bs4 import BeautifulSoup

url = 'https://conference.iste.org/2019/exhibitors/floorplan.php'

r = requests.get(url)
soup = BeautifulSoup(r.text,"lxml")
script = soup.select_one("script:contains('masterCompanyData')").text
# p = re.compile(r'masterCompanyData = (.*);')
# jsonContent = p.findall(script)
# print(jsonContent)
print(script)

String manipulation that let me extract it:

items = soup.select_one("script:contains('masterCompanyData = ')").text.split("masterCompanyData = ")[1].split("Holder for the current zoom value")[0].split("/**")[0].replace(";","").strip()

Although I've successfully extracted that portion using string manipulation, I don't want to go that way. I'd rather extract the JSON content using regex, but the commented-out pattern above gives me an empty list.

How can I get that json content using regex?

Upvotes: 1

Views: 57

Answers (2)

QHarr

Reputation: 84465

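Try the following regex. Your pattern returns an empty list because . does not match newlines by default and the JSON spans several lines. Passing re.DOTALL changes that, and the non-greedy group stops at the first semicolon followed by three newlines, which appears to be where the assignment ends in the page source.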

import requests
import re
import json

r = requests.get('https://conference.iste.org/2019/exhibitors/floorplan.php')
p1 = re.compile(r'var masterCompanyData = (.*?);\n\n\n', re.DOTALL)
item = p1.findall(r.text)[0]
data = json.loads(item)

Using your idea of selecting the script tag with BeautifulSoup first:

import requests
import re
import json
from bs4 import BeautifulSoup as bs

r = requests.get('https://conference.iste.org/2019/exhibitors/floorplan.php')
p1 = re.compile(r'var masterCompanyData = (.*?);\n\n\n', re.DOTALL)
soup = bs(r.content, 'lxml')
script = soup.select_one("script:contains('masterCompanyData')").text
string = p1.findall(script)[0]
x = json.loads(string)
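
To see the difference without hitting the site, here is a minimal sketch on a made-up string. sample_script and its contents are assumptions that only imitate the layout of the real page (a multi-line object followed by a semicolon and blank lines); the two patterns are the one from the question and the one above.

```python
import json
import re

# Hypothetical stand-in for the page's script text; the real payload is much larger.
sample_script = """
var masterCompanyData = {
    "companies": [
        {"name": "Acme", "booth": "101"}
    ]
};


/** Holder for the current zoom value */
var zoom = 1;
"""

# Without re.DOTALL, '.' stops at a newline, so a multi-line object never matches.
single_line = re.findall(r'masterCompanyData = (.*);', sample_script)

# With re.DOTALL, '.' also matches newlines; the non-greedy group stops at the
# first semicolon that is followed by three newlines.
multi_line = re.findall(r'var masterCompanyData = (.*?);\n\n\n', sample_script, re.DOTALL)

data = json.loads(multi_line[0])
print(single_line)
print(data["companies"][0]["name"])
```

Note that the \n\n\n anchor depends on the page's current whitespace, so it may need adjusting if the markup changes.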

Upvotes: 1

101arrowz

Reputation: 1905

import json
import requests
from bs4 import BeautifulSoup

url = 'https://conference.iste.org/2019/exhibitors/floorplan.php'

r = requests.get(url)
soup = BeautifulSoup(r.text,"lxml")
# Find the <script> tag that defines masterCompanyData, then collect the
# lines of its object literal up to the closing '};'.
for s in soup.findAll('script'):
    if 'var masterCompanyData' in str(s):
        finalstr = ''
        for line in str(s).split('\n'):
            if 'var masterCompanyData' in line:
                finalstr = line.split('=', 1)[1]
                continue
            if line[-2:] == '};' and finalstr:
                finalstr += line[:-1]
                break
            if finalstr:
                finalstr+=line
        break

finalstr is now a string containing the desired JSON. If you want, you can do this after the loop:

import json
dictWithJSON = json.loads(finalstr)
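
As an alternative to scanning line by line, a different technique is to let the json module find the end of the object itself with JSONDecoder.raw_decode. This is only a sketch: extract_js_object and sample_script are hypothetical names, and it assumes the value after the equals sign is strict JSON (unquoted keys or trailing commas would make it fail).

```python
import json
import re


def extract_js_object(script_text, var_name):
    """Parse the JSON value assigned to var_name inside a block of JavaScript."""
    # Locate the assignment; match.end() is the index where the value starts.
    match = re.search(r'var\s+' + re.escape(var_name) + r'\s*=\s*', script_text)
    if match is None:
        raise ValueError(f'{var_name} not found')
    # raw_decode parses one JSON document starting at the given index and
    # ignores whatever text follows it.
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value


# Hypothetical script text; note the '=' and ';' inside the string value.
sample_script = 'var masterCompanyData = {"a": {"url": "x?y=1;z=2"}};\n/** zoom */\nvar zoom = 1;'
result = extract_js_object(sample_script, 'masterCompanyData')
print(result["a"]["url"])
```

Because the decoder decides where the object ends, semicolons or equals signs inside string values do not matter here.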

Upvotes: 0
