Reputation: 133
I want to scrape the product description from the page in the link below.
I first tried Selenium, but the website protects this information, so Selenium returns exactly the same content as requests. To make the script run faster, I fetch the page with requests instead.
Here is the code:
import requests
from bs4 import BeautifulSoup as BS

res = requests.get("https://www.real.de/product/345246038/")
soup = BS(res.text, 'lxml')
code = soup.prettify()
split = code.split("attributes:")
for value in split:
    after = value.split(",condition$:b")
    for value in after:
        if "{default:[{name:" in value:
            clean = value.replace(",highlighted:void 0}}", "}").replace(": None", "")
Here is the string in the variable clean:
I then try to convert clean into a dictionary:
import yaml
# safe_load avoids the Loader argument that yaml.load requires in recent PyYAML
d = yaml.safe_load(clean)
But the result is not a proper dictionary, because not all of the words are enclosed in double quotes ("").
So I use a regex to extract only the words in the string that are not in double quotes. Here is the code:
import re

# Match a bare word with {, comma or colon on its left and colon or } on its right
r = re.compile(r'[{,:][a-zA-Z]+[:}]', flags=re.I | re.X)
string = r.findall(clean)
ta = []
for w in string:
    m = re.search('[a-zA-Z]+', w)
    if m:
        new = '"' + m.group(0) + '"'
        ta.append(new)
However, I don't know how to put the words, now wrapped in double quotes (""), back into the clean variable.
Can you help me?
Upvotes: 1
Views: 114
Reputation: 6534
You can try a negative lookahead, (?!").
It matches only at a position where the next character is not a double quote.
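import json
import re

# inside the inner loop from the question
if "{default:[{name:" in value:
    clean = value.replace(",highlighted:void 0}}", "}").replace(": None", "")
    # add the lines below
    # quote bare keys: a word after { or , and before :
    clean = re.sub(r'(\{|,)(?!")(\w+?):', r'\1"\2":', clean)
    # quote bare values: a word after : and before } or ,
    clean = re.sub(r':(?!")(\w+?)(\}|,)', r':"\1"\2', clean)
    jsonData = json.loads(clean)
    print(json.dumps(jsonData, indent=2))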
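To see what the two substitutions do without hitting the website, here is a minimal sketch that runs them on a made-up string. The sample text and the helper name quote_bare_tokens are assumptions for illustration only; the real page data will look different and may need extra clean-up.

```python
import json
import re


def quote_bare_tokens(text: str) -> str:
    """Wrap unquoted keys and unquoted single-word values in double quotes."""
    # Keys: a word that follows "{" or "," and precedes ":", unless a quote comes next.
    text = re.sub(r'(\{|,)(?!")(\w+?):', r'\1"\2":', text)
    # Values: a word that follows ":" and precedes "}" or ",", unless a quote comes next.
    text = re.sub(r':(?!")(\w+?)(\}|,)', r':"\1"\2', text)
    return text


# Hypothetical sample in the same loose style as the question, NOT the real page data.
sample = '{default:[{name:"Color",value:red},{name:"Size",value:42}]}'

quoted = quote_bare_tokens(sample)
data = json.loads(quoted)
print(json.dumps(data, indent=2))
```

Note that bare numbers are quoted too, so 42 comes back as the string "42". The patterns also do not know whether they are inside an already quoted string, so a quoted value that itself contains something like ,word: would be altered; for data like that a more tolerant parser is probably a safer choice.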
Upvotes: 1