Reputation: 141
I am new to web scraping with BeautifulSoup and would like to extract some information from zalando.de.
I have already located the element where the information I need (price, article number, ...) can be found. Is it possible to load it into an accessible data type (e.g. a dictionary) so that I can extract the information by key?
from bs4 import BeautifulSoup
import requests
source = requests.get("https://en.zalando.de/carhartt-wip-hooded-chase-sweatshirt-c1422s02x-g13.html?_rfl=de").text
soup = BeautifulSoup(source, "lxml")
scr = soup.find("script", id = "z-vegas-pdp-props").text
Upvotes: 1
Views: 649
Reputation: 2368
This improves on the other answer. The following code gives you the articleInfo dictionary, from which the information requested in the question (price, article number, ...) is easier to access than from the full nested dict.
from bs4 import BeautifulSoup
import requests
import json
source = requests.get("https://en.zalando.de/carhartt-wip-hooded-chase-sweatshirt-c1422s02x-g13.html?_rfl=de").text
soup = BeautifulSoup(source, "lxml")
scr = soup.find("script", id = "z-vegas-pdp-props").text
# lstrip/rstrip remove any of the listed characters, which strips the CDATA wrapper here
data = json.loads(scr.lstrip('<![CDATA').rstrip(']>'))
desired_data = dict(data['model']['articleInfo'])
print(desired_data)
The output looks like this.
{'modelId': 'C1422S02X',
'id': 'C1422S02X-G13',
'shopUrl': 'https://en.zalando.de/carhartt-wip-hooded-chase-sweatshirt-c1422s02x-g13.html',
'sizeFits': None,
'commodity_group': {'values': ['2', '2', 'S', '4']},
'active': True,
'name': 'HOODED CHASE - Hoodie - cranberry/gold',
'color': 'cranberry/gold',
'silhouette_code': 'pullover',
'product_group': 'clothing',
'category_tag': 'Sweatshirt',
......
'price': {'currency': 'EUR', 'value': 74.95, 'formatted': '74,95\xa0€'},
......
}
You can serialize the dictionary back to a JSON string with:
json_output = json.dumps(desired_data)
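To show how the individual values can then be read by key, here is a minimal sketch. It assumes the dictionary has the shape shown in the output above and hard-codes an excerpt of it so that it runs without a network request; the "brand" key is hypothetical and only illustrates a lookup with a default.

```python
import json

# Excerpt of the output above, hard-coded so the sketch runs offline.
desired_data = {
    "id": "C1422S02X-G13",
    "name": "HOODED CHASE - Hoodie - cranberry/gold",
    "sizeFits": None,
    "price": {"currency": "EUR", "value": 74.95, "formatted": "74,95\xa0€"},
}

# Plain key access; nested dicts are indexed one level at a time.
article_number = desired_data["id"]
price = desired_data["price"]["value"]
currency = desired_data["price"]["currency"]

# .get() returns a default instead of raising KeyError for a missing key.
# "brand" is a hypothetical key, not taken from the page.
brand = desired_data.get("brand", "unknown")

print(article_number, price, currency)  # C1422S02X-G13 74.95 EUR

# ensure_ascii=False keeps characters such as the euro sign readable.
json_output = json.dumps(desired_data, ensure_ascii=False)
```

Since the page layout can change, using .get() for optional fields may be safer than direct indexing.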
Upvotes: 0
Reputation: 7238
Yes, you can load it into a dictionary. You can use the json
module to parse the string.
The text needs to be turned into valid JSON first. You can do that by removing the invalid parts (the CDATA wrapper around the JSON).
from bs4 import BeautifulSoup
import requests
import json
source = requests.get("https://en.zalando.de/carhartt-wip-hooded-chase-sweatshirt-c1422s02x-g13.html?_rfl=de").text
soup = BeautifulSoup(source, "lxml")
scr = soup.find("script", id = "z-vegas-pdp-props").text
# lstrip/rstrip remove any of the listed characters, which strips the CDATA wrapper here
data = json.loads(scr.lstrip('<![CDATA').rstrip(']>'))
print(data['layout'])
# cover
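One caveat about the stripping step: str.lstrip and str.rstrip treat their argument as a set of characters, not as a literal prefix or suffix. That works here because the JSON starts with "{" and ends with "}", neither of which is in the set. A slightly more defensive variant is sketched below. It assumes the script text contains exactly one JSON object wrapped in CDATA markers; the helper name extract_json and the sample string are made up for illustration.

```python
import json


def extract_json(raw: str) -> dict:
    """Parse the JSON object embedded in a wrapped string (hypothetical helper)."""
    start = raw.find("{")   # first opening brace
    end = raw.rfind("}")    # last closing brace
    if start == -1 or end < start:
        raise ValueError("no JSON object found in script text")
    return json.loads(raw[start:end + 1])


# Made-up sample that mimics the CDATA wrapper; not real page content.
sample = '<![CDATA[{"layout": "cover", "model": {"articleInfo": {"id": "C1422S02X-G13"}}}]]>'
data = extract_json(sample)
print(data["layout"])  # cover
```

In the scraping code, extract_json(scr) would take the place of the lstrip/rstrip call. It is also worth checking that soup.find(...) did not return None before reading .text, since the script id may change.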
Upvotes: 1