qualitaetsmuell

Reputation: 141

Get Information with BeautifulSoup and make it extractable

I am new to webscraping with BeautifulSoup and would like to extract some information from zalando.de.

I have already located the element that contains the information I need (price, article number, ...). Is it possible to store its content in an accessible data type (e.g. a dictionary) so that I can look up each piece of information by its key?

from bs4 import BeautifulSoup
import requests

source = requests.get("https://en.zalando.de/carhartt-wip-hooded-chase-sweatshirt-c1422s02x-g13.html?_rfl=de").text
soup = BeautifulSoup(source, "lxml")
scr = soup.find("script", id = "z-vegas-pdp-props").text

Upvotes: 1

Views: 649

Answers (2)

codeslord

Reputation: 2368

This builds on the other answer. The following code extracts the articleInfo sub-dictionary, which holds the information asked for in the question and is easier to work with than the full nested dictionary.

from bs4 import BeautifulSoup
import requests
import json

source = requests.get("https://en.zalando.de/carhartt-wip-hooded-chase-sweatshirt-c1422s02x-g13.html?_rfl=de").text
soup = BeautifulSoup(source, "lxml")
scr = soup.find("script", id = "z-vegas-pdp-props").text

data = json.loads(scr.lstrip('<![CDATA').rstrip(']>'))
desired_data = dict(data['model']['articleInfo'])
print(desired_data)

The output looks like this (abridged):

{'modelId': 'C1422S02X',
 'id': 'C1422S02X-G13',
 'shopUrl': 'https://en.zalando.de/carhartt-wip-hooded-chase-sweatshirt-c1422s02x-g13.html',
 'sizeFits': None,
 'commodity_group': {'values': ['2', '2', 'S', '4']},
 'active': True,
 'name': 'HOODED CHASE  - Hoodie - cranberry/gold',
 'color': 'cranberry/gold',
 'silhouette_code': 'pullover',
 'product_group': 'clothing',
 'category_tag': 'Sweatshirt',

......
'price': {'currency': 'EUR', 'value': 74.95, 'formatted': '74,95\xa0€'},
......
}

You can serialize the dictionary back into a JSON string using

json_output = json.dumps(desired_data)
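As a rough sketch of the key-based access the question asks about, assuming a dictionary shaped like the output above: the literal below copies only a few fields from that output, and "brand" is a hypothetical key used just to show a lookup that misses.

```python
import json

# Excerpt of the dictionary printed above; the real one has many more keys.
desired_data = {
    "id": "C1422S02X-G13",
    "name": "HOODED CHASE  - Hoodie - cranberry/gold",
    "color": "cranberry/gold",
    "price": {"currency": "EUR", "value": 74.95, "formatted": "74,95\xa0€"},
}

# Top-level values are read directly by key.
article_number = desired_data["id"]

# Nested values need one lookup per level.
price_value = desired_data["price"]["value"]
currency = desired_data["price"]["currency"]

# .get() returns a default instead of raising KeyError when a key is absent.
# "brand" is a hypothetical key that is not part of this excerpt.
brand = desired_data.get("brand", "unknown")

print(article_number, price_value, currency, brand)
```

Using .get() with a default is worth considering for scraped data, since the page's embedded JSON may change shape without notice.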

Upvotes: 0

Keyur Potdar

Reputation: 7238

Yes, you can. The content of the script tag is JSON, so the json module can parse the string into a Python dictionary.

The text first has to be turned into valid JSON. You can do that by stripping the invalid parts (the CDATA wrapper around the JSON).

from bs4 import BeautifulSoup
import requests
import json

source = requests.get("https://en.zalando.de/carhartt-wip-hooded-chase-sweatshirt-c1422s02x-g13.html?_rfl=de").text
soup = BeautifulSoup(source, "lxml")
scr = soup.find("script", id = "z-vegas-pdp-props").text

data = json.loads(scr.lstrip('<![CDATA').rstrip(']>'))
print(data['layout'])
# cover
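One caveat about the stripping step: lstrip and rstrip remove any of the given characters, not a fixed prefix or suffix. That works here because the payload is a JSON object, but it would also eat the brackets of a payload that starts with "[" or ends with "]". Below is a minimal sketch of a stricter variant. It assumes the wrapper is exactly "<![CDATA[" ... "]]>"; strip_cdata and both sample strings are made up for illustration and are not taken from the page.

```python
import json

CDATA_PREFIX = "<![CDATA["
CDATA_SUFFIX = "]]>"


def strip_cdata(text):
    """Remove one surrounding CDATA wrapper, if present (hypothetical helper)."""
    text = text.strip()
    if text.startswith(CDATA_PREFIX) and text.endswith(CDATA_SUFFIX):
        # Slice off exactly the wrapper, leaving the payload untouched.
        return text[len(CDATA_PREFIX):-len(CDATA_SUFFIX)]
    return text


# Made-up stand-ins for the script text, not real page content.
sample_object = '<![CDATA[{"layout": "cover"}]]>'
sample_array = '<![CDATA[[1, 2]]]>'

data = json.loads(strip_cdata(sample_object))
items = json.loads(strip_cdata(sample_array))

print(data["layout"])
print(items)
```

With the character-based lstrip/rstrip, the array sample would be reduced to "1, 2", which json.loads rejects.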

Upvotes: 1
