Reputation: 2897
I am trying to retrieve data from a table with BeautifulSoup, but my (beginner) syntax is wrong:
from bs4 import BeautifulSoup
import requests
main_url = "https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html"
req = requests.get(main_url)
soup = BeautifulSoup(req.text, "html.parser")
title = soup.find("div", id = "accordionContent5e95581b6e244")
results = {}
for row in title.findAll('tr'):
    aux = row.findAll('td')
    results[aux[0].string] = aux[1].string
print(results)
This is the relevant HTML:
<div id="accordionContent5e95581b6e244" class="panel-collapse collapse in">
<div class="panel-body">
<table class="table" width="100%">
<tbody>
<tr>
<th width="170">PZN</th>
<td>00520917</td>
</tr>
<tr>
<th width="170">Anbieter</th>
<td>Hexal AG</td>
</tr>
My goal is to build a dictionary from the th and td cells, with each th as the key and the matching td as the value.
How can this be done in BeautifulSoup?
Upvotes: 0
Views: 1314
Reputation: 1710
You are selecting the div by its id, which varies if you want to scrape more pages. Also, aux = row.findAll('td') returns a list with only one item, because you are not taking the th tags into account. That means aux[1].string will raise an exception. Here is the code:
from bs4 import BeautifulSoup
import requests
main_url = "https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html"
req = requests.get(main_url)
soup = BeautifulSoup(req.text, "html.parser")
title = soup.find("div", class_="panel-collapse collapse in")
results = {}
for row in title.findAll('tr'):
    key = row.find('th')
    value = row.find('td')
    results[key.text] = value.text.strip()
print(results)
Output:
{'PZN': '00520917', 'Anbieter': 'Hexal AG', 'Packungsgröße': '40\xa0St', 'Produktname': 'ACC akut 600mg Hustenlöser', 'Darreichungsform': 'Brausetabletten', 'Monopräparat': 'ja', 'Wirksubstanz': 'Acetylcystein', 'Rezeptpflichtig': 'nein', 'Apothekenpflichtig': 'ja'}
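If you want to check the th/td pairing without requesting the page, here is a minimal offline sketch. It assumes the table markup looks like the snippet in the question; the Packungsgröße row and its &nbsp; are my reconstruction from the output above, not copied from the page. It also skips rows that lack one of the two cells and swaps the non-breaking space for a normal one, which is optional.

```python
from bs4 import BeautifulSoup

# Inline copy of the table from the question. The third row is an assumption
# rebuilt from the posted output (the &nbsp; may differ on the real page).
HTML = """
<div class="panel-collapse collapse in">
  <div class="panel-body">
    <table class="table" width="100%">
      <tbody>
        <tr><th width="170">PZN</th><td>00520917</td></tr>
        <tr><th width="170">Anbieter</th><td>Hexal AG</td></tr>
        <tr><th width="170">Packungsgröße</th><td>40&nbsp;St</td></tr>
      </tbody>
    </table>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Each row holds one <th> and one <td>, so the list of <td> cells has a
# single item and indexing it with [1] would raise IndexError.
first_row = soup.find("tr")
td_count = len(first_row.find_all("td"))

results = {}
for row in soup.find_all("tr"):
    key = row.find("th")
    value = row.find("td")
    if key is None or value is None:
        continue  # skip rows that do not have both cells
    # Replace non-breaking spaces so the value prints as plain text.
    text = value.get_text(strip=True).replace("\xa0", " ")
    results[key.get_text(strip=True)] = text

print(td_count)
print(results)
```

The None check only matters if the real table has header-only or spacer rows, which I have not verified for this site.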
Upvotes: 0
Reputation: 33384
I would suggest using pandas to store the data in a DataFrame and then converting it into a dictionary.
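from io import StringIO
import pandas as pd
from bs4 import BeautifulSoup
import requests
main_url = "https://www.apo-in.de/product/acc-akut-600-brausetabletten.24170.html"
req = requests.get(main_url)
soup = BeautifulSoup(req.text, "html.parser")
# First table that is a direct child of the panel body
table = soup.select_one(".panel-body > table")
# Newer pandas versions expect a file-like object rather than a raw HTML string
df = pd.read_html(StringIO(str(table)))[0]
# Column 0 holds the labels, column 1 the values
print(df.set_index(0).to_dict('dict'))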
Output:
{1: {'Rezeptpflichtig': 'nein', 'Anbieter': 'Hexal AG', 'PZN': '00520917', 'Darreichungsform': 'Brausetabletten', 'Wirksubstanz': 'Acetylcystein', 'Monopräparat': 'ja', 'Packungsgröße': '40\xa0St', 'Apothekenpflichtig': 'ja', 'Produktname': 'ACC akut 600mg Hustenlöser'}}
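Note that the result above is nested under the column label 1. If a flat dictionary is wanted, selecting that column before converting should do it. A small sketch, using a hand-built DataFrame as a hypothetical stand-in for the parsed table (two unnamed columns, 0 and 1), so it runs without the page:

```python
import pandas as pd

# Hypothetical stand-in for the parsed table: no header row, so the
# columns are labelled 0 (names) and 1 (values).
df = pd.DataFrame([["PZN", "00520917"], ["Anbieter", "Hexal AG"]])

# to_dict on the whole frame nests the values under the column label 1.
nested = df.set_index(0).to_dict("dict")

# Selecting column 1 first gives a Series, whose to_dict() is flat.
flat = df.set_index(0)[1].to_dict()

print(nested)
print(flat)
```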
Upvotes: 1