I am trying to scrape a recipe site whose ingredients are grouped into separate categories, each marked by a <strong> tag in the HTML, as shown below:
<div class="opskriften">
<p class="h3">Ingrediensliste</p>
<p></p>
<p><strong>Påskeæg med nougat (6 stk)</strong><br>150 g. marcipan <br>ca. 40 g. nougat<br>150 g. mørk chokolade <br>50 g. lys chokolade </p>
I managed to split the ingredients into separate columns for the amount, unit and ingredient, but I am having trouble adding another column for the content inside the <strong> tags.
This is the code that I used.
import re
import pandas as pd

# soup is the BeautifulSoup object for the recipe page
ingredients = soup.find('div', class_='opskriften')
#if len(ingredients.find_all('strong'))>0:
s = f"{ingredients}"
r = re.compile(r"(?P<amount>\d+)\s+(?P<unit>\w+.)\s+(?P<ingredient>.+?(?=<))")
df = pd.DataFrame([m.groupdict() for m in r.finditer(s)])
with open("somefile.csv", 'w') as fh:
    df.to_csv(fh)
I tried playing around with the RegEx but couldn't find any solution to make it work.
(Image: what the website I am scraping looks like.)
Upvotes: 1
Views: 247
Reputation: 309
Here are some suggestions. There may be a parsing problem related to the language (the non-ASCII characters), which could be why the opening of the br tags is being dropped.
from bs4 import BeautifulSoup
soup_chunk = '''<div class="opskriften">
<p class="h3">Ingrediensliste</p>
<p></p>
<p><strong>Påskeæg med nougat (6 stk)</strong><br>150 g. marcipan <br>ca. 40 g. nougat<br>150 g. mørk chokolade <br>50 g. lys chokolade </p>'''
soup = BeautifulSoup(soup_chunk,'lxml')
requiredData = []
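for tags in soup.find_all('p'):
    # Only paragraphs that contain <br> tags hold an ingredient list.
    if tags.select('br'):
        contents = {}
        contents['MainItem'] = tags.select('strong')[0].text
        # Remove the <strong> heading so that only the ingredients remain.
        for i in tags.select('strong'):
            i.decompose()
        # str(tags) serializes <br> as <br/>; drop </p> before stripping whitespace.
        contents['SubItems'] = [i.replace("</p>", '').strip() for i in str(tags).split("<br/>") if "<p>" not in i]
        requiredData.append(contents)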
print(requiredData)
I put the output in a list of dicts so it can be used anywhere.
[{'MainItem': 'Påskeæg med nougat (6 stk)', 'SubItems': ['150 g. marcipan', 'ca. 40 g. nougat', '150 g. mørk chokolade', '50 g. lys chokolade']}]
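As a follow-up, here is a minimal sketch of how this list of dicts could be flattened into one row per ingredient with the category alongside it, which is the extra column the question asks for. The names `flatten` and `LINE_RE` are made up for illustration, and the regex assumes each line looks like an optional "ca.", an integer amount, a unit, then the ingredient name. Lines that do not fit are kept with an empty amount and unit.

```python
import re

# Sample data in the shape printed above (a list of dicts).
required_data = [{
    'MainItem': 'Påskeæg med nougat (6 stk)',
    'SubItems': ['150 g. marcipan', 'ca. 40 g. nougat',
                 '150 g. mørk chokolade', '50 g. lys chokolade'],
}]

# Assumed line shape: optional "ca.", integer amount, unit, ingredient name.
LINE_RE = re.compile(
    r"^(?:ca\.\s*)?(?P<amount>\d+)\s+(?P<unit>\w+\.?)\s+(?P<ingredient>.+)$"
)


def flatten(data):
    """Return one row per ingredient, tagged with its category (MainItem)."""
    rows = []
    for group in data:
        for line in group['SubItems']:
            m = LINE_RE.match(line)
            if m is None:
                # Keep lines that do not fit the assumed pattern.
                row = {'amount': '', 'unit': '', 'ingredient': line}
            else:
                row = m.groupdict()
            row['category'] = group['MainItem']
            rows.append(row)
    return rows


rows = flatten(required_data)
for row in rows:
    print(row)
```

The rows are plain dicts, so they can be passed to `pd.DataFrame(rows)` and written out with `to_csv` as in the question.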
Upvotes: 1
Reputation: 23753
If all of the divs look the same, you can parse the ingredients with BeautifulSoup. This relies on a <strong> tag being a child of the <p> tag that contains all the ingredients:
from bs4 import BeautifulSoup as BS
s = '''<div class="opskriften">
<p class="h3">Ingrediensliste</p>
<p></p>
<p><strong>Påskeæg med nougat (6 stk)</strong><br>150 g. marcipan <br>ca. 40 g. nougat<br>150 g. mørk chokolade <br>50 g. lys chokolade </p>
'''
soup = BS(s,'html.parser')
q = soup.find('div', class_='opskriften')
r = q.find('strong')
ingredients = r.parent
In [13]: for tag in ingredients.children:
    ...:     if tag.name == 'strong':
    ...:         print(tag.text)
    ...:     elif tag.name == 'br':
    ...:         continue
    ...:     else:
    ...:         print(tag)
    ...:
Påskeæg med nougat (6 stk)
150 g. marcipan
ca. 40 g. nougat
150 g. mørk chokolade
50 g. lys chokolade
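If a page has several <strong> headings, the same walk can remember the most recent heading and attach it to every ingredient that follows. Below is a rough sketch of that idea, written with the standard library's `html.parser.HTMLParser` instead of BeautifulSoup so it runs without extra packages. The class name `IngredientParser` is hypothetical, and the logic assumes every ingredient is a bare text node that comes after its heading.

```python
from html.parser import HTMLParser


class IngredientParser(HTMLParser):
    """Collect (category, ingredient) pairs.

    The text of the most recent <strong> tag is used as the category.
    """

    def __init__(self):
        super().__init__()
        self.in_strong = False
        self.category = None  # None until the first <strong> is seen
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == 'strong':
            self.in_strong = True
            self.category = ''

    def handle_endtag(self, tag):
        if tag == 'strong':
            self.in_strong = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_strong:
            self.category += text
        elif self.category is not None:
            # Text after a heading is treated as one ingredient line.
            self.rows.append((self.category, text))


html = '''<div class="opskriften">
<p class="h3">Ingrediensliste</p>
<p></p>
<p><strong>Påskeæg med nougat (6 stk)</strong><br>150 g. marcipan <br>ca. 40 g. nougat<br>150 g. mørk chokolade <br>50 g. lys chokolade </p>'''

parser = IngredientParser()
parser.feed(html)
parser.close()
for category, ingredient in parser.rows:
    print(category, '|', ingredient)
```

Text that appears before the first <strong> (here the "Ingrediensliste" title) is skipped because no category has been set yet.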
If the <p> tag that contains all the ingredients is always the last <p> tag in the div, then you can find it like this:
q = soup.find('div', class_='opskriften')
ingredients = q.find_all('p')[-1]
Upvotes: 1