Reputation: 164
I'm trying to separate three fields (name, unit, and measure) out of some ingredient containers on a webpage. I used BeautifulSoup to parse the ingredient containers and then the re module to separate unit and measure. This is the portion of that site I'm interested in grabbing the three fields from.
This is what I've tried so far:
import re
import requests
from bs4 import BeautifulSoup

link = 'https://www.delicious.com.au/recipes/gnocchi-walnut-rosemary-pecorino-pesto/1b0defa9-53c8-4e9c-8c93-fb96a5348b31?r=recipes/gallery/opvo6a3l'

def get_content(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("ul.ingredient > li"):
        ingr_container = item.get_text(strip=True)
        ingr_unit_container = re.search(r"[\d.⁄a-z]+",ingr_container).group(0)
        ingr_name = re.sub(ingr_unit_container,"",ingr_container).strip()
        ingr_unit = re.sub(r"[a-z]+","",ingr_unit_container).strip()
        ingr_measure = re.sub(r"[\d.⁄]+","",ingr_unit_container).strip()
        yield ingr_name,ingr_unit,ingr_measure

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
        for item in get_content(s,link):
            print(item)
The ingredient containers look like this:
500g potato gnocchi
2 tbs extra virgin olive oil
Finely grated zest and juice of 1 lemon
1⁄2 bunch basil, leaves picked
1 tbs finely chopped rosemary, plus fried rosemary leaves to serve
2 garlic cloves, crushed
50g grated pecorino, (or parmesan) plus extra to serve
50g roasted and chopped walnuts, plus extra to serve
100ml extra virgin olive oil
Current output the script produces from the above containers:
('potato gnocchi', '500', 'g')
('tbs extra virgin olive oil', '2', '')
('F grated zest and juice of 1 lemon', '', 'inely')
('bunch basil, leaves picked', '1⁄2', '')
('tbs finely chopped rosemary, plus fried rosemary leaves to serve', '1', '')
('garlic cloves, crushed', '2', '')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')
Expected output:
('potato gnocchi', '500', 'g')
('extra virgin olive oil', '2', 'tbs')
('Finely grated zest and juice of', '1', 'lemon')
('basil, leaves picked', '1⁄2', 'bunch')
('finely chopped rosemary, plus fried rosemary leaves to serve', '1', 'tbs')
('cloves, crushed', '2', 'garlic')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')
Upvotes: 1
Views: 56
Reputation: 22440
I'm far from good at regex, but the following implementation works for me:
import re
import requests
from bs4 import BeautifulSoup

link = 'https://www.delicious.com.au/recipes/gnocchi-walnut-rosemary-pecorino-pesto/1b0defa9-53c8-4e9c-8c93-fb96a5348b31?r=recipes/gallery/opvo6a3l'

def get_content(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("ul.ingredient > li"):
        ingr_container = item.get_text(strip=True)
        unit_container = re.search(r'[\d.⁄]+\s*?[a-zA-Z]+\s*?',ingr_container).group(0)
        ingr_name = ingr_container.replace(unit_container,"").strip()
        ingr_unit = re.search(r'[\d.⁄]+',unit_container).group(0)
        ingr_measure = unit_container.replace(ingr_unit,"").strip()
        yield ingr_name,ingr_unit,ingr_measure

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
        for item in get_content(s,link):
            print(item)
Output:
('potato gnocchi', '500', 'g')
('extra virgin olive oil', '2', 'tbs')
('Finely grated zest and juice of', '1', 'lemon')
('basil, leaves picked', '1⁄2', 'bunch')
('finely chopped rosemary, plus fried rosemary leaves to serve', '1', 'tbs')
('cloves, crushed', '2', 'garlic')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')
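One caveat with the code above: re.search returns None when a line contains no number, so .group(0) would raise AttributeError on such a line. Below is a minimal sketch of a guarded variant that works on plain strings rather than the live page. The helper name split_ingredient and the 'Salt, to taste' line are hypothetical and only there for illustration; the regex is the same idea as above.

```python
import re

# A number (digits, dots, or the fraction slash "⁄"), optional whitespace,
# then the first word that follows it.
QTY_RE = re.compile(r'[\d.⁄]+\s*[a-zA-Z]+')


def split_ingredient(text):
    """Return (name, unit, measure) using the question's naming:
    unit is the number, measure is the word after it.
    Both are '' when the line has no quantity (hypothetical fallback)."""
    match = QTY_RE.search(text)
    if match is None:
        return text.strip(), '', ''
    unit_container = match.group(0)
    ingr_unit = re.search(r'[\d.⁄]+', unit_container).group(0)
    # Remove only the first occurrence so repeated text elsewhere is kept.
    ingr_measure = unit_container.replace(ingr_unit, '', 1).strip()
    ingr_name = text.replace(unit_container, '', 1).strip()
    return ingr_name, ingr_unit, ingr_measure


if __name__ == '__main__':
    for line in ['500g potato gnocchi',
                 '1⁄2 bunch basil, leaves picked',
                 'Salt, to taste']:
        print(split_ingredient(line))
```

As in the original, the word right after the number is always taken as the measure, which is why "2 garlic cloves, crushed" yields 'garlic' as the measure.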
Upvotes: 2
Reputation: 1769
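One solution is to search the text for the token that contains digits, which is the measure. It gets a bit tricky because sometimes the unit is attached to the measure (e.g. 500g) and sometimes there is an empty space between them (e.g. 2 tbs). You can handle both cases with conditions (there might be a regex solution, too). Note that in this code ingr_measure holds the number and ingr_unit holds the text, the reverse of the naming in the question: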
import re
import requests
from bs4 import BeautifulSoup

link = 'https://www.delicious.com.au/recipes/gnocchi-walnut-rosemary-pecorino-pesto/1b0defa9-53c8-4e9c-8c93-fb96a5348b31?r=recipes/gallery/opvo6a3l'

def get_content(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("ul.ingredient > li"):
        ingr_container = item.get_text(strip=True).split()
        for index, string in enumerate(ingr_container):
            if re.search(r'\d', string):  # token contains at least one digit
                if not string.isdecimal():  # digits are mixed with other characters
                    if not string.isalnum():  # not alphanumeric: a fraction such as 1⁄2
                        ingr_measure = string
                        ingr_unit = ingr_container[index+1]
                        to_remove = [index, index+1]  # indices of the measure and unit tokens
                        break
                    else:  # digits and letters in one token (e.g. 500g): split them
                        for i, char in enumerate(string):
                            if char.isalpha():
                                ingr_measure = string[:i]
                                ingr_unit = string[i:]
                                to_remove = [index, index]
                                break
                        break
                else:  # plain number: the unit is the next token
                    ingr_measure = string
                    ingr_unit = ingr_container[index+1]
                    to_remove = [index, index+1]
                    break
        # ingr_name is the whole ingr_container without the measure and unit tokens
        ingr_name = ' '.join(ingr_container[:to_remove[0]] + ingr_container[to_remove[1]+1:])
        yield ingr_name, ingr_measure, ingr_unit

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
        for item in get_content(s,link):
            print(item)
Output:
('potato gnocchi', '500', 'g')
('extra virgin olive oil', '2', 'tbs')
('Finely grated zest and juice of', '1', 'lemon')
('basil, leaves picked', '1⁄2', 'bunch')
('finely chopped rosemary, plus fried rosemary leaves to serve', '1', 'tbs')
('cloves, crushed', '2', 'garlic')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')
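As for the regex solution hinted at above, here is a rough sketch of what it might look like: a single pattern with named groups, run against plain strings instead of the scraped page. The names INGREDIENT_RE and parse_ingredient are made up for this example, and returning empty strings when no number is found is an assumption, not something the question specifies.

```python
import re

# One pattern with named groups. "measure" is the number and "unit" the word
# after it, following the naming used in this answer.
INGREDIENT_RE = re.compile(
    r'^(?P<before>.*?)'        # any text in front of the number (lazy)
    r'(?P<measure>\d[\d.⁄]*)'  # number, possibly a fraction such as 1⁄2
    r'\s*'                     # optional space between number and unit
    r'(?P<unit>[A-Za-z]+)'     # first word after the number
    r'(?P<after>.*)$'          # rest of the line
)


def parse_ingredient(text):
    """Return (name, measure, unit); measure and unit are '' if nothing matches."""
    m = INGREDIENT_RE.match(text)
    if m is None:
        return text, '', ''
    # The name is everything around the matched measure and unit,
    # with runs of whitespace collapsed to single spaces.
    name = ' '.join((m.group('before') + m.group('after')).split())
    return name, m.group('measure'), m.group('unit')


if __name__ == '__main__':
    for line in ['500g potato gnocchi',
                 'Finely grated zest and juice of 1 lemon',
                 '1⁄2 bunch basil, leaves picked']:
        print(parse_ingredient(line))
```

This handles both the attached form (500g) and the spaced form (2 tbs) with the same pattern, at the cost of being harder to extend than the explicit conditions.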
Upvotes: 1