Unable to separate certain fields from each container of ingredients

Question

I'm trying to separate three 3 fields, as in name,unit, and measure out of some ingredient containers from a webpage. I used BeautifulSoup to parse the ingredient containers and then re module to separate unit and measure. This is the portion in that site I'm interested in grabbing the three fields from.

This is how I've tried so far:

import re
import requests
from bs4 import BeautifulSoup

link = 'https://www.delicious.com.au/recipes/gnocchi-walnut-rosemary-pecorino-pesto/1b0defa9-53c8-4e9c-8c93-fb96a5348b31?r=recipes/gallery/opvo6a3l'

def get_content(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("ul.ingredient > li"):
        ingr_container = item.get_text(strip=True)
        ingr_unit_container = re.search(r"[\d.⁄a-z]+",ingr_container).group(0)
        ingr_name = re.sub(ingr_unit_container,"",ingr_container).strip()
        ingr_unit = re.sub(r"[a-z]+","",ingr_unit_container).strip()
        ingr_measure = re.sub(r"[\d.⁄]+","",ingr_unit_container).strip()
        yield ingr_name,ingr_unit,ingr_measure

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
        for item in get_content(s,link):
            print(item)

Ingredient containers are like:

500g potato gnocchi
2 tbs extra virgin olive oil
Finely grated zest and juice of 1 lemon
1⁄2 bunch basil, leaves picked
1 tbs finely chopped rosemary, plus fried rosemary leaves to serve
2 garlic cloves, crushed
50g grated pecorino, (or parmesan) plus extra to serve
50g roasted and chopped walnuts, plus extra to serve
100ml extra virgin olive oil

Current output the script produces from the above containers:

('potato gnocchi', '500', 'g')
('tbs extra virgin olive oil', '2', '')
('F grated zest and juice of 1 lemon', '', 'inely')
('bunch basil, leaves picked', '1⁄2', '')
('tbs finely chopped rosemary, plus fried rosemary leaves to serve', '1', '')
('garlic cloves, crushed', '2', '')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')

Expected output:

('potato gnocchi', '500', 'g')
('extra virgin olive oil', '2', 'tbs')
('Finely grated zest and juice of', '1', 'lemon')
('basil, leaves picked', '1⁄2', 'bunch')
('finely chopped rosemary, plus fried rosemary leaves to serve', '1', 'tbs')
('cloves, crushed', '2', 'garlic')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')

SIM · Accepted Answer

I'm nowhere close to good at regex. However, I find the following implementation working:

import re
import requests
from bs4 import BeautifulSoup

link = 'https://www.delicious.com.au/recipes/gnocchi-walnut-rosemary-pecorino-pesto/1b0defa9-53c8-4e9c-8c93-fb96a5348b31?r=recipes/gallery/opvo6a3l'

def get_content(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select("ul.ingredient > li"):
        ingr_container = item.get_text(strip=True)
        unit_container = re.search(r'[\d.⁄]+\s*?[a-zA-Z]+\s*?',ingr_container).group(0)
        ingr_name = ingr_container.replace(unit_container,"").strip()
        ingr_unit = re.search(r'[\d.⁄]+',unit_container).group(0)
        ingr_measure = unit_container.replace(ingr_unit,"").strip()
        yield ingr_name,ingr_unit,ingr_measure

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
        for item in get_content(s,link):
            print(item)

Output:

('potato gnocchi', '500', 'g')
('extra virgin olive oil', '2', 'tbs')
('Finely grated zest and juice of', '1', 'lemon')
('basil, leaves picked', '1⁄2', 'bunch')
('finely chopped rosemary, plus fried rosemary leaves to serve', '1', 'tbs')
('cloves, crushed', '2', 'garlic')
('grated pecorino, (or parmesan) plus extra to serve', '50', 'g')
('roasted and chopped walnuts, plus extra to serve', '50', 'g')
('extra virgin olive oil', '100', 'ml')

Unable to separate certain fields from each container of ingredients

Answers (2)

Related Questions