Box

Reputation: 73

Web scraping returning empty dictionary

I'm trying to scrape all the data from this website https://ricetta.it/ricette-secondi using Python-Selenium.

I'd like to put them into a list of dictionaries, as in the code below. However, this just returns an empty list.

import pprint
detail_recipes = []
for recipe in list_recipes:
  title = ""
  description = ""
  ingredient = ""
  if(len(recipe.find_elements_by_css_selector(".post-title")) > 0):
    title = recipe.find_elements_by_css_selector(".post-title")[0].text
  if(len(recipe.find_elements_by_css_selector(".post-excerpt")) > 0):
    description = recipe.find_elements_by_css_selector(".post-excerpt")[0].text
  if(len(recipe.find_elements_by_css_selector(".nm-ingr")) > 0):
    ingredient = recipe.find_elements_by_css_selector(".nm-ingr")[0].text

  detail_recipes.append({'title': title,
                        'description': description,
                        'ingredient': ingredient
                        })

len(detail_recipes)
pprint.pprint(detail_recipes[0:10])

Upvotes: 2

Views: 167

Answers (1)

imxitiz

Reputation: 3987

You can try this:

import requests
import numpy as np
from bs4 import BeautifulSoup as bs
import pandas as pd

url = "https://ricetta.it/ricette-secondi"

page = requests.get(url)
soup = bs(page.content, 'lxml')

# One list per column; each recipe card appends one entry to each
df = {'title': [], 'description': [], 'ingredient': []}

for div in soup.find_all("div", class_="post-bordered"):
    df["title"].append(div.find(class_="post-title").text)
    try:
        df["description"].append(div.find(class_="post-excerpt").text)
    except AttributeError:  # some cards have no excerpt
        df["description"].append(np.nan)
    ingr = div.find_all(class_="nm-ingr")
    if len(ingr) > 0:
        df["ingredient"].append([j.text for j in ingr])
    else:
        df["ingredient"].append(np.nan)

df = pd.DataFrame(df)

# Drop rows with a missing description or ingredient list
df.dropna(axis=0, inplace=True)

print(df)

Output:

                               title  ...                                         ingredient
0       Polpette di pane e formaggio  ...  [uovo, pane, pangrattato, parmigiano, latte, s...
1     Torta 7 vasetti alle melanzane  ...  [uovo, olio, latte, yogurt, farina 00, fecola ...
2  Torta a sole con zucchine e speck  ...  [pasta sfoglia, zucchina, ricotta, uovo, speck...
3                    Pesto di limoni  ...  [limone, pinoli, parmigiano, basilico, prezzem...
4                    Bombe di patate  ...  [patata, farina 00, uovo, parmigiano, sale e p...
5             Polpettone di zucchine  ...  [zucchina, uovo, parmigiano, pangrattato, pros...
6                  Insalata di pollo  ...  [petto di pollo, zucchina, pomodorino, insalat...
7                      Club sandwich  ...  [pane, petto di pollo, pomodoro, lattuga, maio...
8                Crostata di verdure  ...  [farina 00, burro, acqua, sale, zucchina, pomo...
9              Pesto di barbabietola  ...  [barbabietola, parmigiano, pinoli, olio, sale,...

[10 rows x 3 columns]

I don't know whether you already use these libraries, but that website doesn't use JavaScript to load its data, so we can scrape it with requests and bs4. Most people prefer these libraries when a site doesn't load its data with JavaScript: they are simpler and faster than Selenium. For displaying the data I'm using pandas, which is also the preferred library for working with table-like data. It prints the data in a table-like structure, and you can also save the scraped data to a CSV or Excel file.
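If you want to verify that claim yourself, a minimal check is to look for the recipe markup in the raw HTML response; the URL and the post-title class are the same ones used above:

import requests

resp = requests.get("https://ricetta.it/ricette-secondi")
print(resp.status_code)           # expect 200
print("post-title" in resp.text)  # True means the cards are in the static HTML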
If you want to scrape the data from the following pages as well, try this:

df = {'title': [], 'description': [], 'ingredient': []}

for page_no in range(1, 108):  # the site has 107 pages of results
    url = f"https://ricetta.it/ricette-secondi?page={page_no}"
    page = requests.get(url)
    soup = bs(page.content, 'lxml')

    for div in soup.find_all("div", class_="post-bordered"):
        df["title"].append(div.find(class_="post-title").text)
        try:
            df["description"].append(div.find(class_="post-excerpt").text)
        except AttributeError:  # some cards have no excerpt
            df["description"].append(np.nan)
        ingr = div.find_all(class_="nm-ingr")
        if len(ingr) > 0:
            df["ingredient"].append([j.text for j in ingr])
        else:
            df["ingredient"].append(np.nan)

It will scrape all 107 pages of data from the website.
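As with the single page, you can then turn the collected columns into a DataFrame and drop the incomplete rows:

df = pd.DataFrame(df)
df.dropna(axis=0, inplace=True)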

You can save this df to a CSV or Excel file using:

df.to_csv("<filename.csv>")
# or for excel:
df.to_excel("<filename.xlsx>")
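If you don't want the DataFrame's row index written out as an extra column, both writers accept index=False:

df.to_csv("<filename.csv>", index=False)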

Edit:
Since you asked how to also scrape the link of each recipe, I have figured out two approaches. One is to scrape the link straight from the page; the other is to build it from the title by replacing spaces with -. For the first, you can use this piece of code:

div.find(class_="post-title")["href"]

It will return the link of that recipe. For the other approach, you can do this:

df["links"]=df["title"].apply(lambda x: "https://ricetta.it/"+x.replace(" ","-").lower())
#.lower() is just to not make like a random text but it you remove it also it works.

But I personally suggest scraping the links from the website, because when we build them ourselves we may make mistakes.
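For completeness, a minimal sketch of that first approach, assuming (as above) that the href attribute sits on the .post-title element; the relative-path check is my assumption, in case the site serves site-relative links:

links = []
for div in soup.find_all("div", class_="post-bordered"):
    href = div.find(class_="post-title")["href"]
    if href.startswith("/"):  # assumption: prefix the domain for relative links
        href = "https://ricetta.it" + href
    links.append(href)

print(links[:5])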

Upvotes: 2
