Reputation: 73
I'm trying to scrape all the data from this website https://ricetta.it/ricette-secondi
using Python-Selenium.
I'd like to put the data into dictionaries, as shown in the code below. However, it just returns an empty list.
import pprint

detail_recipes = []
for recipe in list_recipes:
    title = ""
    description = ""
    ingredient = ""
    if len(recipe.find_elements_by_css_selector(".post-title")) > 0:
        title = recipe.find_elements_by_css_selector(".post-title")[0].text
    if len(recipe.find_elements_by_css_selector(".post-excerpt")) > 0:
        description = recipe.find_elements_by_css_selector(".post-excerpt")[0].text
    if len(recipe.find_elements_by_css_selector(".nm-ingr")) > 0:
        ingredient = recipe.find_elements_by_css_selector(".nm-ingr")[0].text
    detail_recipes.append({'title': title,
                           'description': description,
                           'ingredient': ingredient
                           })

len(detail_recipes)
pprint.pprint(detail_recipes[0:10])
Upvotes: 2
Views: 167
Reputation: 3987
You can try this:
import requests
import numpy as np
from bs4 import BeautifulSoup as bs
import pandas as pd

url = "https://ricetta.it/ricette-secondi"
page = requests.get(url)
soup = bs(page.content, 'lxml')

df = {'title': [], 'description': [], 'ingredient': []}
for div in soup.find_all("div", class_="post-bordered"):
    df["title"].append(div.find(class_="post-title").text)
    try:
        df["description"].append(div.find(class_="post-excerpt").text)
    except AttributeError:  # this post has no excerpt
        df["description"].append(np.nan)
    ingr = div.find_all(class_="nm-ingr")
    if len(ingr) > 0:
        df["ingredient"].append([j.text for j in ingr])
    else:
        df["ingredient"].append(np.nan)

df = pd.DataFrame(df)
df.dropna(axis=0, inplace=True)  # drop recipes missing any field
print(df)
Output:
title ... ingredient
0 Polpette di pane e formaggio ... [uovo, pane, pangrattato, parmigiano, latte, s...
1 Torta 7 vasetti alle melanzane ... [uovo, olio, latte, yogurt, farina 00, fecola ...
2 Torta a sole con zucchine e speck ... [pasta sfoglia, zucchina, ricotta, uovo, speck...
3 Pesto di limoni ... [limone, pinoli, parmigiano, basilico, prezzem...
4 Bombe di patate ... [patata, farina 00, uovo, parmigiano, sale e p...
5 Polpettone di zucchine ... [zucchina, uovo, parmigiano, pangrattato, pros...
6 Insalata di pollo ... [petto di pollo, zucchina, pomodorino, insalat...
7 Club sandwich ... [pane, petto di pollo, pomodoro, lattuga, maio...
8 Crostata di verdure ... [farina 00, burro, acqua, sale, zucchina, pomo...
9 Pesto di barbabietola ... [barbabietola, parmigiano, pinoli, olio, sale,...
[10 rows x 3 columns]
I don't know whether you already use these libraries, but that website doesn't use JavaScript to load its data, so we can scrape it with requests and bs4. Most people prefer these libraries when a site doesn't load data via JavaScript: they are simpler and faster than Selenium. For displaying the data I'm using pandas, which is also the preferred library for working with table-like data. It prints the data in a table-like structure, and you can save the scraped data to a csv or excel file as well.
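As a quick sanity check (a minimal sketch, under the assumption that the recipe markup is server-rendered), you can confirm that the post markup is already present in the raw HTML, which means no JavaScript rendering and no Selenium are needed:

import requests

# If the post-title class appears in the raw HTML response,
# the data is server-rendered and requests + bs4 are enough.
html = requests.get("https://ricetta.it/ricette-secondi").text
print("post-title" in html)  # expected: True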
If you want to scrape the data from the following pages as well, try this:
df = {'title': [], 'description': [], 'ingredient': []}
for page_no in range(0, 108):  # the listing spans pages 0-107
    url = f"https://ricetta.it/ricette-secondi?page={page_no}"
    page = requests.get(url)
    soup = bs(page.content, 'lxml')
    for div in soup.find_all("div", class_="post-bordered"):
        df["title"].append(div.find(class_="post-title").text)
        try:
            df["description"].append(div.find(class_="post-excerpt").text)
        except AttributeError:  # this post has no excerpt
            df["description"].append(np.nan)
        ingr = div.find_all(class_="nm-ingr")  # renamed so it no longer shadows the loop variable
        if len(ingr) > 0:
            df["ingredient"].append([j.text for j in ingr])
        else:
            df["ingredient"].append(np.nan)

df = pd.DataFrame(df)  # convert the dict before saving with to_csv/to_excel below
It will scrape all 107 pages of data from that website.
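If you'd rather not hardcode the page count, a minimal sketch (under the assumption that an out-of-range page simply returns no .post-bordered divs) is to keep requesting pages until one comes back empty:

import requests
from bs4 import BeautifulSoup as bs
from itertools import count

for page_no in count(0):  # stop when a page has no posts
    url = f"https://ricetta.it/ricette-secondi?page={page_no}"
    soup = bs(requests.get(url).content, 'lxml')
    posts = soup.find_all("div", class_="post-bordered")
    if not posts:
        break
    # ...append title/description/ingredient exactly as in the loop above...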
You can save this df to a csv or excel file by using:
df.to_csv("<filename.csv>")
# or for excel:
df.to_excel("<filename.xlsx>")
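(Note that to_excel needs an Excel writer package such as openpyxl installed.)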
Edit:
Since you asked how to scrape the links of all recipes as well, I have figured out two approaches. The first is to replace the spaces in each title with - to build the link yourself; the other is to scrape the link from the page, for which you can use this piece of code:
div.find(class_="post-title")["href"]
It will return the link of that recipe. For the other approach you can do this:
df["links"]=df["title"].apply(lambda x: "https://ricetta.it/"+x.replace(" ","-").lower())
#.lower() is just to not make like a random text but it you remove it also it works.
But I personally suggest scraping the link from the website, because when building the links ourselves we may make mistakes.
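For reference, here is a minimal sketch of that approach folded into the loop above (assuming, as the snippet implies, that .post-title is an <a> tag carrying the recipe URL):

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

df = {'title': [], 'links': []}
soup = bs(requests.get("https://ricetta.it/ricette-secondi").content, 'lxml')
for div in soup.find_all("div", class_="post-bordered"):
    a = div.find(class_="post-title")
    df["title"].append(a.text)
    # href may be relative; prepend "https://ricetta.it" if needed
    df["links"].append(a["href"])
print(pd.DataFrame(df))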
Upvotes: 2