Reputation: 1
Hi, I am trying to scrape some data from a website with Cheerio. It's a lunch menu for Monday to Friday.
I found a way to scrape the menu for Wednesday to Friday, but I am struggling with Monday and Tuesday.
The site is not very structured, but maybe someone can give me a clue.
Here is the HTML:
<div class="w-full flex-1">
  <div class="w-full relative text-base">
    <div class="mb-5 last:mb-0">
      <div class="flex flex-row mt-8 relative items-baseline print-avoid-inside-break">
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong>Mandag</strong></div>
      </div>
      <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"> </div>
      <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Kylling club burger - karry mayo - tomat - agurk - salat - løg</div>
      <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Chicken club burger - curry mayo - tomato - cucumber - salad - onion</div>
      <div class="flex flex-row mt-8 relative items-baseline print-avoid-inside-break">
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong><br></strong></div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong>Tirsdag</strong></div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong><br></strong></div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">
          <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Stegte nudler - gris - grønt - koriander - chili - soya - sweet chili</div>
          <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Fried noodles - pork - vegetables - coriander - chili - soya - sweet chili</div>
        </div>
      </div>
      <div class="flex flex-row mt-8 relative items-baseline print-avoid-inside-break">
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong><strong>Onsdag</strong></strong></div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"> </div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Stegt fisk - sauce tartar - ratatouille - rosmarin kartofler</div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Fried fish - sauce tartar - ratatouille - rosemary potatoes</div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"> </div>
      </div>
      <div class="flex flex-row mt-8 relative items-baseline print-avoid-inside-break">
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong>Torsdag</strong></div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"> </div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Pariserbøf med tilbehør</div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Steak a la paris - with sides</div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"> </div>
      </div>
      <div class="flex flex-row mt-8 relative items-baseline print-avoid-inside-break">
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong>Fredag</strong></div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong><span data-slate-fragment="JTVCJTdCJTIydHlwZSUyMiUzQSUyMnBhcmFncmFwaCUyMiUyQyUyMmNoaWxkcmVuJTIyJTNBJTVCJTdCJTIydGV4dCUyMiUzQSUyMk1lZGFsam9uJTIwbWVkJTIwYmFjb24lMkMlMjBmbCVDMyVCOGRla2FydG9mbGVyJTJDJTIwYmFndCUyMHRvbWF0JTIwb2clMjBzYWxhdCUyMG1lZCUyMHJldmV0JTIwY2l0cm9uc2thbCUyMiU3RCU1RCU3RCU1RA=="><br></span></strong></div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Frankfurter og spareribs på grill (udenfor) Bagekartofler m. creme fraiche dressing</div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">"Sausages and ribs on the grill (outside) baked potatoes with sour creme dressing</div>
        <div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"> </div>
      </div>
    </div>
  </div>
</div>
My TypeScript code:
A little Danish translation :-) Monday=Mandag, Tuesday=Tirsdag, Wednesday=Onsdag, Thursday=Torsdag, Friday=Fredag
AxiosInstance.get(url)
  .then((response) => {
    let dkMenuMandag: string;
    let usMenuMandag: string;
    let dkMenuTirsdag: string;
    let usMenuTirsdag: string;
    let dkMenuOnsdag: string;
    let usMenuOnsdag: string;
    let dkMenuTorsdag: string;
    let usMenuTorsdag: string;
    let dkMenuFredag: string;
    let usMenuFredag: string;
    const html = response.data;
    const $ = cheerio.load(html);
    let MenuTableRows = $(".flex");
    //console.log(MenuTableRows);
    const menu: MenuData[] = [];
    MenuTableRows.each((i, elem) => {
      //const weekDay: string = $(elem).find("div:nth-child(1) > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim();
      //if Monday.....
      // if($(elem).find("div:nth-child(1) > div > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Mandag")
      // {
      //   console.log("Mandag found!")
      //   dkMenuMandag = $(elem).find("div:nth-child(5) > div:nth-child(4) > div:nth-child(1)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
      //   usMenuMandag = $(elem).find("div:nth-child(5) > div:nth-child(4) > div:nth-child(2)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
      //   console.log(dkMenuOnsdag +"\r\n"+usMenuOnsdag)
      // }
      // if($(elem).find("div:nth-child(5) > div:nth-child(2) > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Tirsdag")
      // {
      //   console.log("Tirsdag found!")
      //   dkMenuMandag = $(elem).find("div:nth-child(5) > div:nth-child(4) > div:nth-child(1)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
      //   usMenuMandag = $(elem).find("div:nth-child(5) > div:nth-child(4) > div:nth-child(2)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
      //   console.log(dkMenuOnsdag +"\r\n"+usMenuOnsdag)
      // }
      if($(elem).find("div:nth-child(6) > div:nth-child(1) > strong > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Onsdag")
      {
        //Wednesday found
        console.log("Onsdag found!")
        dkMenuOnsdag = $(elem).find("div:nth-child(3)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
        usMenuOnsdag = $(elem).find("div:nth-child(4)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
        console.log(dkMenuOnsdag +"\r\n"+usMenuOnsdag)
      }
      if($(elem).find("div:nth-child(1) > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Torsdag")
      {
        //Thursday found
        console.log("Torsdag found!")
        dkMenuTorsdag = $(elem).find("div:nth-child(3)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
        usMenuTorsdag = $(elem).find("div:nth-child(4)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
        console.log(dkMenuTorsdag +"\r\n"+usMenuTorsdag)
      }
      if($(elem).find("div:nth-child(1) > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Fredag")
      {
        //Friday found
        console.log("Fredag found!")
        dkMenuFredag = $(elem).find("div:nth-child(3)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
        usMenuFredag = $(elem).find("div:nth-child(4)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
        console.log(dkMenuFredag +"\r\n"+usMenuFredag)
      }
    });
    //This part doesn't work!
    MenuTableRows = $(".bulletin");
    MenuTableRows.each((i, elem) => {
      if($(elem).find("div:nth-child(1) > div > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Mandag")
      {
        //Monday found
        console.log("Mandag found!")
        dkMenuMandag = $(elem).find("div:nth-child(4) > div:nth-child(1)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
        usMenuMandag = $(elem).find("div:nth-child(4) > div:nth-child(2)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
        console.log(dkMenuMandag +"\r\n"+usMenuMandag)
      }
      if($(elem).find("div:nth-child(1) > div > div > div > div:nth-child(5) > div:nth-child(2) > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Tirsdag")
      {
        //Tuesday found
        console.log("Tirsdag found!")
        dkMenuTirsdag = $(elem).find("div:nth-child(5) > div:nth-child(4) > div:nth-child(1)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
        usMenuTirsdag = $(elem).find("div:nth-child(5) > div:nth-child(4) > div:nth-child(2)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
        console.log(dkMenuTirsdag +"\r\n"+usMenuTirsdag)
      }
    });
  })
  .catch(console.error);
I have tried a lot of queries, and I used Chrome to copy the JS path for each element (see below).
Monday:
DayTxt: document.querySelector("body > div.wrapper > div > div.main-container.col1-layout > div > div.col-main > div > div.bulletin > div:nth-child(1) > div > div > div > div:nth-child(1) > div > strong")
DkMenuTxt: document.querySelector("body > div.wrapper > div > div.main-container.col1-layout > div > div.col-main > div > div.bulletin > div:nth-child(1) > div > div > div > div:nth-child(3)")
UsMenuTxt: document.querySelector("body > div.wrapper > div > div.main-container.col1-layout > div > div.col-main > div > div.bulletin > div:nth-child(1) > div > div > div > div:nth-child(4)")
Tuesday:
DayTxt: document.querySelector("body > div.wrapper > div > div.main-container.col1-layout > div > div.col-main > div > div.bulletin > div:nth-child(1) > div > div > div > div:nth-child(5) > div:nth-child(2) > strong")
DkMenuTxt: document.querySelector("body > div.wrapper > div > div.main-container.col1-layout > div > div.col-main > div > div.bulletin > div:nth-child(1) > div > div > div > div:nth-child(5) > div:nth-child(4) > div:nth-child(1)")
UsMenuTxt: document.querySelector("body > div.wrapper > div > div.main-container.col1-layout > div > div.col-main > div > div.bulletin > div:nth-child(1) > div > div > div > div:nth-child(5) > div:nth-child(4) > div:nth-child(2)")
Upvotes: 0
Views: 45
Reputation: 56855
The app is a single-page application, so the data is loaded asynchronously after the page loads and is injected into the page with JavaScript. Look at view-source: to see what axios pulls down (the data isn't there). You can extract data from network responses in JSON format, but it seems this particular site does a good deal of processing and localization before rendering that data, so I'd start by using Puppeteer to scrape it.
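If you'd rather confirm this from code than from view-source:, a rough check is to see whether any of the day names appear in the raw HTML string at all. This is only a sketch: `findDaysInStaticHtml` is a hypothetical helper (not part of axios or Cheerio), and the two sample strings are hand-written stand-ins, not real responses from the site.

```typescript
// Danish weekday names, as listed in the question.
const DAY_NAMES: string[] = ["Mandag", "Tirsdag", "Onsdag", "Torsdag", "Fredag"];

// Hypothetical helper: returns the weekday names that occur in the raw HTML string.
// An empty result suggests the menu is injected client-side after the page loads.
function findDaysInStaticHtml(html: string): string[] {
  return DAY_NAMES.filter((day) => html.indexOf(day) !== -1);
}

// Hand-written stand-ins for "what axios receives" and "what the browser renders".
const emptyShell = '<div id="app"></div><script src="bundle.js"></script>';
const rendered =
  "<div><strong>Mandag</strong></div><div><strong>Tirsdag</strong></div>";

console.log(findDaysInStaticHtml(emptyShell)); // []
console.log(findDaysInStaticHtml(rendered)); // [ 'Mandag', 'Tirsdag' ]
```

With your existing code you would pass `response.data` to the helper. A non-empty result isn't proof that the markup is usable (the names could sit inside a script or JSON blob), but an empty one is a strong hint that Cheerio alone won't see the menu.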
Avoid long, browser-generated selectors if possible. These are very brittle. The app uses Tailwind for styling, which makes scraping difficult as there are a lot of repeated styles and generic classes. But we can do better than long, hardcoded div chains.
Here's a first attempt to get you started:
const puppeteer = require("puppeteer"); // ^22.10.0
const url = "<Your URL>";
let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const wrapper = await page.waitForSelector(".last\\:mb-0");
  const data = await wrapper.evaluate(el =>
    [...el.querySelectorAll(":scope > div")].map(e => ({
      day: e.querySelector("div").textContent.trim(),
      dishes: [...e.querySelectorAll("table tr")].map(e => ({
        type: e.querySelector("td").textContent.trim(),
        food: [...e.querySelectorAll("td > p > span")].map(e =>
          e.textContent.trim()
        ),
        allergens: [...e.querySelectorAll("td div.text-left")]
          .map(e => e.textContent.trim())
          .filter(e => e),
      })),
    }))
  );
  console.log(JSON.stringify(data, null, 2));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Output:
[
  {
    "day": "Mandag",
    "dishes": [
      {
        "type": "Dagens ret",
        "food": [
          "Pasta carbonara - pancetta - ost - salat",
          "Pasta carbonara - pancetta - cheese - salad"
        ],
        "allergens": [
          "Æg",
          "Gris",
          "Gluten",
          "Hvede"
        ]
      },
      {
        "type": "Dagens vegetar ret",
        "food": [
          "Pasta - svampe - creme - squash - ost - salat",
          "Pasta - mushrooms - cream - squash - cheese - salad"
        ],
        "allergens": [
          "Æg",
          "Vegetar",
          "Gluten",
          "Hvede"
        ]
      }
    ]
  },
  // ...
]
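If the markup you end up with is as irregular as the snippet in the question (day headings at different nesting depths), another way to avoid positional selectors is to ignore nesting entirely: collect the text of the elements in document order, then start a new group every time you hit a day name. The sketch below shows only the grouping step. It assumes you already have the texts as an array of strings; `DayMenu`, `WEEKDAYS` and `groupByDay` are hypothetical names, and the sample input is copied from the question's HTML.

```typescript
interface DayMenu {
  day: string;
  dishes: string[];
}

// Danish weekday names used as section headings on the page.
const WEEKDAYS: string[] = ["Mandag", "Tirsdag", "Onsdag", "Torsdag", "Fredag"];

// Groups a flat, document-ordered list of texts under the most recent day heading.
// Blank entries are skipped, and any text before the first heading is ignored.
function groupByDay(texts: string[]): DayMenu[] {
  const menus: DayMenu[] = [];
  let current: DayMenu | undefined;
  for (const raw of texts) {
    // Collapse runs of whitespace (including line breaks) and trim the ends.
    const text = raw.replace(/\s+/g, " ").trim();
    if (text === "") continue;
    if (WEEKDAYS.indexOf(text) !== -1) {
      current = { day: text, dishes: [] };
      menus.push(current);
    } else if (current) {
      current.dishes.push(text);
    }
  }
  return menus;
}

// Sample input taken from the Monday and Tuesday parts of the question's HTML.
const sampleTexts: string[] = [
  "Mandag",
  " ",
  "Kylling club burger - karry mayo - tomat - agurk - salat - løg",
  "Chicken club burger - curry mayo - tomato - cucumber - salad - onion",
  "",
  "Tirsdag",
  "",
  "Stegte nudler - gris - grønt - koriander - chili - soya - sweet chili",
  "Fried noodles - pork - vegetables - coriander - chili - soya - sweet chili",
];

console.log(JSON.stringify(groupByDay(sampleTexts), null, 2));
```

The trade-off is that this relies on the day names being spelled exactly as listed and appearing as their own text nodes, so treat it as a fallback rather than a replacement for structural selectors when the structure is reliable.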
Upvotes: 0