Tue M

Reputation: 1

Cheerio problem scraping specific html data elements

Hi, I am trying to scrape some data from a website with Cheerio. It's a menu for Monday to Friday.

I found a way to scrape the menu for Wednesday to Friday, but I am struggling with Monday and Tuesday.

The site is not very structured, but maybe someone can give me a clue.

Here is the HTML:

<div class="w-full flex-1">
<div class="w-full relative text-base">
<div class="mb-5 last:mb-0">
<div class="flex flex-row mt-8 relative items-baseline print-avoid-inside-break">
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong>Mandag</strong></div>
</div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">&nbsp;</div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Kylling club burger - karry mayo - tomat - agurk - salat - løg</div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Chicken club burger - curry mayo - tomato - cucumber - salad - onion</div>
<div class="flex flex-row mt-8 relative items-baseline print-avoid-inside-break">
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong><br></strong></div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong>Tirsdag</strong></div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong><br></strong></div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Stegte nudler - gris - grønt - koriander - chili - soya - sweet chili</div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Fried noodles - pork - vegetables - coriander - chili - soya - sweet chili</div>
</div>
</div>
<div class="flex flex-row mt-8 relative items-baseline print-avoid-inside-break">
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong><strong>Onsdag</strong></strong></div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">&nbsp;</div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Stegt fisk - sauce tartar - ratatouille - rosmarin kartofler</div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Fried fish - sauce tartar - ratatouille - rosemary potatoes</div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">&nbsp;</div>
</div>
<div class="flex flex-row mt-8 relative items-baseline print-avoid-inside-break">
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong>Torsdag</strong></div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">&nbsp;</div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Pariserbøf med tilbehør</div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Steak a la paris - with sides</div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">&nbsp;</div>
</div>
<div class="flex flex-row mt-8 relative items-baseline print-avoid-inside-break">
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong>Fredag</strong></div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap"><strong><span data-slate-fragment="JTVCJTdCJTIydHlwZSUyMiUzQSUyMnBhcmFncmFwaCUyMiUyQyUyMmNoaWxkcmVuJTIyJTNBJTVCJTdCJTIydGV4dCUyMiUzQSUyMk1lZGFsam9uJTIwbWVkJTIwYmFjb24lMkMlMjBmbCVDMyVCOGRla2FydG9mbGVyJTJDJTIwYmFndCUyMHRvbWF0JTIwb2clMjBzYWxhdCUyMG1lZCUyMHJldmV0JTIwY2l0cm9uc2thbCUyMiU3RCU1RCU3RCU1RA=="><br></span></strong></div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">Frankfurter og spareribs på grill (udenfor) Bagekartofler m. creme fraiche dressing</div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">"Sausages and ribs on the grill (outside) baked potatoes with sour creme dressing</div>
<div class="w-24 font-semibold flex-shrink-0 whitespace-nowrap">&nbsp;</div>
</div>
</div>
</div>
</div>

My TypeScript code:

A little Danish translation :-) Monday=Mandag, Tuesday=Tirsdag, Wednesday=Onsdag, Thursday=Torsdag, Friday=Fredag

  AxiosInstance.get(url)
  .then((response) => {
    
    let dkMenuMandag: string;
    let usMenuMandag: string;
    let dkMenuTirsdag: string;
    let usMenuTirsdag: string;
    let dkMenuOnsdag: string;
    let usMenuOnsdag: string; 
    let dkMenuTorsdag: string;
    let usMenuTorsdag: string; 
    let dkMenuFredag: string;
    let usMenuFredag: string; 

    const html = response.data;
    const $ = cheerio.load(html);
    let MenuTableRows = $(".flex");
    //console.log(MenuTableRows);
    const menu: MenuData[] = [];
    MenuTableRows.each((i, elem) => {
      //const weekDay: string = $(elem).find("div:nth-child(1) > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim();
      //if Monday.....
      // if($(elem).find("div:nth-child(1) > div > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Mandag")
      //   {
      //     console.log("Mandag found!")
      //     dkMenuMandag = $(elem).find("div:nth-child(5) > div:nth-child(4) > div:nth-child(1)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
      //     usMenuMandag = $(elem).find("div:nth-child(5) > div:nth-child(4) > div:nth-child(2)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
      //     console.log(dkMenuOnsdag +"\r\n"+usMenuOnsdag)
      //   }


      // if($(elem).find("div:nth-child(5) > div:nth-child(2) > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Tirsdag")
      //   {
      //     console.log("Tirsdag found!")
      //     dkMenuMandag = $(elem).find("div:nth-child(5) > div:nth-child(4) > div:nth-child(1)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
      //     usMenuMandag = $(elem).find("div:nth-child(5) > div:nth-child(4) > div:nth-child(2)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
      //     console.log(dkMenuOnsdag +"\r\n"+usMenuOnsdag)
      //   }

      if($(elem).find("div:nth-child(6) > div:nth-child(1) > strong > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Onsdag")
      {
        //Wednesday found
        console.log("Onsdag found!")
        dkMenuOnsdag = $(elem).find("div:nth-child(3)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
        usMenuOnsdag = $(elem).find("div:nth-child(4)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
        console.log(dkMenuOnsdag +"\r\n"+usMenuOnsdag)
      }

      if($(elem).find("div:nth-child(1) > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Torsdag")
        {
          //Thursday found
          console.log("Torsdag found!")
          dkMenuTorsdag = $(elem).find("div:nth-child(3)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
          usMenuTorsdag = $(elem).find("div:nth-child(4)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
          console.log(dkMenuTorsdag +"\r\n"+usMenuTorsdag)
        }

       if($(elem).find("div:nth-child(1) > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Fredag")
          {
            //Friday found
            console.log("Fredag found!")
            dkMenuFredag = $(elem).find("div:nth-child(3)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
            usMenuFredag = $(elem).find("div:nth-child(4)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
            console.log(dkMenuFredag +"\r\n"+usMenuFredag)
          }
    });



    // This part doesn't work!

    MenuTableRows = $(".bulletin");
    MenuTableRows.each((i, elem) => {
      if($(elem).find("div:nth-child(1) > div > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Mandag")
        {
          //Monday found
          console.log("Mandag found!")
          dkMenuMandag = $(elem).find("div:nth-child(4) > div:nth-child(1)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
          usMenuMandag = $(elem).find("div:nth-child(4) > div:nth-child(2)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
          console.log(dkMenuMandag +"\r\n"+usMenuMandag)
        }


      if($(elem).find("div:nth-child(1) > div > div > div > div:nth-child(5) > div:nth-child(2) > strong").text().replace(/(\r\n|\n|\r)/gm, "").trim() === "Tirsdag")
        {
          //Tuesday found
          console.log("Tirsdag found!")
          dkMenuTirsdag = $(elem).find("div:nth-child(5) > div:nth-child(4) > div:nth-child(1)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
          usMenuTirsdag = $(elem).find("div:nth-child(5) > div:nth-child(4) > div:nth-child(2)").text().replace(/(\r\n|\n|\r)/gm, "").trim();
          console.log(dkMenuTirsdag +"\r\n"+usMenuTirsdag)
        }

    });
  })
  .catch(console.error);

I have tried a lot of queries and used Chrome to copy the JS path (see below):

Monday:
DayTxt: document.querySelector("body > div.wrapper > div > div.main-container.col1-layout > div > div.col-main > div > div.bulletin > div:nth-child(1) > div > div > div > div:nth-child(1) > div > strong")
DkMenuTxt: document.querySelector("body > div.wrapper > div > div.main-container.col1-layout > div > div.col-main > div > div.bulletin > div:nth-child(1) > div > div > div > div:nth-child(3)")
UsMenuTxt: document.querySelector("body > div.wrapper > div > div.main-container.col1-layout > div > div.col-main > div > div.bulletin > div:nth-child(1) > div > div > div > div:nth-child(4)")

Tuesday:
DayTxt: document.querySelector("body > div.wrapper > div > div.main-container.col1-layout > div > div.col-main > div > div.bulletin > div:nth-child(1) > div > div > div > div:nth-child(5) > div:nth-child(2) > strong")
DkMenuTxt: document.querySelector("body > div.wrapper > div > div.main-container.col1-layout > div > div.col-main > div > div.bulletin > div:nth-child(1) > div > div > div > div:nth-child(5) > div:nth-child(4) > div:nth-child(1)")
UsMenuTxt: document.querySelector("body > div.wrapper > div > div.main-container.col1-layout > div > div.col-main > div > div.bulletin > div:nth-child(1) > div > div > div > div:nth-child(5) > div:nth-child(4) > div:nth-child(2)")

Upvotes: 0

Views: 45

Answers (1)

ggorlen

Reputation: 56855

The app is a single page application, so data is loaded asynchronously after the page load and injected into the page with JavaScript. Look at view-source: to see what axios pulls down (the data isn't there). You can extract data from network responses in JSON format, but it seems this particular site does a good deal of processing and localization before rendering that data, so I'd start by using Puppeteer to scrape it.
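
A quick way to confirm this without opening the browser is to check whether the text you expect is present in the raw HTML at all. This is only a rough sketch: `findMissing` is a hypothetical helper, and `shellHtml` is a made-up stand-in for what a single page application typically sends. In practice you'd pass `response.data` from the axios call in the question.

```javascript
// Hypothetical helper: report which expected strings are absent from raw HTML.
// If every day name is missing, the menu is most likely rendered client-side.
function findMissing(html, expected) {
  const haystack = html.toLowerCase();
  return expected.filter(text => !haystack.includes(text.toLowerCase()));
}

const days = ["Mandag", "Tirsdag", "Onsdag", "Torsdag", "Fredag"];

// Made-up stand-in for a server response that only ships an empty app shell.
const shellHtml =
  '<html><body><div id="app"></div><script src="/app.js"></script></body></html>';

// Logs the day names that are not present in the static HTML.
console.log(findMissing(shellHtml, days));
```

If the list comes back containing all five day names, Cheerio has nothing to parse and the selectors aren't the problem.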

Avoid long, browser-generated selectors if possible. These are very brittle. The app uses Tailwind for styling, which makes scraping difficult as there are a lot of repeated styles and generic classes. But we can do better than long, hardcoded div chains.
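
One alternative to positional selectors is to key off the day names themselves. The sketch below assumes you have already collected the text of each leaf element in document order from the rendered page; how you collect it is left out. `groupByDay` and `DAYS` are hypothetical names, and the sample lines are copied from the HTML in the question.

```javascript
// Hypothetical helper: group a flat, in-order list of text lines by weekday.
// Assumes each day name appears on its own line before that day's dishes.
const DAYS = ["Mandag", "Tirsdag", "Onsdag", "Torsdag", "Fredag"];

function groupByDay(lines) {
  const menu = {};
  let current = null;
  for (const raw of lines) {
    // Normalize non-breaking spaces, then trim surrounding whitespace.
    const text = raw.replace(/\u00a0/g, " ").trim();
    if (!text) continue; // skip spacer rows such as &nbsp; or <br>
    if (DAYS.includes(text)) {
      current = text;
      menu[current] = [];
    } else if (current) {
      menu[current].push(text);
    }
  }
  return menu;
}

// Sample input taken from the question's HTML, in document order.
const lines = [
  "Mandag",
  "\u00a0",
  "Kylling club burger - karry mayo - tomat - agurk - salat - løg",
  "Chicken club burger - curry mayo - tomato - cucumber - salad - onion",
  "",
  "Tirsdag",
  "",
  "Stegte nudler - gris - grønt - koriander - chili - soya - sweet chili",
  "Fried noodles - pork - vegetables - coriander - chili - soya - sweet chili",
];

console.log(groupByDay(lines));
```

Because the grouping depends only on the order of the text, it doesn't care that Monday's dishes sit outside their row and Tuesday's are nested one level deeper.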

Here's a first attempt to get you started:

const puppeteer = require("puppeteer"); // ^22.10.0

const url = "<Your URL>";

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const wrapper = await page.waitForSelector(".last\\:mb-0");
  const data = await wrapper.evaluate(el =>
    [...el.querySelectorAll(":scope > div")].map(e => ({
      day: e.querySelector("div").textContent.trim(),
      dishes: [...e.querySelectorAll("table tr")].map(e => ({
        type: e.querySelector("td").textContent.trim(),
        food: [...e.querySelectorAll("td > p > span")].map(e =>
          e.textContent.trim()
        ),
        allergens: [...e.querySelectorAll("td div.text-left")]
          .map(e => e.textContent.trim())
          .filter(e => e),
      })),
    }))
  );
  console.log(JSON.stringify(data, null, 2));
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Output:

[
  {
    "day": "Mandag",
    "dishes": [
      {
        "type": "Dagens ret",
        "food": [
          "Pasta carbonara - pancetta - ost - salat",
          "Pasta carbonara - pancetta - cheese - salad"
        ],
        "allergens": [
          "Æg",
          "Gris",
          "Gluten",
          "Hvede"
        ]
      },
      {
        "type": "Dagens vegetar ret",
        "food": [
          "Pasta - svampe - creme - squash - ost - salat",
          "Pasta - mushrooms - cream - squash - cheese - salad"
        ],
        "allergens": [
          "Æg",
          "Vegetar",
          "Gluten",
          "Hvede"
        ]
      }
    ]
  },
  // ...
]
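
If you want the Danish/English pairs from the question rather than the nested structure, you could flatten the result afterwards. This is a sketch only: `toBilingualRows` is a hypothetical helper, and it assumes `food[0]` is always the Danish text and `food[1]` the English text, as in the output above.

```javascript
// Hypothetical helper: flatten the scraped structure into one row per dish.
// Assumes food[0] is the Danish text and food[1] the English text.
function toBilingualRows(data) {
  return data.flatMap(({day, dishes}) =>
    dishes.map(({type, food}) => ({
      day,
      type,
      dk: food[0] ?? "",
      us: food[1] ?? "",
    }))
  );
}

// Sample entry copied from the output above.
const sample = [
  {
    day: "Mandag",
    dishes: [
      {
        type: "Dagens ret",
        food: [
          "Pasta carbonara - pancetta - ost - salat",
          "Pasta carbonara - pancetta - cheese - salad",
        ],
        allergens: ["Æg", "Gris", "Gluten", "Hvede"],
      },
    ],
  },
];

console.log(toBilingualRows(sample));
```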

Upvotes: 0
