Jia

Reputation: 2581

Get complete web page source html with puppeteer - but some part always missing

I am trying to scrape a specific string from the web page below:

https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;

The information I want from this page's source is the serial number inside the strings below (I can find them when I right-click and choose "View page source"):

name="nr_rooms_4377601_232287150_0_1_0" / name="nr_rooms_4377601_232287150_1_1_0"

I am using Puppeteer, and my code is below:

const puppeteer = require('puppeteer');
(async() => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    //await page.goto('https://example.com');
    const response = await page.goto("My-url-above");
    let bodyHTML = await page.evaluate(() => document.body.innerHTML);
    let outbodyHTML = await page.evaluate(() => document.body.outerHTML);
    console.log(await response.text());
    console.log(await page.content());
    await browser.close();
})()

But I cannot find the strings I am looking for in response.text() or page.content().

Am I using the wrong methods on page?

How can I dump the actual page source, exactly as I see it when I right-click and choose "View page source"?

Upvotes: 4

Views: 8193

Answers (2)

Mikhail Zub

Reputation: 474

It seems booking.com is blocking you. I strongly recommend using Puppeteer with the puppeteer-extra and puppeteer-extra-plugin-stealth packages, which make it harder for the website to detect that you are using headless Chromium or a web driver.

After you go to the URL, you also need to wait until the page has loaded:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

const { executablePath } = require("puppeteer");

puppeteer.use(StealthPlugin());

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: ["--no-sandbox", "--disable-setuid-sandbox", "--window-size=1600,900", "--single-process"],
    executablePath: executablePath(),
  });

  const page = await browser.newPage();
  await page.setViewport({
    width: 1280,
    height: 720,
  });
  const url = "https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl";
  await page.goto(url);
  // wait until the element with id="hp_hotel_name" is present in the DOM
  await page.waitForSelector("#hp_hotel_name");

  // now you can do what you want

  await browser.close();
})();

As an alternative, to get all the info about the hotel, you can use the hotels-scraper-js library. Your code would then be:

import { booking } from "hotels-scraper-js";

booking.getHotelInfo("https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html").then((result) => console.dir(result, { depth: null }));

The output will look like:

{
   "title":"Sanadome Nijmegen",
   "type":"Hotel",
   "stars":4,
   "preferredBadge":true,
   "subwayAccess":false,
   "sustainability":"",
   "address":"Weg door Jonkerbos 90, 6532 SZ Nijmegen, Netherlands",
   "highlights":[

   ],
   "description":"You're eligible for a Genius discount at Sanadome Nijmegen!"... and more description,
   "descriptionHighlight":"Couples particularly like the location — they rated it 8.3 for a two-person trip.",
   "descriptionSummary":"Sanadome Nijmegen has been welcoming Booking.com guests since 10 Jun 2010.",
   "facilities":["Indoor swimming pool", "Parking on site",... and more facilities],
   "areaInfo":[
      {
         "What's nearby":[
            {
               "place":"Goffertpark",
               "distance":"650 m"
            },
            ... and more nearby places
         ]
      },
      ... and other area info
   ],
   "link":"https://www.booking.com/hotel/nl/scandic-sanadome-nijmegen.en-gb.html",
   "photos":[
      "https://cf.bstatic.com/xdata/images/hotel/max1024x768/196181914.jpg?k=e37d21c8a403e920b868bcd7845dbca656d772bc114dc10473a76de52afc67bc&o=&hp=1",
      "https://cf.bstatic.com/xdata/images/hotel/max1024x768/225703925.jpg?k=0d4938ca6752057ba607d2fd7fb8cf95cec000770a68738b92ef3b6688e8a62e&o=&hp=1",
      ... and other photos
   ],
   "reviewsInfo":{
      "score":7.8,
      "scoreDescription":"Rated good",
      "totalReviews":823,
      "categoriesRating":[
         {
            "Staff":8.5
         },
         ... and other categories
      ],
      "reviews":[
         {
            "name":"Ewelina",
            "avatar":"https://cf.bstatic.com/static/img/review/avatars/ava-e/8d80ab6bf73fa873e990c76bfc96a1bf23708307.png",
            "country":"Poland",
            "date":"16 February 2023",
            "reting":"10",
            "review":[
               {
                  "liked":"very beautiful surroundings.  I love the peace and quiet around 🥰"
               }
            ]
         },
         ... and other reviews
      ]
   }
}

Upvotes: 0

theDavidBarton

Reputation: 8871

If you investigate where these strings appear, you can see that they are in <select> elements with a specific class (.hprt-nos-select):

<select
  class="hprt-nos-select"
  name="nr_rooms_4377601_232287150_0_1_0"
  data-component="hotel/new-rooms-table/select-rooms"
  data-room-id="4377601"
  data-block-id="4377601_232287150_0_1_0"
  data-is-fflex-selected="0"
  id="hprt_nos_select_4377601_232287150_0_1_0"
  aria-describedby="room_type_id_4377601 rate_price_id_4377601_232287150_0_1_0 rate_policies_id_4377601_232287150_0_1_0"
>

You need to wait until this element is loaded into the DOM; it will then appear in the page source as well:

await page.waitForSelector('.hprt-nos-select', { timeout: 0 });
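Once the selector has resolved, the name attributes can be read straight from the DOM instead of searching the dumped HTML by hand. The following is only a minimal sketch and not part of the tested answer: the function names getRoomSelectNames and extractRoomSelectNames are hypothetical, and the regular expression assumes the names always follow the nr_rooms_ plus digits-and-underscores pattern shown in the question.

```javascript
// Hypothetical helper: read the name attribute of every room <select>.
// Expects a Puppeteer `page` that has already navigated to the hotel URL.
async function getRoomSelectNames(page) {
  // Waits with Puppeteer's default timeout; the selector comes from the markup above.
  await page.waitForSelector(".hprt-nos-select");
  // $$eval runs the callback in the browser against all matching elements.
  return page.$$eval(".hprt-nos-select", (elements) =>
    elements.map((element) => element.getAttribute("name"))
  );
}

// Hypothetical fallback: pull the same names out of an HTML string,
// for example the result of `await page.content()`.
// Assumption: names look like nr_rooms_<digits and underscores>.
function extractRoomSelectNames(html) {
  const pattern = /name="(nr_rooms_[0-9_]+)"/g;
  return [...html.matchAll(pattern)].map((match) => match[1]);
}
```

If either function returns an empty array, the room table was most likely never rendered, which points to the URL parameter problem described next.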

But your actual issue is that the URL you are visiting has some extra URL parameters, ?checkin=2020-09-19;checkout=2020-09-20;i_am_from=nl;, which are not taken into account when the page is loaded through Puppeteer. If you take a full-page screenshot, you will see that the page still shows the default hotel search form without the specific hotel offers you are expecting.
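A full-page screenshot is a quick way to check what Puppeteer actually rendered. This is a small sketch under assumptions: the function name saveFullPageScreenshot and the default file name booking-debug.png are made up for illustration, while page.screenshot() with the path and fullPage options is the standard Puppeteer call.

```javascript
// Hypothetical helper: save a screenshot of the whole scrollable page
// so you can see whether the room table or only the search form was rendered.
async function saveFullPageScreenshot(page, path = "booking-debug.png") {
  // fullPage: true captures the entire page, not just the viewport.
  await page.screenshot({ path, fullPage: true });
  return path;
}
```

You would call it right after page.goto(url), for example await saveFullPageScreenshot(page), and then open the image to compare it with what you see in a regular browser.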

You should interact with the search form through Puppeteer (page.click() etc.) to set the dates and the country of origin yourself, so that you get the expected page content.

Upvotes: 2
