Zak123

Reputation: 443

Puppeteer, block window.location and return the page content?

I'm trying to get the full content of a page with Puppeteer. This works fine with normal pages, but if the page does a window.location redirect, I want to block that redirect and just get the original content.

For example, if https://example.com/thisredirects returns:

<html>
<body>
<p>Page not found - Please wait while we redirect you home...</p>
<script type="text/javascript" language="javascript">
   window.location = "//example.com";
</script>
</body>
</html>

I want to get that HTML and block the location redirect. If I try to block/abort the location change with setRequestInterception, the response returned by page.goto is null and the redirect isn't fully blocked (this approach works for a redirect status code, but not for a page that returns 200 and then redirects with window.location):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const pageUrl = "https://example.com/thisredirects";

  const page = await browser.newPage();
  await page.setCacheEnabled(false);
  await page.setRequestInterception(true);

  const requests = [];
  page.on('request', async request => {
    let isNavRequest = request.isNavigationRequest() && request.frame() === page.mainFrame();
    if (!isNavRequest) {
      request.continue();
      return;
    }
    requests.push(request);
    if (requests.length == 1) {
      console.log("Load initial page: " + request.url());
      request.continue();
      return;
    }
    console.log("Block redirect to: " + request.url());
    request.abort();
  });

  let response;
  try {
    console.log(`Request: ${pageUrl}`);
    response = await page.goto(pageUrl, { waitUntil: 'domcontentloaded' });
    const content = await response.text();
    console.log(content);
    await page.close();
    await browser.close();
  }
  catch (err) {
    console.log(err);
  }
})()

Is there a way to block the window.location and get the original HTML (as above) without completely disabling javascript?

Even if I listen to all responses:

  page.on('response', async response => {
    // response.ok is a method in Puppeteer, so it has to be called
    if (response.ok() && response.url() === pageUrl) {
      console.log(await response.text());
    }
  });

It still can't get the original HTML. It throws: "Could not load body for this request. This might happen if the request is a preflight request."

Upvotes: 5

Views: 1601

Answers (2)

Zak123

Reputation: 443

I didn't realise that passing a different error code ('aborted') to request.abort still lets you access the previous requests. With that, I was able to read the text of the original response:

const page = await browser.newPage();
await page.setCacheEnabled(false);
await page.setRequestInterception(true);

const requests = [];
let redirectBlocked = false;

page.on('request', async request => {
    let isNavRequest = request.isNavigationRequest() && request.frame() === page.mainFrame();
    if (!isNavRequest) {
        request.continue();
        return;
    }

    requests.push(request);
    if (requests.length == 1) {
        request.continue();
        return;
    }

    // *snip* more here to detect legitimate redirects...

    redirectBlocked = true;
    request.abort('aborted');

    let originalResponse = await requests[0].response();
    console.log(await originalResponse.text());
});

const response = await page.goto(pageUrl, { waitUntil: 'domcontentloaded' });

if (!redirectBlocked) console.log(await response.text());
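
The decision made inside the request handler above can also be pulled out into a small pure function, which makes it possible to check the logic without launching a browser. This is only a sketch: createRedirectGate and the 'continue' / 'abort' return values are hypothetical names, not part of Puppeteer. The only Puppeteer calls it assumes are request.isNavigationRequest() and request.frame(), as used above, and it does not cover the "legitimate redirects" case that was snipped.

```javascript
// Hypothetical helper (not a Puppeteer API): returns a function that decides
// whether an intercepted request should be continued or aborted.
function createRedirectGate(mainFrame) {
  let mainFrameNavigations = 0;

  return function decide(request) {
    const isMainNav =
      request.isNavigationRequest() && request.frame() === mainFrame;

    // Sub-resources and navigations in other frames are never blocked.
    if (!isMainNav) return 'continue';

    // The first main-frame navigation is the page that was requested;
    // any later one is treated as a redirect and blocked.
    mainFrameNavigations += 1;
    return mainFrameNavigations === 1 ? 'continue' : 'abort';
  };
}

// Stand-in request objects so the logic can be exercised without a browser.
const fakeFrame = {};
const fakeRequest = (isNav, frame) => ({
  isNavigationRequest: () => isNav,
  frame: () => frame,
});

const decide = createRedirectGate(fakeFrame);
const decisions = [
  decide(fakeRequest(true, fakeFrame)),  // initial page load
  decide(fakeRequest(false, fakeFrame)), // e.g. an image or a script
  decide(fakeRequest(true, {})),         // navigation in another frame
  decide(fakeRequest(true, fakeFrame)),  // the window.location redirect
];
console.log(decisions.join(', '));
```

In the real handler this would be created once with page.mainFrame(), and the handler would call request.abort('aborted') when decide(request) returns 'abort' and request.continue() otherwise.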

Upvotes: 1

theDavidBarton

Reputation: 8851

@GrafiCode's hint about page.setJavaScriptEnabled(false) is a good one: later on, you can turn JavaScript back on by setting its value to true.

To work around the problem, you can follow these steps:

  1. Disable JavaScript so that window.location isn't reassigned immediately
  2. Navigate to the (dysfunctional) page
  3. Remove the <script> tags that try to manipulate the location (the page.$$eval or page.evaluate Puppeteer methods can be used to execute Element.remove())
  4. Save the cleaned, redirect-free markup (page.content)
  5. Enable JavaScript again
  6. Set the saved HTML on the page (page.setContent)
  7. You won't be able to access response.text() the same way you tried above (setContent doesn't return a response like goto does), but you can use page.$eval to read the innerText of the <body>

const page = await browser.newPage()
await page.setJavaScriptEnabled(false)
await page.goto(pageUrl)

await page.$$eval('script', scripts =>
  scripts.forEach(src => {
    if (src.innerHTML.includes('window.location')) src.remove()
  })
)

const html = await page.content()
await page.setJavaScriptEnabled(true)
await page.setContent(html)

const text = await page.$eval('body', el => el.innerText)
console.log(text)

Output (the content of the <p>):

Page not found - Please wait while we redirect you home...
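
The includes('window.location') check above only catches scripts that spell the redirect exactly that way; it misses forms such as location.href = ... or location.replace(...). A slightly broader check could look like the sketch below. looksLikeLocationRedirect and REDIRECT_PATTERN are hypothetical names, and the regular expression is a heuristic rather than a parser: it will miss obfuscated code and may flag assignments to unrelated properties that happen to be called location.

```javascript
// Hypothetical helper: heuristic test for inline scripts that navigate away.
// Matches assignments to location / location.href and calls to
// location.assign() / location.replace(). Comparisons (==, ===) and
// arrow functions (=>) are not treated as assignments.
const REDIRECT_PATTERN =
  /\b(?:(?:window|document|self|top)\.)?location(?:\.href)?\s*=(?![=>])|\blocation\.(?:assign|replace)\s*\(/;

function looksLikeLocationRedirect(scriptText) {
  return REDIRECT_PATTERN.test(scriptText);
}

// A few inline-script bodies to try the heuristic on.
const samples = [
  'window.location = "//example.com";',
  'location.replace("/home");',
  'document.location.href="/x";',
  'if (window.location == "/a") {}',
  'console.log(window.location.pathname);',
];
for (const sample of samples) {
  console.log(looksLikeLocationRedirect(sample), sample);
}
```

Because the callback passed to page.$$eval runs inside the page, it cannot close over this helper. One option is to pass REDIRECT_PATTERN.source as an extra string argument to page.$$eval and rebuild the RegExp inside the callback.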

Upvotes: 1
