Ant

Reputation: 1143

I'm requesting html content from a site with axios in JS but the website is blocking my request

I want my script to pull the HTML from a site, but the site detects that my script is a bot and returns a page with an 'I am not a robot' test instead.

Instead of returning the content of the site, it returns a page that partly reads:

"As you were browsing, something about your browser made us think you were a bot."

My code is...

const axios = require('axios');

const url = "https://www.bhgre.com/Better-Homes-and-Gardens-Real-Estate-Blu-Realty-49231c/Brady-Johnson-7469865a";
axios(url, {headers: {
  'Mozilla': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.3 Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/43.4.0',
}})
.then(response => {
  const html = response.data;
  console.log(html)
})
.catch(console.error);

I've tried a few different headers, but I can't fool the site into thinking my script is human. This is in Node.js.

This may or may not have a bearing on my issue, but this code will hopefully live on the backend of a React site I'm building. I'm not trying to scrape the site as a one-off. I would like my site to read a little bit of content from this site automatically, instead of my having to update it manually whenever that content changes.

Upvotes: 3

Views: 4352

Answers (1)

VPaul

Reputation: 1013

Not every site can be accessed with axios or curl. Sites use various kinds of checks, such as bot detection that inspects request headers or expects JavaScript to run in the page, that can prevent a client other than a browser from accessing them directly.

You can fetch the page with phantom (https://www.npmjs.com/package/phantom) instead, which loads it in a headless browser. This is commonly used by scrapers, and if you're afraid that the other site may block you for repeated access, you can wait a random interval between requests. If you need to read something from the returned HTML page, you can use cheerio (https://www.npmjs.com/package/cheerio).
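As a rough illustration of the random-interval idea, here is a minimal sketch. The helper names (randomDelayMs, sleep, fetchAllWithRandomPauses) and the millisecond bounds are made up for this example and are not part of phantom or axios; fetchPage stands for whatever function you use to load a page.

```javascript
// Hypothetical helpers for illustration; none of these come from a library.

// Return a random whole number of milliseconds between minMs and maxMs (inclusive).
function randomDelayMs(minMs, maxMs) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

// Resolve after the given number of milliseconds.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Call fetchPage(url) for each URL in turn, pausing a random interval
// before every request except the first.
async function fetchAllWithRandomPauses(urls, fetchPage, minMs, maxMs) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) {
      await sleep(randomDelayMs(minMs, maxMs));
    }
    results.push(await fetchPage(urls[i]));
  }
  return results;
}
```

You would pass your own page-loading function as fetchPage, for example one that wraps the phantom calls, and pick bounds such as a few seconds that suit how often you need fresh content.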

Hope it helps.

Below is the code that I tried, and it worked for your URL:

const phantom = require('phantom');

(async () => {
    const url = "https://www.bhgre.com/Better-Homes-and-Gardens-Real-Estate-Blu-Realty-49231c/Brady-Johnson-7469865a";
    const instance = await phantom.create(['--ignore-ssl-errors=yes', '--load-images=no']);
    const page = await instance.createPage();
    const status = await page.open(url);

    if (status !== 'success') {
      console.error(status);
      await instance.exit();
      return;
    }

    const content = await page.property('content');
    await instance.exit();
    console.log(content);
})();

Upvotes: 4
