madruk20

Reputation: 33

Redirected when making HTTP request for scraping content

I'm relatively new to scraping and wanted to try this as a learning experience. My end goal is to scrape item stats from a game website, https://lucy.allakhazam.com/, and post them via a Discord bot. However, I've run into a problem just trying to load the HTML from the site, and I'm not sure what's causing it.

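const request = require("request");
const cheerio = require("cheerio");

request("https://lucy.allakhazam.com/item.html?id=28855", function(error, response, html) {
  if(error) {
    console.log("Error: " + error);
    return; // response is undefined when the request fails
  }
  console.log("Status code: " + response.statusCode);

  var $ = cheerio.load(html);
  console.log(html);
});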

The only output from the console is:

<head><meta HTTP-EQUIV="Refresh" CONTENT="0; URL=/index.html?setcookie=1"></head>

I've tried other sites and can get the raw HTML from them, but not from this one, and I'm not sure why. Any help is appreciated, thank you!

Upvotes: 2

Views: 196

Answers (1)

ggorlen

Reputation: 57425

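The site answers the first request with a meta refresh tag in the HTML rather than an HTTP redirect, so the HTTP client won't follow it automatically. I'd use a promise-based HTTP client like fetch (native since Node 18), node-fetch or axios. One option is to hardcode the setcookie=1 parameter from the redirect into the item URL: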

const cheerio = require("cheerio"); // 1.0.0-rc.12

const url = "https://lucy.allakhazam.com/item.html?id=28855&setcookie=1";

fetch(url)
  .then(res => {
    if (!res.ok) {
      throw Error(res.statusText);
    }

    return res.text();
  })
  .then(html => {
    const $ = cheerio.load(html);
    const text = $(".shotdata")
      .contents()
      .get()
      .map(e => $(e).text().trim())
      .filter(e => e);
    console.log(text);
  });
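
If you plan to scrape more than one item, building the URL with the standard URL API avoids editing the query string by hand. This is only a minimal sketch: the helper name withSetCookie is hypothetical, and it assumes the site accepts setcookie=1 on every item page the same way it does in the example above.

```javascript
// Hypothetical helper: append setcookie=1 to an item URL.
// Assumption: the site treats this parameter the same way on every page.
const withSetCookie = itemUrl => {
  const url = new URL(itemUrl);
  url.searchParams.set("setcookie", "1"); // adds the parameter, or overwrites it if present
  return url.href;
};

console.log(withSetCookie("https://lucy.allakhazam.com/item.html?id=28855"));
```

The returned string can be passed straight to fetch in place of the hardcoded URL.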

If you need to handle a dynamic redirect, you could parse the redirect target out of the meta tag and perform a second request:

const cheerio = require("cheerio"); // 1.0.0-rc.12

const get = url =>
  fetch(url).then(res => {
    if (!res.ok) {
      throw Error(res.statusText);
    }

    return res.text();
  });

const url = "https://lucy.allakhazam.com/item.html?id=28855";
get(url)
  .then(html => {
    const $ = cheerio.load(html);
    const redirect = $('meta[http-equiv="Refresh"]')
      .attr("content")
      .split("/")
      .at(-1);
    return get(`${new URL(url).origin}/${redirect}`);
  })
  .then(html => {
    const $ = cheerio.load(html);
    const text = $(".shotdata")
      .contents()
      .get()
      .map(e => $(e).text().trim())
      .filter(e => e);
    console.log(text);
  });
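
Splitting the content attribute on "/" works for the redirect shown in the question, but it would drop directories from a deeper path. A slightly more general way to resolve the target is sketched below, assuming the attribute has the usual "seconds; URL=target" shape. The function name parseRefreshTarget is hypothetical and not part of cheerio or fetch.

```javascript
// Hypothetical helper: turn a meta refresh `content` value into an absolute URL.
// Assumption: the value looks like "0; URL=/path" (the target may be quoted).
const parseRefreshTarget = (content, baseUrl) => {
  const match = /url\s*=\s*['"]?([^'"]+)['"]?\s*$/i.exec(content);
  if (!match) {
    return null; // no URL part, e.g. a plain "5" that just reloads the page
  }
  // new URL resolves relative targets against the page that was requested
  return new URL(match[1].trim(), baseUrl).href;
};

const base = "https://lucy.allakhazam.com/item.html?id=28855";
console.log(parseRefreshTarget("0; URL=/index.html?setcookie=1", base));
```

In the second example above, this could be called on the result of $('meta[http-equiv="Refresh"]').attr("content") in place of the split("/").at(-1) step.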

Upvotes: 2
