Reputation: 47
I am writing a web scraper for an NYC database of buildings, and I need to get the HTML of the actual website. For some reason, when I use the URL of the site I want to scrape, my program does nothing. When I use the URL of almost any other website, I get the HTML I requested. Is this because I am trying to scrape a government site?
var request = require("request");
request(
  { uri: "http://a810-bisweb.nyc.gov/bisweb/JobsQueryByNumberServlet?requestid=3&passjobnumber=123768556&passdocnumber=01" },
  function(error, response, body) {
    console.log(body);
    console.log("hello");
  }
);
I expected to receive the HTML as a string printed in my console, but instead I get nothing. Even the "hello" is not printed. However, when I try any other site, I get the actual HTML string.
Upvotes: 0
Views: 150
Reputation: 47
For anyone wondering, I was able to work around the restrictions the site set up by using Tampermonkey. I only needed access to the DOM anyway, and Tampermonkey let me run a script as soon as I entered the site.
Upvotes: 0
Reputation: 1612
The URL you are requesting returns an access-denied response.
I prefer the event-based (streaming) API of request, so the following code
var request = require("request");
request
  .get("http://a810-bisweb.nyc.gov/bisweb/JobsQueryByNumberServlet?requestid=3&passjobnumber=123768556&passdocnumber=01")
  .on('response', function(response) {
    console.log('Hello');
    console.log(response.statusCode);
    console.log(response.headers['content-type']);
  })
  .on('error', function(error) {
    console.log(error);
  });
will print out
Hello
403
text/html
My guess is that the site sets cookies or keeps some session state, and the 403 comes from going directly to the target URL instead of hitting the front page first. I also get the 403 in the browser if I go directly to the URL, but if I visit the front page first and then the URL, I get the page.
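If the guess about cookies is right, a rough sketch of "front page first, then the target URL" could look like the code below. It uses Node's built-in http module instead of request so that it has no dependencies; with request itself, passing the jar: true option should have a similar effect. The names cookieHeaderFrom and fetchWithSession are made up for this example, the front page URL in the usage comment is an assumption, and there is no guarantee this is enough to get past the 403 on this particular site.

```javascript
var http = require("http");

// Hypothetical helper: turn the Set-Cookie headers of a response into a
// Cookie request header value. Only the name=value pair of each cookie is
// kept; attributes such as Path or Expires are dropped.
function cookieHeaderFrom(setCookieHeaders) {
  return (setCookieHeaders || [])
    .map(function (line) { return line.split(";")[0].trim(); })
    .join("; ");
}

// Hypothetical helper: request frontPageUrl first, then request targetUrl
// while sending back whatever cookies the front page set.
// callback receives (error, statusCode, body).
function fetchWithSession(frontPageUrl, targetUrl, callback) {
  http.get(frontPageUrl, function (first) {
    first.resume(); // discard the front page body; only its cookies are used
    var cookie = cookieHeaderFrom(first.headers["set-cookie"]);
    var headers = cookie ? { Cookie: cookie } : {};
    http.get(targetUrl, { headers: headers }, function (second) {
      var body = "";
      second.setEncoding("utf8");
      second.on("data", function (chunk) { body += chunk; });
      second.on("end", function () { callback(null, second.statusCode, body); });
    }).on("error", callback);
  }).on("error", callback);
}

// Usage (not run here; the front page URL is an assumption):
// fetchWithSession(
//   "http://a810-bisweb.nyc.gov/bisweb/",
//   "http://a810-bisweb.nyc.gov/bisweb/JobsQueryByNumberServlet?requestid=3&passjobnumber=123768556&passdocnumber=01",
//   function (error, statusCode, body) {
//     if (error) { return console.log(error); }
//     console.log(statusCode);
//     console.log(body);
//   }
// );
```

Checking the status code in the callback shows right away whether the second request still gets a 403, in which case the site is probably checking something other than cookies.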
Upvotes: 2