chrisdunnbirch

Reputation: 1

Is it possible to scrape any given URL with NodeJS?

I'll preface this by saying that this is new to me and purely a learning exercise, so please excuse any naivety.

I've been looking through some articles on scraping and it seems that NodeJS, ExpressJS, Request and Cheerio would be my preferred method as a Front-End guy who is comfortable with JS/jQuery.

All the articles I've read so far focus on scraping data from a specific website in the absence of an API. What I want to build, to start with, is a tool that takes any given URL and returns true/false for each entry in a list of common libraries (is it used?) and social networks (is it linked?).

For example, a user enters a URL and the results say "This website uses jQuery, MooTools, BackboneJS, AngularJS, etc." and "This website is linked with Facebook, Twitter, etc.". This is somewhat similar to Tregia: http://www.tregia.com/process?q=http://smashingmagazine.com.

Is my chosen setup (above) appropriate, or is it limited to scraping specific pages because it relies on CSS selectors?

Upvotes: 0

Views: 106

Answers (1)

Sleep Deprived Bulbasaur

Reputation: 2458

You should be able to scrape any publicly accessible page, find its script tags, and read from them which tools the site is using. Keep in mind that the files may have been renamed (e.g. angularjs3.1.0.js -> foobar.js) to keep people from knowing the stack. You should also be able to extract the text or attributes of any other tags you consider relevant.
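As a rough illustration of the idea, here is a minimal sketch that uses only Node's built-ins (it assumes Node 18+ for the global `fetch`). The `LIBRARIES` and `SOCIAL` tables and the function names `attrValues`, `checkSignatures`, `detect` and `scan` are made up for this example, and the regex-based tag matching is a simplification; Cheerio selectors such as `$('script[src]')` would be the more robust way to collect the same attributes. Because it matches on file names and link URLs, it is subject to the renaming caveat above.

```javascript
// Hypothetical signature tables: names and patterns are illustrative, not exhaustive.
const LIBRARIES = {
  jQuery: /jquery/i,
  MooTools: /mootools/i,
  BackboneJS: /backbone/i,
  AngularJS: /angular/i,
};
const SOCIAL = {
  Facebook: /facebook\.com/i,
  Twitter: /twitter\.com/i,
};

// Collect the quoted values of one attribute from every occurrence of a tag.
// Simplified: unquoted attribute values are not handled.
function attrValues(html, tag, attr) {
  const re = new RegExp(`<${tag}\\b[^>]*\\b${attr}\\s*=\\s*["']([^"']+)["']`, 'gi');
  return Array.from(html.matchAll(re), (m) => m[1]);
}

// Return { name: true/false } for each signature, given a list of URLs.
function checkSignatures(signatures, urls) {
  const result = {};
  for (const [name, pattern] of Object.entries(signatures)) {
    result[name] = urls.some((u) => pattern.test(u));
  }
  return result;
}

// Works on any HTML string, so it is not tied to one site's markup.
function detect(html) {
  return {
    libraries: checkSignatures(LIBRARIES, attrValues(html, 'script', 'src')),
    social: checkSignatures(SOCIAL, attrValues(html, 'a', 'href')),
  };
}

// Fetch a page and run the detection on its HTML.
async function scan(url) {
  const response = await fetch(url);
  return detect(await response.text());
}
```

Calling `scan('http://smashingmagazine.com').then(console.log)` would print an object of true/false flags. Since the check only looks at `script` and `a` tags, which every page has, it does not depend on site-specific CSS selectors.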

You should also check and respect each site's robots.txt.
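A minimal sketch of such a check, again assuming Node 18+ for the global `fetch`. The function names `disallowedPrefixes`, `isAllowed` and `mayScrape` are made up for this example. It is deliberately simplified: it only reads the `Disallow` rules in groups addressed to `User-agent: *`, treats them as plain path prefixes, and ignores `Allow` lines and wildcards, so a dedicated robots.txt parser would be the safer choice for real use.

```javascript
// Return the Disallow prefixes from groups that apply to all agents ("*").
function disallowedPrefixes(robotsTxt) {
  const prefixes = [];
  let appliesToAll = false;
  let lastWasAgent = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*/, '').trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep === -1) continue; // blank or malformed line
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      appliesToAll = lastWasAgent ? appliesToAll || value === '*' : value === '*';
      lastWasAgent = true;
    } else {
      lastWasAgent = false;
      // An empty Disallow value means "nothing is disallowed".
      if (field === 'disallow' && appliesToAll && value !== '') prefixes.push(value);
    }
  }
  return prefixes;
}

// True if no applicable Disallow prefix matches the path.
function isAllowed(robotsTxt, path) {
  return !disallowedPrefixes(robotsTxt).some((prefix) => path.startsWith(prefix));
}

// Fetch the site's robots.txt and check the page's path against it.
async function mayScrape(pageUrl) {
  const url = new URL(pageUrl);
  const response = await fetch(new URL('/robots.txt', url.origin));
  // Assumption: a missing or unreadable robots.txt means no restrictions.
  if (!response.ok) return true;
  return isAllowed(await response.text(), url.pathname);
}
```

A scraper could call `await mayScrape(url)` first and skip the page when it resolves to `false`.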

Edit: You probably won't be able to scrape "members"/"login only" areas of sites, though.

Upvotes: 1
