Shakti Sisodiya

Reputation: 228

How to get CSS and JS files when scraping a web page using PhantomJS

I am working on a project where I need to scrape web pages, so I went through some tutorials and found that PhantomJS would be the best choice, as it can return the HTML content of AngularJS and AJAX-based sites. I have already written code for this and it works fine. The problem is that I am not able to get the CSS and JS files when the page references them with only a short (relative) path.

If the target site uses a full URL, like below:

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>

it works fine, because the site uses a full URL for the JS file which I can use directly. But if the site uses a relative URL:

<script src="assets/js/jquery.min.js"></script>

then I have a problem: I am not able to get the CSS and JS for the HTML content I fetched. Here is the PhantomJS code I have written so far:

var page = new WebPage();
var fs = require('fs');

page.onLoadFinished = function() {
  console.log("page load finished");
  page.render('export.png');
  fs.write('1.html', page.content, 'w');
  phantom.exit();
};

page.open("http://insttaorder.com/", function() {
  page.evaluate(function() {
  });
});

What I need is all the CSS and JS files on my local computer. I have searched on Google and GitHub but did not find a specific solution for this.
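For context, a relative `src` like `assets/js/jquery.min.js` resolves against the URL of the page that references it. A minimal sketch of that resolution, using the WHATWG `URL` constructor (available in modern Node.js and browsers, but not inside PhantomJS's own runtime):

```javascript
// Illustration: how the short path from the example resolves against the
// page URL. Uses the WHATWG URL constructor (Node.js / modern browsers;
// not available inside PhantomJS's own JS engine).
const pageUrl = 'http://insttaorder.com/';
const resolved = new URL('assets/js/jquery.min.js', pageUrl).href;
console.log(resolved); // → http://insttaorder.com/assets/js/jquery.min.js
```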

Upvotes: 0

Views: 1938

Answers (2)

Vaviloff

Reputation: 16838

The strategy for solving this task is:

  • Open the page in PhantomJS
  • Enumerate all the links to JS and CSS resources
  • Download them all

Although PhantomJS could be used to download and save the files itself, that would be quite suboptimal. Instead, let us follow the Unix philosophy that one program should do just one job and do it well. We will use the excellent wget utility to download the files from a list that PhantomJS will prepare.

var page = require('webpage').create();
var fs = require('fs');

page.open('http://insttaorder.com/', function(status) 
{
    // Get all links to CSS and JS on the page
    var links = page.evaluate(function(){

        var urls = [];

        $("[rel=stylesheet]").each(function(i, css){ 
            urls.push(css.href);
        });

        $("script").each(function(i, js){
            if(js.src) {
                urls.push(js.src);
            }
        });

        return urls;
    });

    // Save all links to a file
    var url_file = "list.txt";
    fs.write(url_file, links.join("\n"), 'w');

    // Launch wget program to download all files from the list.txt to current folder
    require("child_process").execFile("wget", ["-i", url_file], null, function (err, stdout, stderr) {

      console.log("execFileSTDOUT:", stdout);
      console.log("execFileSTDERR:", stderr);

      // After wget finished exit PhantomJS
      phantom.exit();

    });

});
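One caveat with the snippet above: `$` inside `page.evaluate` only exists if the target page itself loads jQuery. If it does not, the same collection step can be written with plain DOM calls. A sketch, where `collectAssetLinks` is a hypothetical helper you would define inside `page.evaluate` and call with `return collectAssetLinks(document);`:

```javascript
// Hypothetical jQuery-free version of the link collection step.
// Reading link.href and script.src via the DOM already yields absolute URLs,
// even when the markup only contains a short relative path.
function collectAssetLinks(doc) {
  var urls = [];
  var sheets = doc.querySelectorAll('link[rel=stylesheet]');
  for (var i = 0; i < sheets.length; i++) {
    urls.push(sheets[i].href);
  }
  var scripts = doc.querySelectorAll('script[src]');
  for (var j = 0; j < scripts.length; j++) {
    urls.push(scripts[j].src);
  }
  return urls;
}
```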

Upvotes: 2

Wenli He

Reputation: 353

You can get all requested resources via the onResourceRequested event. By checking the request method and URL, you can filter out the resources you don't want and download the rest yourself later.

You don't need to worry about relative paths; the URL you get from the event is always absolute.

var webPage = require('webpage');
var page = webPage.create();

page.onResourceRequested = function(req) {
  if (req.method === 'GET') {
    if (req.url.endsWith('.css')) {
      console.log('requested css file', JSON.stringify(req));
    } else if (req.url.endsWith('.js')) {
      console.log('requested js file', JSON.stringify(req));
    }
  }
};

page.open('http://insttaorder.com/', function(status) {
  // By this point every resource request has been logged
  phantom.exit();
});
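One wrinkle: a plain `endsWith` check misses URLs that carry a query string, such as `jquery.min.js?v=1.12.4`. A small helper that strips the query and fragment before checking the extension (a sketch; `assetType` is a hypothetical name, not a PhantomJS API):

```javascript
// Hypothetical helper: classify a requested URL as css/js, ignoring any
// query string or fragment (a bare endsWith check would miss "app.js?v=2").
function assetType(url) {
  var path = url.split('?')[0].split('#')[0];
  if (path.endsWith('.css')) return 'css';
  if (path.endsWith('.js')) return 'js';
  return null;
}
console.log(assetType('http://insttaorder.com/assets/js/jquery.min.js?v=1.12.4')); // → js
```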

More about onResourceRequested

Upvotes: 1
