Reputation: 228
I am working on a project where I need to scrape a webpage, so I went through some tutorials and found that PhantomJS would be the best choice for it, as it lets us get the HTML content of AngularJS and AJAX-based sites. I have already written code for this and it works fine, but the problem is that I am not able to get the CSS and JS files when the page references them only by a short (relative) path.
If the target site uses the full URL, like below,
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
it works fine, because the full URL for the JS is right there for me to use. But if the site uses a relative URL,
<script src="assets/js/jquery.min.js"></script>
then it is a problem for me: I am not able to get the CSS and JS referenced by my current HTML content. Here is the PhantomJS code I have written so far:
var page = require('webpage').create();
var fs = require('fs');

page.onLoadFinished = function () {
    console.log("page load finished");
    page.render('export.png');
    fs.write('1.html', page.content, 'w');
    phantom.exit();
};

page.open("http://insttaorder.com/", function () {
    page.evaluate(function () {
    });
});
What I need is all the CSS and JS files on my local computer. I have searched on Google and GitHub but did not find any specific solution for this.
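As a side note on the relative-path problem itself: a relative reference like `assets/js/jquery.min.js` can always be turned into a downloadable absolute URL by resolving it against the page's URL. A minimal standalone sketch (shown with Node.js's built-in WHATWG `URL` class, outside PhantomJS; the page URL is the one from the question):

```javascript
// Resolve a relative asset path against the page URL.
var pageUrl = 'http://insttaorder.com/';

// A relative reference like the one in the question...
var relative = 'assets/js/jquery.min.js';

// ...resolves to an absolute URL that can actually be downloaded:
var absolute = new URL(relative, pageUrl).href;
console.log(absolute); // http://insttaorder.com/assets/js/jquery.min.js

// Already-absolute references pass through unchanged:
var cdn = 'https://ajax.googleapis.com/ajax/libs/jquery/1.12.4/jquery.min.js';
console.log(new URL(cdn, pageUrl).href === cdn); // true
```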
Upvotes: 0
Views: 1938
Reputation: 16838
The strategy for solving this task is the following: although PhantomJS could be used to download and save the files itself, that would be quite suboptimal. Instead, let us follow the Unix philosophy that one program should do just one job, but do it well. We will use the excellent wget utility to download the files from a list that PhantomJS prepares.
var page = require('webpage').create();
var fs = require('fs');

page.open('http://insttaorder.com/', function (status) {
    // Get all links to CSS and JS on the page.
    // Note: this uses jQuery, so it assumes the page itself loads jQuery.
    // The DOM properties css.href and js.src are already absolute URLs,
    // even when the markup contains only a relative path.
    var links = page.evaluate(function () {
        var urls = [];
        $("[rel=stylesheet]").each(function (i, css) {
            urls.push(css.href);
        });
        $("script").each(function (i, js) {
            if (js.src) {
                urls.push(js.src);
            }
        });
        return urls;
    });

    // Save all links to a file
    var url_file = "list.txt";
    fs.write(url_file, links.join("\n"), 'w');

    // Launch wget to download all files from list.txt into the current folder
    require("child_process").execFile("wget", ["-i", url_file], null, function (err, stdout, stderr) {
        console.log("execFileSTDOUT:", stdout);
        console.log("execFileSTDERR:", stderr);
        // After wget has finished, exit PhantomJS
        phantom.exit();
    });
});
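The `page.evaluate` step above relies on the target page loading jQuery. The same URL list can also be built without jQuery; as an illustration, here is a standalone Node.js sketch that extracts and resolves asset URLs from a saved HTML string (regex-based, adequate only for simple markup; inside PhantomJS you would use `document.querySelectorAll` in `page.evaluate` instead):

```javascript
// Build the list of absolute asset URLs from an HTML string, without jQuery.
function extractAssetUrls(html, pageUrl) {
    var urls = [];
    // Match src="..." on <script> tags and href="..." on <link> tags
    var re = /<(?:script[^>]*\bsrc|link[^>]*\bhref)=["']([^"']+)["']/g;
    var m;
    while ((m = re.exec(html)) !== null) {
        // Resolve relative references against the page URL
        urls.push(new URL(m[1], pageUrl).href);
    }
    return urls;
}

var html = '<link rel="stylesheet" href="assets/css/app.css">' +
           '<script src="assets/js/jquery.min.js"></script>';
console.log(extractAssetUrls(html, 'http://insttaorder.com/').join('\n'));
// http://insttaorder.com/assets/css/app.css
// http://insttaorder.com/assets/js/jquery.min.js
```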
Upvotes: 2
Reputation: 353
You can get all requested resources via the onResourceRequested
event.
By checking the request method and URL, you can filter out the resources you don't want and download them yourself later.
You don't need to worry about relative paths: the URL
you get from the event is always absolute.
var webPage = require('webpage');
var page = webPage.create();

page.onResourceRequested = function (req) {
    if (req.method === 'GET') {
        if (req.url.endsWith('.css')) {
            console.log('requested css file', JSON.stringify(req));
        } else if (req.url.endsWith('.js')) {
            console.log('requested js file', JSON.stringify(req));
        }
    }
};
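One caveat (an assumption worth checking on your PhantomJS version): PhantomJS ships an older JavaScript engine that may lack the ES6 `String.prototype.endsWith`, so the filtering logic can be sketched with an ES5-safe suffix check instead. This standalone version runs in any JavaScript environment:

```javascript
// ES5-safe replacement for String.prototype.endsWith
function hasSuffix(str, suffix) {
    return str.indexOf(suffix, str.length - suffix.length) !== -1;
}

// Classify a requested URL by its extension, as in the answer above
function classify(url) {
    if (hasSuffix(url, '.css')) return 'css';
    if (hasSuffix(url, '.js')) return 'js';
    return 'other';
}

console.log(classify('http://insttaorder.com/assets/js/jquery.min.js'));  // js
console.log(classify('http://insttaorder.com/assets/css/app.css'));       // css
console.log(classify('http://insttaorder.com/'));                         // other
```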
More about onResourceRequested
Upvotes: 1