Reputation: 111
I need a scalable, automated method of dumping the contents of "view page source", after manipulation, to a file. This non-interactive method would be (more or less) identical to an army of humans navigating my list of URLs and dumping "view page source" to a file. Programs such as wget or curl will non-interactively retrieve a set of URLs, but they do not execute JavaScript or any of that 'fancy stuff'.
My ideal solution looks like any of the following (fantasy solutions):
cat urls.txt | google-chrome --quiet --no-gui \
--output-sources-directory=~/urls-source
(fantasy command line, no idea if flags like these exist)
or
cat urls.txt | python -c "import some-library; \
... use some-library to process urls.txt ; output sources to ~/urls-source"
As a secondary concern, I also need a way to capture a screenshot of each rendered page.
Upvotes: 1
Views: 1022
Reputation: 378
You can use the IRobotSoft web scraper to automate this. The page source after manipulation is in the UpdatedPage variable; you only need to save that variable to a file.
It also has a CapturePage() function to capture the web page to an image file.
Upvotes: 0
Reputation: 1693
HtmlUnit does execute JavaScript. I'm not sure whether you can obtain the HTML code after DOM manipulation, but give it a try.
You could write a little Java program that fits your requirements and execute it from the command line, as in your examples.
I haven't tried the code below, I just had a look at the Javadoc:
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public static void main(String[] args) throws Exception {
    String pageURL = args[0]; // first command-line argument is the URL
    WebClient webClient = new WebClient();
    HtmlPage page = webClient.getPage(pageURL);
    // asXml() returns the page markup after JavaScript has modified the DOM
    // (asText() would return only the visible text)
    String pageContents = page.asXml();
    // Save the resulting page to a file
}
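I haven't tried this either, but to handle a whole urls.txt like in your question, the same idea could loop over the file. This is an untested sketch assuming HtmlUnit's standard WebClient/HtmlPage API; the class name, output-directory argument, and file naming are just placeholders:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class DumpSources {
    public static void main(String[] args) throws Exception {
        // args[0]: file with one URL per line, args[1]: output directory (placeholders)
        List<String> urls = Files.readAllLines(Paths.get(args[0]));
        WebClient webClient = new WebClient();
        for (int i = 0; i < urls.size(); i++) {
            HtmlPage page = webClient.getPage(urls.get(i));
            // asXml() returns the markup after JavaScript has modified the DOM
            Files.write(Paths.get(args[1], "source-" + i + ".html"),
                        page.asXml().getBytes("UTF-8"));
        }
    }
}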
EDIT:
Selenium (another web testing framework) can apparently take page screenshots, which would cover your secondary concern.
Search for selenium.captureScreenshot.
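I haven't tried this myself; a rough sketch with the Selenium RC Java client might look like the following (it assumes a Selenium RC server is already running on localhost:4444 and that Firefox is available; the URL and output path are placeholders):

import com.thoughtworks.selenium.DefaultSelenium;
import com.thoughtworks.selenium.Selenium;

public class ScreenshotExample {
    public static void main(String[] args) {
        // Assumes a Selenium RC server is already running on localhost:4444
        Selenium selenium = new DefaultSelenium("localhost", 4444, "*firefox", "http://example.com/");
        selenium.start();
        selenium.open("http://example.com/");
        // Writes a PNG screenshot to the given path on the machine running the RC server
        selenium.captureScreenshot("/tmp/example.png");
        selenium.stop();
    }
}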
Upvotes: 1