Reputation: 111
I need a scalable, automated method of dumping the contents of "view page source", after manipulation, to a file. This non-interactive method would be (more or less) identical to an army of humans navigating my list of URLs and dumping "view page source" to a file. Programs such as wget or curl will non-interactively retrieve a set of URLs, but they do not execute JavaScript or any of that 'fancy stuff'.
My ideal solution looks like any of the following (fantasy solutions):
cat urls.txt | google-chrome --quiet --no-gui \
--output-sources-directory=~/urls-source
(fantasy command line, no idea if flags like these exist)
or
cat urls.txt | python -c "import some-library; \
... use some-library to process urls.txt ; output sources to ~/urls-source"
As a secondary concern, I also need a way to capture a screenshot of each rendered page.
Upvotes: 1
Views: 1022
Reputation: 378
You can use the IRobotSoft web scraper to automate this. The page source after manipulation is in the UpdatedPage variable; you only need to save that variable to a file.
It also has a CapturePage() function to capture the web page to an image file.
Upvotes: 0
Reputation: 1693
HtmlUnit does execute JavaScript. I'm not sure whether you can obtain the HTML code after DOM manipulation, but give it a try.
You could write a little Java program that fits your requirements and execute it from the command line, as in your examples.
I haven't tried the code below, I just had a look at the Javadoc:
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public static void main(String[] args) throws Exception {
    String pageURL = args[0]; // first command-line argument is the URL
    WebClient webClient = new WebClient();
    HtmlPage page = webClient.getPage(pageURL);
    // asXml() returns the page markup after JavaScript has modified the DOM
    // (asText() would return only the visible text)
    String pageContents = page.asXml();
    // Save the resulting page to a file
}
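I haven't tried this either, but to handle a whole urls.txt like in your question, the same idea could loop over the file. This is an untested sketch assuming HtmlUnit's standard WebClient/HtmlPage API; the class name, output-directory argument, and file naming are just placeholders:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class DumpSources {
    public static void main(String[] args) throws Exception {
        // args[0]: file with one URL per line, args[1]: output directory (placeholders)
        List<String> urls = Files.readAllLines(Paths.get(args[0]));
        WebClient webClient = new WebClient();
        for (int i = 0; i < urls.size(); i++) {
            HtmlPage page = webClient.getPage(urls.get(i));
            // asXml() returns the markup after JavaScript has modified the DOM
            Files.write(Paths.get(args[1], "source-" + i + ".html"),
                        page.asXml().getBytes("UTF-8"));
        }
    }
}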
EDIT:
Selenium (another web testing framework) can apparently take page screenshots, which would cover your secondary concern.
Search for selenium.captureScreenshot.
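I haven't tried this myself; a rough sketch with the Selenium RC Java client might look like the following (it assumes a Selenium RC server is already running on localhost:4444 and that Firefox is available; the URL and output path are placeholders):

import com.thoughtworks.selenium.DefaultSelenium;
import com.thoughtworks.selenium.Selenium;

public class ScreenshotExample {
    public static void main(String[] args) {
        // Assumes a Selenium RC server is already running on localhost:4444
        Selenium selenium = new DefaultSelenium("localhost", 4444, "*firefox", "http://example.com/");
        selenium.start();
        selenium.open("http://example.com/");
        // Writes a PNG screenshot to the given path on the machine running the RC server
        selenium.captureScreenshot("/tmp/example.png");
        selenium.stop();
    }
}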
Upvotes: 1