Samuel Private

Reputation: 31

Get the FULL HTML content of a web page (including JavaScript-generated content)

After several hours of trying and reading, I'm still a bit lost on the subject in the title.

My problem: I am trying to get the full HTML content of a single web page, including the HTML that JavaScript appends or adds. What I have already tried:

So now the question is: how can I imitate the "Save as" function of a browser, or, more generally, how can I get the full HTML content first AND then use Jsoup to scan the final, static HTML content?

Thanks a lot for your advice and your help!

Upvotes: 0

Views: 5114

Answers (2)

Samuel Private

Reputation: 31

I finally got what I wanted. I will try to explain it for those who need some help!


So! The process consists of two steps:

  • First, get the final HTML content (including the HTML generated by JavaScript, etc.), as if you were visiting the web page, and save it to a simple file, file.html.
  • Then, use the Jsoup library to extract the wanted content from the saved file, file.html.

1 - Get HTML content and save it

For this step, you will need to download PhantomJS and use it to fetch the content. Here is the code to get the target page. Just replace myTargetedPage.com with the URL of the page you want to get, and mySaveFile.html with the name of the file you want to save to.

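var page = require('webpage').create();
var fs = require('fs');
page.open('http://myTargetedPage.com', function (status) {
    // Stop early if the page could not be loaded
    if (status !== 'success') {
        console.log('Unable to load the page');
        phantom.exit(1);
        return;
    }
    // page.content holds the current DOM, including HTML added by JavaScript
    fs.write('mySaveFile.html', page.content, 'w');
    phantom.exit();
});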

The saved file should now contain the same HTML as the page loaded in your browser, including the content added by JavaScript.
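If you want to launch step 1 from Java instead of typing the command by hand, something like the following might work. This is only a sketch under a few assumptions: the PhantomJS script above is saved as save.js (a hypothetical file name), the phantomjs binary is on your PATH, and PhantomRunner, buildCommand and run are names I made up for the illustration.

```java
import java.io.IOException;
import java.util.List;

class PhantomRunner {
    // Builds the command line "phantomjs <script>".
    // Both values are assumptions: adjust them to your own installation.
    static List<String> buildCommand(String phantomBinary, String scriptPath) {
        return List.of(phantomBinary, scriptPath);
    }

    // Starts PhantomJS, waits for it to finish and returns its exit code.
    static int run(String phantomBinary, String scriptPath)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(phantomBinary, scriptPath))
                .inheritIO() // forward PhantomJS output to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            int exitCode = run("phantomjs", "save.js");
            System.out.println("phantomjs exited with code " + exitCode);
        } catch (IOException e) {
            // Typically means the binary was not found on the PATH
            System.out.println("Could not start phantomjs: " + e.getMessage());
        }
    }
}
```

Once the process has exited with code 0, mySaveFile.html should be on disk and ready for step 2.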

2 - Extract the content you wanted

Now we will use Java and the Jsoup library to get our specific content. In my example, I want to get this part of the web page:

/* HTML CONTENT */
<span class="my class" data="data1"></span>
/* HTML CONTENT */
<span class="my class" data="data2"></span>
/* HTML CONTENT */

To get this, the following code will do (don't forget to edit thePathToYourSavedFile.html):

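import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractData {
    public static void main(String[] args) throws Exception {
        // Path to the file saved by PhantomJS in step 1
        File input = new File("thePathToYourSavedFile.html");

        // Parse the local file; Jsoup.connect() only accepts http(s) URLs
        Document document = Jsoup.parse(input, "UTF-8");

        Elements spanList = document.select("span");

        for (Element span : spanList) {
            // Keep only the spans whose class attribute is exactly "my class"
            if (span.attr("class").equals("my class")) {
                String data = span.attr("data");
                System.out.println("data : " + data);
            }
        }
    }
}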
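As a variation, the filtering could probably be pushed into the selector itself, so that Jsoup only returns the spans you care about. The sketch below parses an inline HTML string instead of the saved file so that it can be tried on its own; SelectorExample and extractData are names invented for this example, and the sample markup just mirrors the snippet shown earlier.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorExample {
    // Returns the "data" attribute of every span whose class attribute is "my class".
    static List<String> extractData(String html) {
        Document document = Jsoup.parse(html);
        List<String> values = new ArrayList<>();
        // Attribute selector: matches on the whole value of the class attribute
        for (Element span : document.select("span[class=\"my class\"]")) {
            values.add(span.attr("data"));
        }
        return values;
    }

    public static void main(String[] args) {
        // Sample markup (assumption: same shape as the spans in the saved page)
        String html = "<span class=\"my class\" data=\"data1\"></span>"
                + "<span class=\"other\" data=\"ignored\"></span>"
                + "<span class=\"my class\" data=\"data2\"></span>";
        System.out.println(extractData(html));
    }
}
```

Note that class="my class" really declares two CSS classes, my and class, so a selector such as span.my.class would also match spans that list those classes in another order or alongside others. The attribute selector stays closer to the equals() check in the loop above.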

Enjoy!

Upvotes: 2

Sari Rahal

Reputation: 1955

There is a nice plugin that gives you what you are looking for. It offers a way to inspect a page and its functionality. It is available for some browsers, but not all. Here is the link: http://chrispederick.com/work/web-developer/

P.S. After you install it, there is a little gear icon in the toolbar at the top right. That is where all the functions are.

Upvotes: 0
