Extracting contents from a webpage and comparing using Java

Question

I am developing a Java project with a sub-module that needs to extract content (text, images, colors) from a webpage and compare it with another webpage. I am planning to use WinHTTrack to download the webpage locally, but the problem is that it doesn't save it as HTML. How can I download a webpage with an .html extension using software such as WinHTTrack (or is simply saving the page with Ctrl+S enough)? After downloading the page locally, I also plan to use an HTML parser to extract the three content types (text, images, colors). Which parser should I go with?

aretai · Accepted Answer

Well, I use HTTrack and it fetches HTML files as well. You are probably treating the WinHTTrack project file as the only output file, but if you look inside the project directory you will find the HTML files (together with images, etc.). For parsing, I would suggest HTML Parser (http://htmlparser.sourceforge.net/). It is a Java library, so it should be fairly easy to use in a Java project. You can also save a whole website locally using org.htmlparser.parserapplications.SiteCapturer (and specify whether resources such as images should be captured as well). Hope it helps.
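
To give a rough idea of the extract-and-compare step once both pages are saved locally, here is a minimal sketch. It deliberately does not use the HTML Parser library; it relies only on java.util.regex from the JDK so that it runs without extra dependencies, and a real parser would be far more robust on messy markup. The class name PageContentSketch and its methods are hypothetical, made up for this illustration. It assumes the page is already loaded into a String, that colors appear as hex values such as #fff or #ff0000 (a fragment link like href="#abc" would be a false positive), and that images are plain img tags with a quoted src attribute. HTML entities in the text are not decoded.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex-based extraction of text, image sources and hex colors.
final class PageContentSketch {

    // Quoted src attribute of an <img> tag (single or double quotes).
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // 3- or 6-digit hex colors; the lookbehind skips numeric entities like &#123;
    private static final Pattern HEX_COLOR = Pattern.compile(
            "(?<!&)#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\\b");

    // Collects the src values of all <img> tags, in document order.
    static Set<String> imageSources(String html) {
        Set<String> result = new LinkedHashSet<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }

    // Collects hex color values (lower-cased) from inline styles and <style> blocks.
    static Set<String> hexColors(String html) {
        Set<String> result = new LinkedHashSet<>();
        Matcher m = HEX_COLOR.matcher(html);
        while (m.find()) {
            result.add(m.group().toLowerCase(Locale.ROOT));
        }
        return result;
    }

    // Drops <script>/<style> blocks and all tags, then collapses whitespace.
    static String visibleText(String html) {
        String noScripts = html.replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ");
        String noTags = noScripts.replaceAll("<[^>]*>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    // Elements present in the first set but missing from the second.
    static <T> Set<T> onlyInFirst(Set<T> first, Set<T> second) {
        Set<T> diff = new LinkedHashSet<>(first);
        diff.removeAll(second);
        return diff;
    }

    public static void main(String[] args) {
        String pageA = "<body><h1>Hello</h1><img src=\"logo.png\">"
                + "<p style='color:#FF0000'>World</p></body>";
        String pageB = "<body><h1>Hello</h1><img src='banner.png'>"
                + "<p style='color:#fff'>World</p></body>";

        System.out.println("Same text: " + visibleText(pageA).equals(visibleText(pageB)));
        System.out.println("Images only in A: "
                + onlyInFirst(imageSources(pageA), imageSources(pageB)));
        System.out.println("Colors only in A: "
                + onlyInFirst(hexColors(pageA), hexColors(pageB)));
    }
}
```

For files saved by HTTrack, you would read each .html file into a String first (for example with java.nio.file.Files) and pass it to these methods. If you go with HTML Parser instead, the comparison logic stays the same and only the extraction methods change.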
