rakesh
rakesh

Reputation: 23

Extracting contents from a webpage and comparing using Java

I am developing a Java project in which i have a sub-module where i need to extract contents [text, image, color] from a webpage and compare it with another webpage. I am planning to use WinHTTrack software for downloading the webpage locally, but the problem is it doesn't save it as HTML. How can i download a webpage with HTML extension using softwares such as WinHTTrack [or just saving the webpage through ctrl+s is enogh.?]. Also i am planning to use HTML Parsers to extract the 3 content types[text, image, color],after downloading the webpage locally. So which parser to go with.?

Upvotes: 0

Views: 348

Answers (1)

aretai
aretai

Reputation: 1641

WEll I use Httrack and it fetches html files as well. You are probably taking winhttrack project file as the only output file, but if you check inside the project directory there are html files (together with images, etc). I would suggest using - http://htmlparser.sourceforge.net/. It is a java library and since your project is a Java project it should be fairly easy to use it. You can also save the whole website locally using org.htmlparser.parserapplications.SiteCapturer (and specify whether resources such as images should be captured as well). Hope it helps.

Upvotes: 1

Related Questions