Reputation: 2325
I am doing an assignment for one of my classes.
I am supposed to write a webcrawler that download files and images from a website given a specified crawl depth.
I am allowed to use third party parsing api so I am using Jsoup. I've also tried htmlparser. Both nice softwares but they are not perfect.
I used the default java URLConnection to check content type before processing the url but it becomes really slow as the number of links grows.
Question : Anyone know any specialized parser api for images and links ?
I could start writing mine using Jsoup but am being lazy. Besides why reinvent the wheel if there could be a working solution out there? Any help would be appreciated.
i need to check contentType while looping through the links to check if the link is to a file, in an effective way but Jsoup does not have what i need. Heres what i have: **
HttpConnection mimeConn =null;
Response mimeResponse = null;
for(Element link: links){
String linkurl =link.absUrl("href");
if(!linkurl.contains("#")){
if(DownloadRepository.curlExists(link.absUrl("href"))){
continue;
}
mimeConn = (HttpConnection) Jsoup.connect(linkurl);
mimeConn.ignoreContentType(true);
mimeConn.ignoreHttpErrors(true);
mimeResponse =(Response) mimeConn.execute();
WebUrl webUrl = new WebUrl(linkurl,currentDepth+1);
String contentType = mimeResponse.contentType();
if(contentType.contains("html")){
page.addToCrawledPages(new WebPage(webUrl));
}else if(contentType.contains("image")){
page.addToImages(new WebImage(webUrl));
}else{
page.addToFiles(new WebFile(webUrl));
}
DownloadRepository.addCrawledURL(linkurl);
}**
UPDATE Based on Yoshi's answer, I was able to get my code to work right. Here's the link:
Upvotes: 2
Views: 8928
Reputation: 1779
Use jSoup i think this API is good enough for your purpose. Also you can find good Cookbook on this site.
Several steps:
You don't need to use URLConnection class, jSoup have wrapper for it.
e.g
You can use only one line of code to get DOM object:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Instead of this code:
URL oracle = new URL("http://www.oracle.com/");
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(
yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
System.out.println(inputLine);
in.close();
Update1 try to add in your code next lines:
Connection.Response res = Jsoup.connect("http://en.wikipedia.org/").execute();
String pageContentType = res.contentType();
Upvotes: 5