Web Crawler specifically for downloading images and files

Question

I am doing an assignment for one of my classes.

I am supposed to write a web crawler that downloads files and images from a website, given a specified crawl depth.

I am allowed to use a third-party parsing API, so I am using Jsoup. I've also tried htmlparser. Both are nice pieces of software, but they are not perfect.

I used the default Java URLConnection to check the content type before processing each URL, but it becomes really slow as the number of links grows.

Question: Does anyone know of a specialized parser API for images and links?

I could start writing my own using Jsoup, but I am being lazy. Besides, why reinvent the wheel if there could be a working solution out there? Any help would be appreciated.

I need to check the content type while looping through the links, in an efficient way, to see whether each link points to a file, but Jsoup does not have what I need. Here's what I have:

    // Uses org.jsoup.Jsoup, org.jsoup.Connection and org.jsoup.nodes.Element.
    for (Element link : links) {
        String linkUrl = link.absUrl("href");

        // Skip in-page anchors and URLs that were already crawled.
        if (linkUrl.contains("#") || DownloadRepository.curlExists(linkUrl)) {
            continue;
        }

        // Fetch the link without rejecting non-HTML content or HTTP errors.
        Connection.Response mimeResponse = Jsoup.connect(linkUrl)
                .ignoreContentType(true)
                .ignoreHttpErrors(true)
                .execute();

        WebUrl webUrl = new WebUrl(linkUrl, currentDepth + 1);
        String contentType = mimeResponse.contentType(); // may be null if the header is missing

        if (contentType != null && contentType.contains("html")) {
            page.addToCrawledPages(new WebPage(webUrl));
        } else if (contentType != null && contentType.contains("image")) {
            page.addToImages(new WebImage(webUrl));
        } else {
            page.addToFiles(new WebFile(webUrl));
        }

        DownloadRepository.addCrawledURL(linkUrl);
    }

UPDATE: Based on Yoshi's answer, I was able to get my code working. Here's the link:

https://github.com/unekwu/cs_nemesis/blob/master/crawler/crawler/src/cu/cs/cpsc215/project1/parser/Parser.java

Ishikawa Yoshi · Accepted Answer

Use Jsoup; I think its API is good enough for your purpose. You can also find a good Cookbook on its site.

Several steps:

1. Jsoup: how to get an image's absolute url?
2. how to download image from any web page in java
3. You can write your own recursive method that walks through the links on a page, following those that contain the required domain name or are relative links. Use it to collect all the links and find all the images on each page. Writing it yourself is not bad practice.
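
The link-walking step above could be sketched roughly like this. It is only an illustration under a few assumptions: the class name DepthLimitedCrawler, the crawl method, and the linksOf function are made up for this example and are not part of Jsoup. In a real crawler, linksOf would fetch a page and return the absolute URLs of its links. The sketch uses a breadth-first loop instead of literal recursion, so that every page is reached at its shallowest depth, and it follows only links on the same host as the start URL.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper (not a Jsoup class): depth-limited, same-host link walk.
class DepthLimitedCrawler {

    // startUrl is depth 0; pages at maxDepth are recorded but not expanded.
    // linksOf is an assumed callback returning the absolute URLs found on a page.
    static List<String> crawl(String startUrl, int maxDepth,
                              Function<String, List<String>> linksOf) {
        String startHost = hostOf(startUrl);
        // Doubles as the "already seen" set and remembers each URL's depth.
        Map<String, Integer> depthOf = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depthOf.put(startUrl, 0);
        queue.add(startUrl);

        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = depthOf.get(url);
            if (depth >= maxDepth) {
                continue; // depth limit reached: do not follow this page's links
            }
            for (String link : linksOf.apply(url)) {
                boolean sameHost = startHost != null
                        && startHost.equalsIgnoreCase(hostOf(link));
                if (sameHost && !depthOf.containsKey(link)) {
                    depthOf.put(link, depth + 1);
                    queue.add(link);
                }
            }
        }
        // URLs in the order they were first discovered.
        return new ArrayList<>(depthOf.keySet());
    }

    // Returns the host part of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

With Jsoup, linksOf could be built from Jsoup.connect(url).get().select("a[href]") and absUrl("href") on each element, wrapped in whatever error handling you need.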

You don't need to use the URLConnection class; Jsoup has a wrapper for it.

For example, a single line of code gets you the DOM object:

    Document doc = Jsoup.connect("http://en.wikipedia.org/").get();

Instead of this code:

    URL oracle = new URL("http://www.oracle.com/");
    URLConnection yc = oracle.openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(
                                yc.getInputStream()));
    String inputLine;
    while ((inputLine = in.readLine()) != null) 
        System.out.println(inputLine);
    in.close();

Update 1: try adding the following lines to your code:

    Connection.Response res = Jsoup.connect("http://en.wikipedia.org/").execute();
    String pageContentType = res.contentType();
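
Once you have the content type string, sorting links into pages, images, and other files could look something like the sketch below. The ContentKinds class, the Kind enum, and the classify method are hypothetical names for this example, not part of Jsoup. The sketch assumes that a missing header (contentType() can return null) should be treated as a plain file, and that only text/html and application/xhtml+xml count as crawlable pages.

```java
import java.util.Locale;

// Hypothetical helper: maps a Content-Type header value to the three buckets
// used in the question (HTML pages, images, other files).
class ContentKinds {

    enum Kind { HTML, IMAGE, FILE }

    static Kind classify(String contentType) {
        if (contentType == null) {
            return Kind.FILE; // assumption: no header means "just download it"
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (mime.equals("text/html") || mime.equals("application/xhtml+xml")) {
            return Kind.HTML;
        }
        if (mime.startsWith("image/")) {
            return Kind.IMAGE;
        }
        return Kind.FILE;
    }
}
```

You would call it as ContentKinds.classify(res.contentType()) and branch on the result. Comparing the media type exactly, rather than checking whether the string contains "html" or "image", avoids misclassifying types that merely include those words.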
