Stefano Bragaglia

Reputation: 644

Obtain the raw html of pages fetched by Nutch 2.3.1

I'd like to train an NLP model on several Web pages to obtain good precision. Since I don't have the Web pages yet, I'm considering running a Web crawler on Amazon EMR. I'd like to use a distributed, extensible and scalable Open Source solution that respects robots.txt rules. After some research, I decided to adopt Apache Nutch.

I found this video by Nutch's main contributor Julien Nioche particularly useful to get started. Even though I used the latest available versions of Hadoop (Amazon 2.7.3) and Nutch (2.3.1), I managed to complete a small example job successfully.

Unfortunately, though, I couldn't find an easy way to retrieve the raw HTML of the fetched pages from Nutch's output. While looking for a solution to this problem, I found a few other useful resources (in addition to Nutch's own wiki and tutorial pages).

Some of them (like this answer or this page) suggest implementing a new plugin (or modifying an existing one): the overall idea is to add a few lines of code that save the content of any fetched HTML page to a file before it is sent to a segment.

Others (like this answer) suggest implementing a simple post-processing tool that accesses the segments, goes through all the records they contain, and saves the content of any record that appears to be an HTML page to a file.
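
If I understand those answers correctly, they target the Nutch 1.x on-disk layout, where each segment stores the fetched pages as Content records in a content/part-NNNNN/data SequenceFile. A dump tool along those lines would look roughly like the sketch below (class names are from the Nutch 1.x and Hadoop APIs; the SegmentHtmlDumper class, the segment path and the file-naming rule are just my guesses, and nothing like this seems to map directly onto the 2.x storage):

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentHtmlDumper {

    public static void main(String[] args) throws Exception {
        // args[0]: a segment directory, e.g. crawl/segments/20170101000000 (hypothetical)
        // args[1]: the local folder to dump the HTML files into
        Configuration conf = new Configuration();
        // Repeat for every part-NNNNN directory if more than one reducer was used
        Path data = new Path(args[0], "content/part-00000/data");
        java.nio.file.Path outDir = Files.createDirectories(Paths.get(args[1]));

        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(data))) {
            Text url = new Text();
            Content content = new Content();
            int count = 0;
            while (reader.next(url, content)) {
                String type = content.getContentType();
                if (type != null && type.contains("html")) {
                    // Derive a file name from the URL and write the raw bytes as-is
                    String name = url.toString().replaceAll("[^A-Za-z0-9.-]", "_") + ".html";
                    Files.write(outDir.resolve(name), content.getContent());
                    count++;
                }
            }
            System.out.println("Dumped " + count + " HTML pages to " + outDir);
        }
    }
}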

These resources all contain (more or less precise) instructions and code examples, but I had no luck when I tried to run them because they refer to very old versions of Nutch. Also, all my attempts to adapt them to Nutch 2.3.1 have failed due to the lack of resources/documentation.

For instance, I appended the following code to the end of the HtmlParser (the core of the parse-html plugin), but all the files that get saved in the specified folder are empty:

String html = root.toString();
// Fall back to the raw fetched bytes if the DOM did not serialise to a String
if (html == null) {
    byte[] bytes = content.getContent();
    try {
        html = new String(bytes, encoding);
    } catch (UnsupportedEncodingException e) {
        LOG.trace(e.getMessage(), e);
    }
}
if (html != null) {
    html = html.trim();
    if (!html.isEmpty()) {
        // Lazily create the dump folder (home folder hardcoded for now)
        if (dumpFolder == null) {
            String currentUsersHomeFolder = System.getProperty("user.home");
            currentUsersHomeFolder = "/Users/stefano";
            dumpFolder = currentUsersHomeFolder + File.separator + "nutch_dump";
            new File(dumpFolder).mkdir();
        }
        try {
            // Derive a file name from the page URL and make sure it ends in .htm(l)
            String filename = base.toString().replaceAll("\\P{LD}", "_");
            if (!filename.toLowerCase().endsWith(".htm") && !filename.toLowerCase().endsWith(".html")) {
                filename += ".html";
            }
            System.out.println(">> " + dumpFolder + File.separator + filename);
            PrintWriter writer = new PrintWriter(dumpFolder + File.separator + filename, encoding);
            writer.write(html);
            writer.close();
        } catch (Exception e) {
            LOG.trace(e.getMessage(), e);
        }
    }
}

With the other approach, instead, I got the following error (which I like because it mentions prolog, but it also puzzles me):

[Fatal Error] data:1:1: Content is not allowed in prolog.

So, before considering downgrading my setup to Nutch 1.x, my question is: has any of you faced this problem with a recent version of Nutch and successfully solved it?

If so, can you share your solution with the community, or at least provide some useful pointers?

Many thanks in advance!


PS: If you wonder how to properly open the Nutch sources in IntelliJ, this answer might point you in the right direction.

Upvotes: 0

Views: 438

Answers (2)

Ahmed Sakr

Reputation: 9

You can save the raw HTML by editing the Nutch code. First, run Nutch in Eclipse by following https://wiki.apache.org/nutch/RunNutchInEclipse

Once Nutch is running in Eclipse, edit the file FetcherReducer.java, add the code below to the output method, and run ant eclipse again to rebuild the class.

Finally, the raw HTML will be added to the reprUrl column in your database.

if (content != null) {
    ByteBuffer raw = fit.page.getContent();
    if (raw != null) {
        // Wrap the fetched bytes without copying the backing array
        ByteArrayInputStream arrayInputStream = new ByteArrayInputStream(
                raw.array(), raw.arrayOffset() + raw.position(), raw.remaining());
        Scanner scanner = new Scanner(arrayInputStream);
        scanner.useDelimiter("\\Z"); // read all scanner content as one String
        String data = "";
        if (scanner.hasNext()) {
            data = scanner.next();
        }
        // Store the raw page in the reprUrl field so it is persisted to the database
        fit.page.setReprUrl(StringUtil.cleanField(data));
        scanner.close();
    }
}

Upvotes: 1

Julien Nioche

Reputation: 4864

Glad you found the video useful. If you just need web pages to train an NLP model, why don't you use the CommonCrawl dataset? It contains billions of pages, is free, and would save you the hassle of large-scale web crawling.

Now, to answer your question: you could write a custom IndexWriter and write the content of the pages to whatever you want. I don't use Nutch 2.x, as I prefer 1.x: it is faster, has more functionality, and is easier to use (to be honest, I actually prefer StormCrawler even more, but I am biased). Nutch 1.x has a WARCExporter class which can generate a data dump in the same WARC format used by CommonCrawl; there is also another class for exporting in various formats.
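
Just to sketch what I mean by "write the content of the pages to whatever you want" (the IndexWriter interface itself has changed between releases, so I'm deliberately leaving out the Nutch-specific plumbing): the part that persists a raw page boils down to something like the snippet below, where the RawHtmlDumper class name, the dump folder and the URL-to-filename rule are all arbitrary choices of mine.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RawHtmlDumper {

    private final Path dumpFolder;

    public RawHtmlDumper(Path dumpFolder) throws IOException {
        this.dumpFolder = Files.createDirectories(dumpFolder);
    }

    /** Writes one page to <dumpFolder>/<sanitised-url>.html and returns the file path. */
    public Path dump(String url, byte[] rawContent) throws IOException {
        // Keep letters, digits, dots and dashes; replace everything else with '_'
        String name = url.replaceAll("[^A-Za-z0-9.-]", "_");
        if (!name.toLowerCase().endsWith(".html")) {
            name += ".html";
        }
        return Files.write(dumpFolder.resolve(name), rawContent);
    }

    public static void main(String[] args) throws IOException {
        RawHtmlDumper dumper = new RawHtmlDumper(Paths.get("nutch_dump"));
        byte[] page = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        System.out.println("Wrote " + dumper.dump("http://example.com/index.html", page));
    }
}

Inside a custom IndexWriter you would call something like dump(url, rawBytes) from its write method, assuming the raw bytes have been propagated to the indexing step (which, depending on the version, may need an extra indexing filter or option).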

Upvotes: 1
