Reputation: 31

How to extract the value of a specific div in HTML when crawling with Apache Nutch?

I am crawling with Nutch 2.2, and the only data I retrieve is the meta tags. How can I extract the value of a specific div in the HTML while crawling with Apache Nutch?

Upvotes: 2

Answers (2)

Sachin

Reputation: 1715

You will have to write a plugin that implements HtmlParseFilter to achieve your goal.

You can use an HTML parser such as Jsoup for this: select the elements you want, extract their URLs, and add them as outlinks.

Sample HtmlParseFilter implementation:

        // This is the HtmlParseFilter signature from Nutch 1.x.
        // Needs jsoup on the plugin classpath, plus imports from org.jsoup (Jsoup,
        // nodes.Document, nodes.Element, select.Elements), org.apache.nutch.parse,
        // org.apache.nutch.protocol.Content, org.w3c.dom.DocumentFragment,
        // java.net.MalformedURLException, java.nio.charset.StandardCharsets and java.util.
        public ParseResult filter(Content content, ParseResult parseResult,
              HTMLMetaTags metaTags, DocumentFragment doc) {
            // Decode the raw page bytes (UTF-8 is assumed here)
            String htmlContent = new String(content.getContent(), StandardCharsets.UTF_8);
            // Parse the HTML with jsoup; the base URL lets absUrl() resolve relative links
            Document document = Jsoup.parse(htmlContent, content.getUrl());
            // Replace the placeholder with your own CSS selector
            Elements elements = document.select("<your_css_selector_query>");
            // Keep only the required outlinks, without duplicates
            List<String> newLinks = new ArrayList<String>();
            List<Outlink> outLinks = new ArrayList<Outlink>();
            for (Element element : elements) {
                String absoluteUrl = element.absUrl("href");
                // includeLinks(...) and value are the answerer's own helper and field (not shown)
                if (includeLinks(absoluteUrl, value) && !newLinks.contains(absoluteUrl)) {
                    try {
                        outLinks.add(new Outlink(absoluteUrl, element.text()));
                        newLinks.add(absoluteUrl);
                    } catch (MalformedURLException e) {
                        // Skip links that are not valid URLs
                    }
                }
            }
            Parse parse = parseResult.get(content.getUrl());
            ParseStatus status = parse.getData().getStatus();
            String title = document.title();
            Outlink[] newOutLinks = outLinks.toArray(new Outlink[outLinks.size()]);
            ParseData parseData = new ParseData(status, title, newOutLinks,
                    parse.getData().getContentMeta(), parse.getData().getParseMeta());
            // The text of the selected elements becomes the parse text
            parseResult.put(content.getUrl(), new ParseText(elements.text()), parseData);
            // Return parseResult with modified outlinks
            return parseResult;
        }
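
The sample calls `includeLinks(absoluteUrl, value)` without showing it. One possible shape for such a helper is sketched here. This is only a guess: it assumes `value` is a URL prefix to keep, which the answer does not state, and the class name `LinkSelector` and method `selectLinks` are hypothetical. It also uses a `LinkedHashSet` to drop duplicates while preserving order, in place of the `List.contains` check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LinkSelector {

    // Hypothetical helper: keep a link only if it is non-empty and starts with the prefix.
    public static boolean includeLinks(String absoluteUrl, String prefix) {
        return absoluteUrl != null && !absoluteUrl.isEmpty() && absoluteUrl.startsWith(prefix);
    }

    // Filter candidate URLs and remove duplicates, keeping first-seen order.
    public static List<String> selectLinks(List<String> candidates, String prefix) {
        Set<String> unique = new LinkedHashSet<>();
        for (String url : candidates) {
            if (includeLinks(url, prefix)) {
                unique.add(url);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Sample input: one duplicate, one empty URL (as absUrl() returns for
        // unresolvable links), and one URL outside the wanted prefix.
        List<String> links = Arrays.asList(
                "http://example.com/a", "http://example.com/a", "", "http://other.org/b");
        System.out.println(selectLinks(links, "http://example.com/"));
    }
}
```

The empty-string check matters because jsoup's `absUrl` returns an empty string when it cannot build an absolute URL.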

Build the new plugin with ant and add it to plugin.includes in nutch-site.xml:

<property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|<custom_plugin>|urlfilter-regex|parse-(tika|html|js|css)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
</property>

Then, in parser-plugins.xml, map the HTML MIME types to your custom plugin instead of the default parser plugin, for example:

<!--
    <mimeType name="text/html">
        <plugin id="parse-html" />
    </mimeType>

    <mimeType name="application/xhtml+xml">
        <plugin id="parse-html" />
    </mimeType>
-->

    <mimeType name="text/xml">
        <plugin id="parse-tika" />
        <plugin id="feed" />
    </mimeType>

    <mimeType name="text/html">
        <plugin id="<custom_plugin>" />
    </mimeType>

    <mimeType name="application/xhtml+xml">
        <plugin id="<custom_plugin>" />
    </mimeType>

Upvotes: 3

Abhishek Ramachandran

Reputation: 1170

You need to write your own parse filter and use a Jsoup selector to select the particular div.
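
To illustrate picking out one div, here is a minimal sketch. It swaps Jsoup for the standard `org.w3c.dom` API so that it runs without extra dependencies; with Jsoup the equivalent would be a selector such as `document.select("div.price")`. The class name `DivExtractor`, the method `findDivText`, and the sample markup are hypothetical. The snippet builds its DOM with `DocumentBuilder` only to be runnable on its own; inside a parse filter you would presumably walk the `DocumentFragment` that the filter method receives.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DivExtractor {

    // Depth-first search for the first <div> whose class attribute equals cssClass.
    // Returns its trimmed text content, or null if no such div exists.
    public static String findDivText(Node node, String cssClass) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // Compare the tag name case-insensitively, since HTML parsers may upper-case it
            if ("div".equalsIgnoreCase(el.getNodeName())
                    && cssClass.equals(el.getAttribute("class"))) {
                return el.getTextContent().trim();
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = findDivText(children.item(i), cssClass);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Well-formed sample markup (hypothetical), parsed here as XML
        String xhtml = "<html><body><div class=\"menu\">Home</div>"
                + "<div class=\"price\"> 42 EUR </div></body></html>";
        Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(findDivText(root, "price"));
    }
}
```

This matches the class attribute exactly; a div with several classes would need the attribute split on whitespace first, which is one reason a selector library is more convenient.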

Upvotes: 2
