Reputation: 31
I am crawling with Nutch 2.2, and the only data I retrieve is the meta tags. How can I extract the value of a specific div in the HTML while crawling with Apache Nutch?
Upvotes: 2
Views: 867
Reputation: 1715
You will have to write a plugin that implements HtmlParseFilter to achieve your goal.
You can use an HTML parser such as Jsoup for this: select the elements you want, extract their URLs, and add them as outlinks.
Sample HtmlParseFilter implementation:
// Jsoup, Document, Element and Elements come from org.jsoup; Outlink, Parse,
// ParseData, ParseResult, ParseStatus and ParseText from org.apache.nutch.parse.
public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
    // Get the HTML content (assumes the page is UTF-8 encoded)
    String htmlContent = new String(content.getContent(), StandardCharsets.UTF_8);
    // Parse the HTML using Jsoup or any other library; the base URL lets absUrl() resolve relative links
    Document document = Jsoup.parse(htmlContent, content.getUrl());
    Elements elements = document.select("<your_css_selector_query>");
    // Modify/select only the required outlinks (select() never returns null, so test for empty)
    if (!elements.isEmpty()) {
        List<String> newLinks = new ArrayList<String>();
        List<Outlink> outLinks = new ArrayList<Outlink>();
        for (Element element : elements) {
            String absoluteUrl = element.absUrl("href");
            // includeLinks(...) and value are placeholders for your own filtering rule
            if (includeLinks(absoluteUrl, value) && !newLinks.contains(absoluteUrl)) {
                try {
                    outLinks.add(new Outlink(absoluteUrl, element.text()));
                    newLinks.add(absoluteUrl);
                } catch (MalformedURLException e) {
                    // Skip links that cannot be turned into a valid outlink
                }
            }
        }
        Parse parse = parseResult.get(content.getUrl());
        ParseStatus status = parse.getData().getStatus();
        String title = document.title();
        Outlink[] newOutLinks = outLinks.toArray(new Outlink[outLinks.size()]);
        ParseData parseData = new ParseData(status, title, newOutLinks,
                parse.getData().getContentMeta(), parse.getData().getParseMeta());
        // The text of the selected elements becomes the parse text
        parseResult.put(content.getUrl(), new ParseText(elements.text()), parseData);
    }
    // Return parseResult with the modified outlinks
    return parseResult;
}
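The includeLinks(absoluteUrl, value) call in the sample is not part of Nutch or Jsoup; it stands for whatever filtering rule you need. A minimal sketch of one possible version, assuming value is a regular expression that accepted URLs must match in full (the class name LinkFilter and this reading of value are assumptions, not something the sample defines):

```java
import java.util.regex.Pattern;

public class LinkFilter {
    // Hypothetical helper: returns true when the URL is non-empty and
    // matches the given regular expression in full.
    public static boolean includeLinks(String absoluteUrl, String value) {
        if (absoluteUrl == null || absoluteUrl.isEmpty()) {
            // Jsoup's absUrl() returns "" when a link cannot be resolved
            return false;
        }
        return Pattern.matches(value, absoluteUrl);
    }

    public static void main(String[] args) {
        // Example rule (made up): keep only article pages on example.com
        String rule = "https?://example\\.com/articles/.*";
        System.out.println(includeLinks("http://example.com/articles/42", rule));
        System.out.println(includeLinks("http://example.com/login", rule));
    }
}
```

Compiling the pattern once and reusing it would be cheaper on large crawls; the static call is used here only to keep the sketch short.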
Build the new plugin with ant and add it to plugin.includes in nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|<custom_plugin>|urlfilter-regex|parse-(tika|html|js|css)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
</property>
Then, in parse-plugins.xml, you can map the HTML MIME types to your custom plugin instead of the default parser plugin, with something like this:
<!--
<mimeType name="text/html">
<plugin id="parse-html" />
</mimeType>
<mimeType name="application/xhtml+xml">
<plugin id="parse-html" />
</mimeType>
-->
<mimeType name="text/xml">
<plugin id="parse-tika" />
<plugin id="feed" />
</mimeType>
<mimeType name="text/html">
<plugin id="<custom_plugin>" />
</mimeType>
<mimeType name="application/xhtml+xml">
<plugin id="<custom_plugin>" />
</mimeType>
Upvotes: 3
Reputation: 1170
You need to implement your own parse filter and use a Jsoup selector to select the particular div.
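As a rough illustration of the selector step, the sketch below pulls the text of one div out of an HTML string with Jsoup. The sample markup, the div#price selector and the class name DivExtractor are made up for the example; inside a Nutch parse filter the HTML would come from the fetched page content instead.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DivExtractor {
    // Returns the text of the first element matching the CSS selector,
    // or null when nothing matches.
    public static String extractText(String html, String cssSelector) {
        Document document = Jsoup.parse(html);
        Element element = document.select(cssSelector).first();
        return element == null ? null : element.text();
    }

    public static void main(String[] args) {
        // Made-up sample page with two divs
        String html = "<html><body>"
                + "<div id=\"title\">Widget</div>"
                + "<div id=\"price\">19.99 USD</div>"
                + "</body></html>";
        System.out.println(extractText(html, "div#price"));
    }
}
```

Selecting by id (div#price) or by class (div.price) covers most cases; the extracted string can then be stored in the parse metadata or used as the parse text.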
Upvotes: 2