boredmgr

Reputation: 261

Meaning of "parse-(type1|type2)" under the plugin-includes header of the nutch-site.xml file

In nutch-site.xml, under the plugin.includes property, what does it mean when I write parse-(type1|type2)?

Does it mean that, for each URL Nutch fetches, Nutch first parses the content with the type1 parser and then invokes the type2 parser in sequence?

Upvotes: 1

Views: 600

Answers (1)

mana

Reputation: 6547

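Not quite. The value of plugin.includes is a regular expression matched against plugin ids, so parse-(type1|type2) enables both the parse-type1 and the parse-type2 plugin. That does not mean both parsers run on every URL: each parse plugin is assigned a content type, or a set of content types (the mapping lives in conf/parse-plugins.xml), and Nutch chooses the parser from the content type of the fetched document. For example, the parse-pdf plugin will not parse msword documents.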
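To see which plugins an expression like parse-(type1|type2) selects, here is a minimal sketch in Python (Nutch itself is Java, so this only imitates the idea). It assumes the plugin.includes value is a regular expression that must match a whole plugin id; the includes string, the plugin list, and the helper name `enabled_plugins` are all made up for illustration.

```python
import re

# Hypothetical plugin.includes value, in the style of the question.
PLUGIN_INCLUDES = r"protocol-http|parse-(html|pdf)|index-basic"

# Hypothetical ids of the plugins found in the plugins directory.
AVAILABLE_PLUGINS = [
    "protocol-http",
    "parse-html",
    "parse-pdf",
    "parse-msword",
    "index-basic",
    "index-more",
]


def enabled_plugins(includes, plugin_ids):
    """Return the ids that the regex matches as a whole string."""
    pattern = re.compile(includes)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]


print(enabled_plugins(PLUGIN_INCLUDES, AVAILABLE_PLUGINS))
```

In this sketch parse-html and parse-pdf are both selected, while parse-msword is left out because the expression does not name it. Selecting a plugin only makes it available; which parser handles a given document is a separate step.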

Upvotes: 1
