Snoopy

Reputation: 3

How to index crawled "html" from Apache Nutch to Solr?

I want to index the raw HTML source of the web pages I crawl with Apache Nutch (v1.17) into Solr (8.6.3), but I don't know how to do that. So far, only the parsed text ends up in the Solr "content" field (see below).

{
  "tstamp":"2020-11-19T08:41:15.908Z",
  "digest":"fdc7532e799d4a3a434be4be67c36bb3b",
  "boost":1.0,
  .
  .
  .
  "content":"Algorithm Engineering Group ....",
  "_version_":16837969286885539843
}

I have already looked at index-writers.xml, but I still can't see how to configure this. Does anyone know how to do it?

Upvotes: 0

Views: 254

Answers (1)

Sebastian Nagel

Reputation: 2239

The Nutch index tool provides a command-line option to index the raw content of web pages:

$> bin/nutch index
...
-addBinaryContent  index raw/binary content in field `binaryContent`
-base64            use Base64 encoding for binary content
...
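As a rough sketch, a full invocation with these options could look like the following. The crawl directory layout (crawl/crawldb, crawl/linkdb, crawl/segments) is an assumption for illustration, and the Solr endpoint is assumed to be configured in conf/index-writers.xml rather than passed on the command line. The Solr schema presumably also needs a field that can hold `binaryContent`.

```shell
# Hypothetical crawl directory; adjust to your own layout.
CRAWL_DIR="crawl"

# Arguments for the index tool: crawldb, linkdb, all segments,
# plus the two options that add the raw page content as Base64.
NUTCH_INDEX_ARGS="index $CRAWL_DIR/crawldb -linkdb $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments -addBinaryContent -base64"

# Show the command that would be run.
echo "bin/nutch $NUTCH_INDEX_ARGS"

# Run it only when called from a Nutch runtime directory.
if [ -x bin/nutch ]; then
  # Unquoted on purpose so the string splits into separate arguments.
  bin/nutch $NUTCH_INDEX_ARGS
fi
```

Outside a Nutch runtime directory the script only prints the command, which makes it easy to check the arguments before indexing.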

Note: the crawler may also fetch PDFs and other binary formats, and their raw content will be indexed in the same field!
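When -base64 is used, the stored field value has to be decoded to get the HTML back. A minimal sketch, assuming standard Base64 and the GNU coreutils `base64` tool; the sample string is made up and only stands in for a value copied from the `binaryContent` field:

```shell
# Made-up sample standing in for a value read from the binaryContent field.
ENCODED=$(printf '%s' '<html><body>Hello</body></html>' | base64)

# Decode the Base64 text back into the raw page source.
DECODED=$(printf '%s' "$ENCODED" | base64 -d)

echo "$DECODED"
```

For binary documents such as PDFs, it is safer to redirect the decoded bytes to a file than to print them.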

Upvotes: 1
