gauravmunjal

Reputation: 209

How do I index HTML files into Apache Solr?

By default, Solr accepts XML files; I want to perform search on millions of crawled URLs (HTML pages).

Upvotes: 3

Views: 9875

Answers (3)

Amey Jadiye

Reputation: 3154

You can index a downloaded HTML file with Solr just fine.

This is the fastest way I found to do my indexing:

curl "http://localhost:8080/solr/update/extract?stream.file=/home/index.html&literal.id=www.google.com"

Here stream.file is the local path of your HTML file, and literal.id is the URL that index.html was crawled from, used as the unique document id. (Note the quotes around the URL: without them, the shell would treat everything after & as a separate background command.)
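For millions of crawled pages you would script this rather than run curl by hand. A minimal Python sketch of the same request, using only the standard library (the host/port and core path follow the example above; build_extract_url is a hypothetical helper name, and both parameters are URL-encoded so slashes and ampersands in paths or ids cannot break the query string):

```python
import urllib.parse

# Assumed Solr endpoint, matching the curl example above.
SOLR_EXTRACT = "http://localhost:8080/solr/update/extract"

def build_extract_url(local_path, doc_id):
    # URL-encode both parameters; the id is often itself a URL,
    # so characters like '/' and '&' must be escaped.
    params = urllib.parse.urlencode({
        "stream.file": local_path,
        "literal.id": doc_id,
    })
    return SOLR_EXTRACT + "?" + params

# Example usage against a running Solr (network calls commented out):
# import urllib.request
# urllib.request.urlopen(build_extract_url("/home/index.html", "www.google.com"))
# urllib.request.urlopen("http://localhost:8080/solr/update?commit=true")
```

After posting a batch of files you would issue a single commit, as sketched in the last commented line, rather than committing per document.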

Upvotes: 1

Jesvin Jose

Reputation: 23098

Solr Cell (the ExtractingRequestHandler) can accept HTML files and index them for full-text search: http://wiki.apache.org/solr/ExtractingRequestHandler

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"

Upvotes: 1

Joel Westberg

Reputation: 2746

As a first step, I would usually recommend rolling your own application using SolrJ or similar to handle the indexing, rather than doing it directly with the DataImportHandler.

Just write your application and have it output the contents of those web pages as a field in a SolrInputDocument. I recommend stripping the HTML in that application, because it gives you greater control. Besides, you probably want to get at some of the data inside that page, such as <title>, and index it to a different field. An alternative is to use HTMLStripTransformer on one of your fields to make sure HTML is stripped out of anything that you send to that field.
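The answer itself suggests Java/SolrJ, but the idea (strip the markup, pull <title> into its own field, and build a per-document set of fields) can be sketched with Python's standard library alone. The field names below are illustrative, and to_solr_fields stands in for building a SolrInputDocument:

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collect the <title> text and the visible body text of an HTML page."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip = False  # true while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if self._skip:
            return
        if self._in_title:
            self.title += data
        else:
            self.text_parts.append(data)

def to_solr_fields(html, url):
    """Build a dict of fields, analogous to populating a SolrInputDocument."""
    p = PageExtractor()
    p.feed(html)
    return {
        "id": url,                      # the crawled URL as the unique key
        "title": p.title.strip(),       # indexed to its own field
        "content": " ".join(t.strip() for t in p.text_parts if t.strip()),
    }
```

In a real indexer you would then hand these fields to your Solr client (addField calls on a SolrInputDocument in SolrJ, or a JSON POST to /update); the point is that the stripping and field extraction happen in your code, where you control them.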

How are you crawling all this data? If you're using something like Apache Nutch it should already take care of most of this for you, allowing you to just plug in the connection details of your Solr server.

Upvotes: 2
