Reputation: 209
By default, Solr accepts XML files. I want to perform search over millions of crawled URLs (HTML).
Upvotes: 3
Views: 9875
Reputation: 3154
You can index downloaded HTML files with Solr very well.
This is the fastest way I found to do my indexing:
curl "http://localhost:8080/solr/update/extract?stream.file=/home/index.html&literal.id=www.google.com"
Here stream.file is the local path of your HTML file and literal.id is the URL the file was crawled from.
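Note that documents only become searchable after a commit. If autocommit is not configured, you can append commit=true to the same request, for example:
curl "http://localhost:8080/solr/update/extract?stream.file=/home/index.html&literal.id=www.google.com&commit=true"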
Upvotes: 1
Reputation: 23098
Solr Cell can accept HTML and index it for full-text search: http://wiki.apache.org/solr/ExtractingRequestHandler
curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "[email protected]"
Upvotes: 1
Reputation: 2746
Usually, as a first step, I would recommend rolling your own application using SolrJ or similar to handle the indexing, rather than doing it directly with the DataImportHandler.
Just write your application and have it output the contents of those web pages as a field in a SolrInputDocument. I recommend stripping the HTML in that application, because it gives you greater control. Besides, you probably want to get at some of the data inside that page, such as <title>, and index it to a different field. An alternative is to use HTMLStripTransformer on one of your fields to make sure it strips HTML out of anything you send to that field.
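For illustration, here is a minimal SolrJ sketch along those lines. It assumes a recent SolrJ client (HttpSolrClient), jsoup for the HTML stripping, and made-up core and field names (mycore, id, title, content) that would need to match your own Solr setup and schema:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlIndexer {
    public static void main(String[] args) throws Exception {
        // Core name and URL are assumptions; point this at your own Solr instance.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        String url = "http://www.example.com/";   // the crawled URL, used here as the unique id
        String html = "<html><head><title>Example</title></head>"
                    + "<body><p>Page body text</p></body></html>"; // in practice, the HTML your crawler fetched

        // Strip the HTML in the application so you control exactly what lands in each field.
        Document page = Jsoup.parse(html);
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", url);                  // hypothetical field names; match your schema
        doc.addField("title", page.title());
        doc.addField("content", page.body().text());

        solr.add(doc);
        solr.commit();
        solr.close();
    }
}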
How are you crawling all this data? If you're using something like Apache Nutch it should already take care of most of this for you, allowing you to just plug in the connection details of your Solr server.
Upvotes: 2