Reputation: 851
I have some background in information retrieval, but I'm not clear on what Solr does when it indexes. I know Solr uses Lucene for indexing. Does Solr build an inverted index over every document, or does it simply index each document by its id? Any explanation would be helpful, or please point me to some articles.
Upvotes: 0
Views: 681
Reputation: 4770
To the best of my knowledge, the ID field is used by Solr because it's good practice to have a unique field per index. Lucene itself does not care about such things.
As far as the inverted indexing goes, the process runs along these lines: each field in a document is analyzed with its designated analyzer. The resulting tokens are then placed in the inverted index in the form field_name:token_value, and the posting list for each entry is updated to contain the new document id (this id is a Lucene-internal number and has nothing to do with the Solr id field; you can read more about it while studying segments). All of the field_name:token_value pairs are stored sorted (more about this later).
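To make the steps above concrete, here is a minimal Python sketch of the idea. It is a toy model, not Lucene's implementation: the `analyze` function (lowercase and split on whitespace) and the sample documents are assumptions made up for illustration, and the internal doc id is simply the position of the document in the input list.

```python
from collections import defaultdict


def analyze(text):
    """Hypothetical stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map 'field_name:token_value' to a list of internal doc ids."""
    postings = defaultdict(list)
    # doc_id is an internal counter, unrelated to the document's own "id" field
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            # set() so each doc appears at most once per term
            for token in set(analyze(value)):
                postings[f"{field}:{token}"].append(doc_id)
    # Keep the term dictionary sorted by key
    return dict(sorted(postings.items()))


docs = [
    {"id": "A-17", "title": "Solr uses Lucene"},
    {"id": "B-42", "title": "Lucene builds segments"},
]
index = build_inverted_index(docs)
```

Looking up `index["title:lucene"]` returns the internal ids of both documents, while the user-facing id values ("A-17", "B-42") are just more indexed terms such as `id:a-17`.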
A pointer to the location of the term frequency and other relevant data is also stored. Since Lucene adopts a read-only policy for written index files, every commit creates a new sub-index (called a segment). This also makes it easy to store the term dictionary sorted, assuming you commit on a regular basis.
On deletes, every segment has a special deletion file (a bitset) that filters the deleted documents out of any matched queries. On merges, according to the merge policy, segments along with their deletion files may disappear, being merged into a new segment.
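A rough Python sketch of how read-only segments, deletion filtering and merging fit together is shown here. The `Segment` class and `merge` function are hypothetical names invented for this example, a Python set stands in for the deletion bitset, and the renumbering scheme is a simplification rather than what Lucene actually does on disk.

```python
class Segment:
    """Toy segment: postings are never modified after creation."""

    def __init__(self, postings):
        self.postings = postings  # term -> list of segment-local doc ids
        self.deleted = set()      # stands in for the per-segment deletion bitset

    def search(self, term):
        # Deleted docs stay in the postings; they are filtered at query time
        return [d for d in self.postings.get(term, []) if d not in self.deleted]


def merge(segments):
    """Build one new segment, dropping deleted docs and renumbering the rest."""
    merged = {}
    next_id = 0
    for seg in segments:
        all_ids = {d for ids in seg.postings.values() for d in ids}
        remap = {}
        for old in sorted(all_ids - seg.deleted):
            remap[old] = next_id
            next_id += 1
        for term, ids in seg.postings.items():
            kept = [remap[d] for d in ids if d in remap]
            if kept:
                merged.setdefault(term, []).extend(kept)
    # The merged segment starts with an empty deletion set
    return Segment(dict(sorted(merged.items())))


seg1 = Segment({"title:lucene": [0, 1], "title:solr": [0]})
seg2 = Segment({"title:lucene": [0]})
seg1.deleted.add(0)  # "delete" doc 0 of seg1 without touching its postings
merged = merge([seg1, seg2])
```

After the delete, `seg1.search("title:solr")` comes back empty even though the posting is still stored; only the merge physically removes it.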
To get a real feel for how the terms look inside the files, and to better understand Lucene's file format, read this post about a human-readable text codec: http://blog.mikemccandless.com/2010/10/lucenes-simpletext-codec.html , and this page to learn more about Lucene's file format: http://lucene.apache.org/core/4_8_1/core/org/apache/lucene/codecs/lucene46/package-summary.html#package_description
Hope this helps
Upvotes: 1
Reputation: 27487
Indexing in Solr is controlled on a field-by-field basis. For a specific field to be indexed (i.e. to have an inverted index created for it), the option "indexed" should be set to "true". How it is indexed (as a text field, a non-analyzed string, a date, a number, etc.) is a function of the field's type.
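As a loose illustration of that per-field behaviour, here is a small Python model. It is not Solr's schema format or API: the `SCHEMA` dictionary, the field names and the `tokens_for` helper are all made up for this sketch, and only two field types are modelled.

```python
# Hypothetical per-field settings, loosely mirroring "type" and "indexed"
SCHEMA = {
    "id":    {"type": "string", "indexed": True},
    "title": {"type": "text",   "indexed": True},
    "blob":  {"type": "text",   "indexed": False},
}


def tokens_for(field, value):
    """Return the terms that would enter the inverted index for one field."""
    spec = SCHEMA[field]
    if not spec["indexed"]:
        return []  # field is not indexed, so it produces no terms
    if spec["type"] == "text":
        return value.lower().split()  # analyzed: one term per word
    return [value]  # non-analyzed string: the whole value is a single term
```

With these settings, `tokens_for("title", "Solr Uses Lucene")` yields three lowercase terms, `tokens_for("id", "A-17")` yields the value unchanged as one term, and the "blob" field contributes nothing to the index.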
I've included a link that should be helpful:
http://www.solrtutorial.com/basic-solr-concepts.html
Upvotes: 0