Reputation: 1432
I have a Lucene index in which each Lucene document is a single page, keyed by pageId (the unique key), and one real document can have multiple pages. When a user performs a search, it returns the pages that match the search criteria.
I am using Lucene.Net 2.9.2.
We have two problems.
1- The index is around 800 GB and holds 130 million rows (pages), so searching was really slow: every query took more than a minute, even though we only have to return a limited number of rows at a time.
To overcome the performance issue I moved to Solr, which resolved it. That is quite strange, as I am not using any of the extra functionality Solr provides, such as sharding. Could it be that Lucene.Net 2.9.2 does not really match the performance of the same version of Java Lucene? In any case, I now have another issue.
2- Each Lucene document is one page, but I want to show results grouped by 'real documents'. The number of results returned should be configurable in terms of 'real documents', not 'pages', because that is how I want to present them to the user.
So let's say I want 20 'real documents' and ALL the pages in them that match the search criteria (it doesn't matter if one document has 100 pages and another just 1).
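To make the desired result shape concrete, here is a minimal sketch of the grouping done on the client side. It assumes the search returns ranked hits as `(doc_id, page_id)` pairs; those names and the function `group_pages_by_document` are hypothetical and not part of Lucene or Solr. It only illustrates the semantics I am after, not how field collapsing is implemented.

```python
from collections import OrderedDict

def group_pages_by_document(page_hits, max_documents):
    """Group ranked page hits by their parent 'real document'.

    page_hits: iterable of (doc_id, page_id) tuples in relevance order
               (hypothetical shape, not a Lucene/Solr structure).
    Returns an OrderedDict mapping the first `max_documents` distinct
    doc_ids, in order of first appearance, to all their matching pages.
    """
    grouped = OrderedDict()
    for doc_id, page_id in page_hits:
        if doc_id in grouped:
            # Document already selected: keep every matching page.
            grouped[doc_id].append(page_id)
        elif len(grouped) < max_documents:
            # New document and the limit is not reached yet.
            grouped[doc_id] = [page_id]
        # Hits for documents beyond the limit are skipped.
    return grouped

hits = [("A", 3), ("B", 1), ("A", 7), ("C", 2), ("B", 4)]
result = group_pages_by_document(hits, 2)
```

Note that this has to walk every hit to collect all pages of the selected documents, which is exactly why doing it outside the search engine would be expensive on an index of this size.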
From what I could gather on the Solr forums, this can be achieved with the SOLR-236 patch (field collapsing), but I have not been able to apply the patch correctly to trunk (it gives lots of errors).
This is really important for me and I don't have much time, so can someone please either send me a Solr 1.4.1 binary with this patch applied, or tell me if there is any other way?
I would really appreciate it. Thanks!!
Upvotes: 1
Views: 652
Reputation: 8560
You could also have a look at SOLR-1682 (Implement CollapseComponent). I haven't tested it yet, but as far as I know it solves the collapsing problem too.
Upvotes: 0
Reputation: 4218
If your only requirement is to show page numbers, I would suggest playing with the highlighter or doing some custom development. You can index each real document as a single Lucene document and store the word number of the start and end of each page in a custom structure; knowing the position of the matched word in the whole document, you can then work out which page it appears on. If the documents are very large, you will get a good performance improvement.
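A minimal sketch of that lookup, assuming the custom structure is simply a sorted list of the word offsets at which each page begins (the end of a page is then implied by the start of the next one). The name `page_for_word` and the sample offsets are hypothetical; how you obtain the matched word position from the highlighter or term positions is not shown.

```python
from bisect import bisect_right

def page_for_word(page_starts, word_position):
    """Return the 1-based page number containing a word position.

    page_starts: sorted list of word offsets where each page begins,
                 with page_starts[0] == 0 (hypothetical structure).
    word_position: 0-based position of the matched word in the document.
    """
    if word_position < 0:
        raise ValueError("word_position must be non-negative")
    # Number of page starts that are <= word_position is the page number.
    return bisect_right(page_starts, word_position)

# Pages start at words 0, 250, 510 and 800.
page_starts = [0, 250, 510, 800]
first = page_for_word(page_starts, 249)
second = page_for_word(page_starts, 250)
last = page_for_word(page_starts, 900)
```

The lookup is a binary search, so it stays cheap even for documents with thousands of pages.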
Upvotes: 0
Reputation: 99750
If you have issues with the collapse patch, then the Solr issue tracker is the channel to report them. I can see that other people are currently having some issues with it, so I suggest getting involved in its development.
That said, if your application needs to search for 'real documents', I recommend building your index around these 'real documents' rather than their individual pages.
Upvotes: 0