theyetiman

Reputation: 8888

How do you configure Lucene in Sitecore to only index the latest version of an item on the master db?

I recognise this is a moot point on the web database, so this question applies to the master db...

I have a custom index set up in Sitecore 6.4.1 as follows:

<index id="search_content_US" type="Sitecore.Search.Index, Sitecore.Kernel">
    <param desc="name">$(id)</param>
    <param desc="folder">_search_content_US</param>
    <Analyzer ref="search/analyzer" />
    <locations hint="list:AddCrawler">
        <search_content_home type="Sitecore.Search.Crawlers.DatabaseCrawler, Sitecore.Kernel">
            <Database>master</Database>
            <Root>/sitecore/content/usa home</Root>
            <Tags>home content</Tags>
        </search_content_home>
    </locations>
</index>

I query the index like this (I am using techphoria414's SortableIndexSearchContext from this answer: How to sort/filter using the new Sitecore.Search API):

private SearchHits GetSearchResults(SortableIndexSearchContext searchContext, string searchTerm)
{
    CombinedQuery query = new CombinedQuery();
    query.Add(new FullTextQuery(searchTerm), QueryOccurance.Must);
    return searchContext.Search(query, Sort.RELEVANCE);
}

...

SearchHits hits = GetSearchResults(searchContext, searchTerm);

hits is a collection of search hits from my index. When I iterate through hits I can see many duplicates of the same Sitecore item, one per version of the item.

I then do the following to get a SearchResultCollection:

SearchResultCollection results = hits.FetchResults(0, hits.Length);

This combines all of the duplicates into a single SearchResult object. This object represents 1 version of a particular item, and has a property called SubResults which is a collection of SearchResults that represent all of the other item versions.

Here's my problem:

The version of the item represented by the SearchResult is NOT the current published version of the item! It appears to be a randomly selected version (whichever the search method hit first in the index). The latest version is included in the SubResults collection, however.

E.g.:

SearchResult
 |
 |- Version 8 // main result
 ...
 |- SubResults
      |
      |- Version 9 // latest version
      |- Version 3
      |- Version 5
      ... // all versions in random order

How do I prevent this from happening on the master db? Either by preventing Lucene from indexing old versions of items, or by doing some manipulation of the result set to get the latest version from the SubResults?

As an aside, why does Lucene bother to index old versions of items anyway? Surely this is pointless for searching content on your website as the old versions are not visible?

Upvotes: 6

Views: 4986

Answers (6)

Fahad

Reputation: 41

I ended up with an alternative to the answers above.

Architecturally speaking, I think the ideal solution is to filter out the older versions in custom code at a higher level, rather than removing them from the master database index altogether. You don't want to change the way Sitecore is designed to work just to solve the problem at hand.

With the ContentSearch API (Sitecore 7 and later), use the predicate below to filter out older versions and retrieve only the latest one. Note that And returns a new expression, so the result has to be assigned back:

predicate = predicate.And(item => item[Sitecore.ContentSearch.BuiltinFields.LatestVersion].Equals("1"));

Hope this helps someone!

Upvotes: 0

Stijn De Vos

Reputation: 311

In Sitecore 7 a field named _latestversion was added to the index. It contains '1' for the latest version of an item; for all other versions it is empty.

Upvotes: 8

Andrew Burgess

Reputation: 5298

You can implement a custom crawler that overrides the following:

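using Sitecore.Data.Items;
using Sitecore.Search;
using Sitecore.Search.Crawlers;

public class IndexCrawler : DatabaseCrawler
{
    protected override void IndexVersion(Item item, Item latestVersion, IndexUpdateContext context)
    {
        // Skip every version except the latest one, so older versions never reach the index.
        if (item.Versions.Count > 0 && item.Version.Number != latestVersion.Version.Number)
            return;

        base.IndexVersion(item, latestVersion, context);
    }
}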

This ensures that only the latest version of an item gets into your index, so it is the only version that can be pulled out of that index.

You would of course need to update your configuration file to set the crawler type to this class.
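As a rough sketch of that configuration change, based on the crawler definition in the question: only the type attribute of the crawler element changes. The namespace MyProject.Search and the assembly name MyProject are placeholders for wherever the IndexCrawler class actually lives.

```xml
<locations hint="list:AddCrawler">
    <!-- type now points at the custom crawler; namespace and assembly are assumed names -->
    <search_content_home type="MyProject.Search.IndexCrawler, MyProject">
        <Database>master</Database>
        <Root>/sitecore/content/usa home</Root>
        <Tags>home content</Tags>
    </search_content_home>
</locations>
```

The index has to be rebuilt afterwards, since versions that were already indexed stay in it until then.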

Upvotes: 10

Rian

Reputation: 146

Although the solution provided by theyetiman, by using an adjusted sort mechanism, is an interesting approach, it does not provide a perfect solution when the Lucene result scores for the two versions tend to differ. E.g. out of v1 with score 0.7, and v2 with score 0.5, his solution will still return the first version of the item. (At least in my tests.)

After some more digging, the most obvious solution apparently lies in implementing your own Sitecore.Pipelines.Search.SearchSystemIndex and using that one instead of the default. If you decompile that code using ILSpy or similar, you will notice the following at the bottom of the Process method:

foreach (SearchResult current in searchHits.FetchResults(0, searchHits.Length)){
  // ...
}

Each such SearchResult is actually a grouping, in which the first result returned from Lucene (and thus the one with the highest score) is the main result. Hits on other versions (and also other languages) of the same item are accessible through the Subresults property of each instance, which is null when there are none.

Depending on your requirements, you can adjust this part of the class to fit your needs.

Upvotes: 2

theyetiman

Reputation: 8888

Whilst I haven't figured out the exact answer (to stop Lucene indexing old versions on the master db) I have come up with an acceptable work-around...

When Lucene returns its results from the index, each hit has a field called "_id" which is formatted something like this (3 versions of the same item, where the last number is the version):

"CCB75380-4E9A-4921-99EC-65E532E330FF%en%1"
"CCB75380-4E9A-4921-99EC-65E532E330FF%en%2"
"CCB75380-4E9A-4921-99EC-65E532E330FF%en%3"
...

I'm currently sorting by Sort.RELEVANCE, which is the default. That would be fine if there were only one version of an item in the index, but with several almost identical versions they all get the same relevance score and Lucene returns them in no particular order. Sitecore then takes the first version it sees, even if it's an old one.

The solution is to specify a secondary sort field. In the searchContext.Search() method, you can pass a custom Sort object.

searchContext.Search(query, new Sort(...));

By sorting by Lucene's built-in Sort.RELEVANCE first, and then by the _id field (descending) in the index, I can ensure that the first hit Sitecore sees is the latest version and not just a random one:

searchContext.Search(query, new Sort(
    new SortField[]
    {
        SortField.FIELD_SCORE,                        // equivalent to Sort.RELEVANCE
        new SortField("_id", SortField.STRING, true)  // then by _id, descending
    }
));

The SortField parameters are as follows:

SortField(string fieldName, int type, bool reverse)

This approach has fixed my problem, but if anyone can actually find out how to only index the latest version, please answer!
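One caveat with the secondary sort: SortField.STRING compares _id values as text, so an _id ending in %en%9 sorts above one ending in %en%10 when descending, and the tie-break only takes effect when the scores are exactly equal. A minimal sketch of an alternative follows, assuming the _id values have already been read from the hits into plain strings (that step is left out). It parses the version numerically and keeps the highest one per item and language. LatestVersionPicker and its methods are hypothetical names, not part of Sitecore or Lucene.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LatestVersionPicker
{
    // Returns the "ID%language" part of an _id such as "GUID%en%3".
    public static string ItemKey(string id)
    {
        int cut = id.LastIndexOf('%');
        if (cut < 0)
            throw new FormatException("Unexpected _id format: " + id);
        return id.Substring(0, cut);
    }

    // Returns the trailing version number as an integer, so that 10 ranks above 9.
    public static int Version(string id)
    {
        int cut = id.LastIndexOf('%');
        if (cut < 0)
            throw new FormatException("Unexpected _id format: " + id);
        return int.Parse(id.Substring(cut + 1));
    }

    // Keeps one _id per item and language: the one with the highest version number.
    // Groups come back in the order their first member appeared in the input.
    public static List<string> LatestPerItem(IEnumerable<string> ids)
    {
        return ids
            .GroupBy(id => ItemKey(id))
            .Select(group => group.OrderByDescending(id => Version(id)).First())
            .ToList();
    }
}
```

For the three example _id values above, LatestPerItem returns only the one ending in %en%3. Because the selection ignores the score, it also covers the case where versions of the same item score differently.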

Upvotes: 0

Martijn van der Put

Reputation: 4082

If you point the crawler at the web database instead of master, it should only index the last published version of each item.

<Database>web</Database>

Upvotes: 7
