BennoDual

Reputation: 6259

Search over two Lucene-Documents

I use Lucene.NET in my project, and I have a slightly tricky situation. I have two entities:

public class Dash {
  public int Id { get; set; }
  public string Description { get; set; }
  public int ActivityId { get; set; }
  public string Username { get; set; }
}

public class Activity {
  public int Id { get; set; }
  public string Subject { get; set; }
}

I store each Activity and each Dash as its own Document in the Lucene index.

Now, I can search for Dash-Entries like

+Description:"Appointment" +Username:"mm"

or Activity-Entries like

+Subject:"Appointment-Invitation"

Now I have to search for Dash entries across both document types. For example, I have to find all Dash entries that belong to the Username "mm" and either have the string "Appointment" in the Description or have an associated Activity with "Appointment" in the Subject. In (pseudo) SQL this would be:

... where Dash.UserName = 'mm' and (Dash.Description like 'Appointment%' or Dash.Activity.Subject like 'Appointment%')

Can someone help me with how to do this in Lucene.NET? Perhaps I have to store the documents in the Lucene.NET index in a different way?

Upvotes: 1

Views: 188

Answers (1)

AndyPook

Reputation: 2889

You should take care when putting different entity types into the same index.

If you search for "id:1" how do you know if you've retrieved a Dash or an Activity?

Either:

  • ensure field names are unique, e.g. "dash_id", "activity_id"
  • add a "_type" field and add "_type:dash" or "_type:activity" as a filter to the search

You cannot do your "join" in a single query, at least not with the current Lucene.NET (3.0.3).

Lucene is a document datastore, in some ways like a key-value store. Each doc is "just a bunch of fields".

You can just query for each entity and then use LINQ to join the two collections. This can be quite inefficient and memory intensive, though; it all depends on how many results you expect. If the number is low, this is probably the simplest approach.
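As a rough illustration of the in-memory approach, here is a minimal sketch that assumes both result sets have already been loaded from the index into plain objects. `DashSearch` and `FindDashes` are hypothetical names, and a hash-set lookup stands in for a literal LINQ `join` because the question asks for an OR across both entities.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Dash
{
    public int Id { get; set; }
    public string Description { get; set; }
    public int ActivityId { get; set; }
    public string Username { get; set; }
}

public class Activity
{
    public int Id { get; set; }
    public string Subject { get; set; }
}

public static class DashSearch
{
    // Hypothetical helper: combines already-loaded search results in memory.
    public static List<Dash> FindDashes(
        IEnumerable<Dash> dashes, IEnumerable<Activity> activities,
        string username, string prefix)
    {
        // Ids of all activities whose Subject starts with the prefix.
        var matchingActivityIds = new HashSet<int>(
            activities
                .Where(a => a.Subject != null &&
                            a.Subject.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
                .Select(a => a.Id));

        // Keep the user's Dash entries that match directly or via their Activity.
        return dashes
            .Where(d => d.Username == username)
            .Where(d => (d.Description != null &&
                         d.Description.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
                        || matchingActivityIds.Contains(d.ActivityId))
            .ToList();
    }
}
```

Both collections are held in memory at once, which is why this only suits small result sets.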

However, you can do something fairly decent with two queries and a "synchronized enumerable". Caveat: it's hard to tell what a "Dash" is, but looking at the properties I'm going to assume that there are many Dashes for each Activity.

Pseudo code

// assuming Query(...) returns a TopDocs sorted ascending by the given (numeric) field
var dashDocs = Query("+dash_username:mm +dash_description:Appointment", sortBy: "dash_activityid");
var activityDocs = Query("+activity_subject:Appointment", sortBy: "activity_id");

// ScoreDocs is an array; take a typed enumerator and prime it with MoveNext()
var dashDocsEnum = ((IEnumerable<ScoreDoc>)dashDocs.ScoreDocs).GetEnumerator();
var hasDash = dashDocsEnum.MoveNext();

foreach (var activityDoc in activityDocs.ScoreDocs)
{
    if (!hasDash)
        break; // no Dash docs left to match

    var activityId = GetId(activityDoc.Doc, "activity_id");
    var dashActivityId = GetId(dashDocsEnum.Current.Doc, "dash_activityid");

    // spin Dash forward to catch up with Activity
    while (hasDash && dashActivityId < activityId)
    {
        hasDash = dashDocsEnum.MoveNext();
        if (hasDash)
            dashActivityId = GetId(dashDocsEnum.Current.Doc, "dash_activityid");
    }

    while (hasDash && dashActivityId == activityId)
    {
        // at this point we have an Activity and a matched Dash
        var fullActivity = GetActivity(activityDoc.Doc);
        var fullDash = GetDash(dashDocsEnum.Current.Doc);

        // do something with Activity and Dash

        hasDash = dashDocsEnum.MoveNext();
        if (hasDash)
            dashActivityId = GetId(dashDocsEnum.Current.Doc, "dash_activityid");
    }
}

That was just written off the top of my head, so apologies if it's not quite right :)

The idea is to foreach over the activities and step the dash enumerator forward to keep it in sync with the current activity. This assumes that both result sets are sorted by activity id and that you're storing the property values in Field.Store.YES fields. The approach reads only the id fields until it finds a match, then projects the entire object.
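The same stepping logic can be pulled out of the Lucene-specific code. The following is a minimal sketch of a merge join over any two sequences sorted ascending by an integer key; `SortedJoin` and `MergeJoin` are hypothetical names, and parent keys are assumed to be unique.

```csharp
using System;
using System.Collections.Generic;

public static class SortedJoin
{
    // Merge-joins two sequences that are both sorted ascending by key.
    // Assumption: parent keys are unique; each parent may match many children.
    public static IEnumerable<KeyValuePair<TParent, TChild>> MergeJoin<TParent, TChild>(
        IEnumerable<TParent> parents, Func<TParent, int> parentKey,
        IEnumerable<TChild> children, Func<TChild, int> childKey)
    {
        using (var child = children.GetEnumerator())
        {
            var hasChild = child.MoveNext();
            foreach (var parent in parents)
            {
                var key = parentKey(parent);

                // Skip children whose key is smaller than the current parent's key.
                while (hasChild && childKey(child.Current) < key)
                    hasChild = child.MoveNext();

                // Emit every child that belongs to this parent.
                while (hasChild && childKey(child.Current) == key)
                {
                    yield return new KeyValuePair<TParent, TChild>(parent, child.Current);
                    hasChild = child.MoveNext();
                }

                if (!hasChild)
                    yield break; // children exhausted, nothing left to match
            }
        }
    }
}
```

With Lucene, the parents would be the Activity hits and the children the Dash hits, with the key selectors reading the stored id fields. Each sequence is walked once, so only the current pair needs to be in memory.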

Another option

Treat Lucene as a "document datastore". Create a class which models the parent-child relationship, so that Activity has a property which is a collection of Dash.

Serialize that object into a stored binary field, and add the fields you want to search on with Field.Store.NO. This means that no join is required: you get the entire object in one hit.

This works if the update frequency is low, as you'd need to rewrite the entire object rather than just adding a single Dash and relying on the join.
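A minimal sketch of such a parent-child model follows. The class names `ActivityWithDashes`, `DashItem` and `ActivityBlob` are hypothetical, and System.Text.Json is used here only as one possible serializer (an assumption; it requires a newer runtime than Lucene.NET 3.0.3 itself does, and any serializer that yields a byte[] would do).

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public class DashItem
{
    public int Id { get; set; }
    public string Description { get; set; }
    public string Username { get; set; }
}

// Parent object that carries its children, so no join is needed at query time.
public class ActivityWithDashes
{
    public int Id { get; set; }
    public string Subject { get; set; }
    public List<DashItem> Dashes { get; set; } = new List<DashItem>();
}

public static class ActivityBlob
{
    // Bytes intended for a single stored binary field.
    public static byte[] ToBytes(ActivityWithDashes activity)
    {
        return JsonSerializer.SerializeToUtf8Bytes(activity);
    }

    // Rebuilds the whole object graph from the stored bytes.
    public static ActivityWithDashes FromBytes(byte[] bytes)
    {
        return JsonSerializer.Deserialize<ActivityWithDashes>(bytes);
    }

    // Text values that would go into indexed, non-stored search fields.
    public static IEnumerable<string> SearchableText(ActivityWithDashes activity)
    {
        yield return activity.Subject;
        foreach (var dash in activity.Dashes)
            yield return dash.Description;
    }
}
```

Adding a Dash then means loading the Activity, appending to `Dashes`, and replacing the whole document in the index.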

Good luck :)

Upvotes: 1
