Wesley

Reputation: 5621

Memory Based Full Text Search

I have a social feature with a structure similar to a blog: posts and comments.

Posts have a field called body, and so do comments. The posts and comments are stored in a SharePoint list, so direct SQL Full Text Queries are out.

If someone types in "outage restoration efficiency in November", I really have no clue as to how I would properly return a list of posts based on the content of the posts and their attached comments.

The good news is, there will never be more than 50-100 posts I need to search through at a time. Knowing that, the easiest way to solve this would be for me to load the posts and comments into memory and search through them via a loop.

Ideally something like this would be the fastest solution:

using System;
using System.Collections.Generic;
using System.Linq;

class Post
{
    public int Id;
    public string Body;
    public List<Comment> comments;
}
class Comment
{
    public int Id;
    public int ParentCommentId;
    public int PostId;
    public string Body;
}

public List<Post> allPosts;
public List<Comment> allComments;

public List<Post> postsToInclude(string searchText)
{
    var returnList = new List<Post>();
    foreach (Post post in allPosts)
    {
        // if post.Body is a good fit, include the post
        if (isThisAMatch(searchText, post.Body))
            returnList.Add(post);
    }
    foreach (Comment comment in allComments)
    {
        // if comment.Body is a good fit, include its parent post (once)
        if (isThisAMatch(searchText, comment.Body))
        {
            Post parent = allPosts.First(p => p.Id == comment.PostId);
            if (!returnList.Contains(parent))
                returnList.Add(parent);
        }
    }
    return returnList;
}

public bool isThisAMatch(string searchText, string textToSearch)
{
    // return true/false: is textToSearch a good match for searchText?
    throw new NotImplementedException();
}

Upvotes: 1

Views: 1536

Answers (2)

Jan

Reputation: 2168

That is not a trivial subject. Since a machine has no concept of "content", it is inherently difficult to retrieve articles that are about a certain topic. To make an educated guess as to whether each article is relevant to your search term(s), you have to use a proxy algorithm, e.g. TF-IDF.
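
To illustrate the TF-IDF idea in the question's own language: each query term contributes its frequency in the document, weighted by how rare the term is across all documents. The class and method names below are hypothetical, and the whitespace/punctuation tokenization is deliberately naive; a real library does far more (stemming, stop words, normalization):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not a real library API: a bare-bones TF-IDF scorer.
static class TfIdfSketch
{
    static string[] Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', '.', ',', '!', '?' },
                   StringSplitOptions.RemoveEmptyEntries);

    // Score one document against the query: sum over query terms of tf * idf.
    public static double Score(string query, string doc, IReadOnlyList<string> allDocs)
    {
        var docTokens = Tokenize(doc);
        double score = 0.0;
        foreach (string term in Tokenize(query).Distinct())
        {
            int tf = docTokens.Count(t => t == term);                 // term frequency in this doc
            int df = allDocs.Count(d => Tokenize(d).Contains(term));  // docs containing the term
            if (tf == 0 || df == 0) continue;
            double idf = Math.Log((double)allDocs.Count / df);        // rarer terms weigh more
            score += tf * idf;
        }
        return score;
    }
}
```

Ranking the posts by that score, descending, gives a rough relevance ordering; the point of a library like Lucene is that it replaces each of these naive steps with a tuned one.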

Instead of implementing that yourself, I'd advise using an existing information retrieval library. There are a few really popular ones out there. Based on my own experience, I'd suggest taking a closer look at Apache Lucene. A look at their reference list shows how widely it is used.

If you have never had anything to do with information retrieval, I promise you a very steep learning curve. To ease into the whole area, I suggest you use Solr first. It runs pretty much "out of the box" and gives you a nice idea of what is possible. I had a breakthrough when I started to really look at the available filters and each step of the algorithm. After that I had a better understanding of what to change to get better results. Depending on the format of your content, the system might require serious tweaking.

I have spent a LOT of time with Lucene, Solr and some alternatives in my job. The results I got in the end were acceptable, but it was a difficult process. It took a lot of theory, testing and prototyping to get there.

Upvotes: 4

user959729

Reputation: 1177

Note: I have no experience doing this, but this is the way I would approach the problem.

Firstly, I would do your search in parallel. This will increase the performance of your searching function.

Secondly, because the user can input multiple words, I would create a scoring system that scores each comment against the query. For example, a comment that contains 2 of the query words would get a higher score than a comment that contains only 1 of them. And if a comment contains an exact match of the whole query, it might receive a really high score.

Anyway, once all of the comments have been scored against the query in a parallel loop, show the user the top-scoring comments as their results. Note additionally that this works because of the small data set size (50-100 posts).
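
A minimal sketch of that scoring-plus-parallelism idea, using PLINQ. The class name, the one-point-per-word rule, and the exact-match bonus of 10 are all illustrative assumptions, not a prescribed weighting:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical scorer: one point per query word present, plus a bonus for an exact phrase match.
static class CommentScorer
{
    public static int Score(string query, string body)
    {
        string lowerQuery = query.ToLowerInvariant();
        string lowerBody = body.ToLowerInvariant();
        string[] queryWords = lowerQuery.Split(
            new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // one point per query word that appears in the comment
        int score = queryWords.Count(w => lowerBody.Contains(w));

        // big bonus when the comment contains the whole query verbatim
        if (lowerBody.Contains(lowerQuery))
            score += 10;

        return score;
    }

    // Score all comments in parallel and return the best ones first.
    public static List<(string Body, int Score)> TopComments(
        string query, IEnumerable<string> comments, int take)
    {
        return comments
            .AsParallel()
            .Select(c => (Body: c, Score: Score(query, c)))
            .Where(x => x.Score > 0)
            .OrderByDescending(x => x.Score)
            .Take(take)
            .ToList();
    }
}
```

For 50-100 posts the parallelism is optional (a plain loop is already fast at that size), but the scoring-and-ranking shape is the useful part.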

Upvotes: -1
