Wesley

Reputation: 5621

Memory Based Full Text Search

I have a social feature with a structure similar to a blog: posts and comments.

Posts have a field called body, and so do comments. The posts and comments are stored in a SharePoint list, so direct SQL Full Text Queries are out.

If someone types in "outage restoration efficiency in November", I really have no clue as to how I would properly return a list of posts based on the content of the posts and their attached comments.

The good news is, there will never be more than 50-100 posts I need to search through at a time. Knowing that, the easiest way to solve this would be for me to load the posts and comments into memory and search through them via a loop.

Ideally something like this would be the fastest solution:

using System;
using System.Collections.Generic;
using System.Linq;

class Post
{
    public int Id;
    public string Body;
    public List<Comment> comments;
}
class Comment
{
    public int Id;
    public int ParentCommentId;
    public int PostId;
    public string Body;
}

public List<Post> allPosts;
public List<Comment> allComments;

public List<Post> postsToInclude(string searchText)
{
    var returnList = new List<Post>();
    foreach (Post post in allPosts)
    {
        // if post.Body is a good fit, include the post
        if (isThisAMatch(searchText, post.Body))
            returnList.Add(post);
    }
    foreach (Comment comment in allComments)
    {
        // if comment.Body is a good fit, include its parent post (once)
        if (isThisAMatch(searchText, comment.Body))
        {
            Post parent = allPosts.First(p => p.Id == comment.PostId);
            if (!returnList.Contains(parent))
                returnList.Add(parent);
        }
    }
    return returnList;
}

public bool isThisAMatch(string searchText, string textToSearch)
{
    // return true/false: is textToSearch a good match for searchText?
    throw new NotImplementedException();
}

Upvotes: 1

Views: 1536

Answers (2)

Jan

Reputation: 2168

That is not a trivial subject. Since a machine has no concept of "content", it is inherently difficult to retrieve articles that are about a certain topic. To make an educated guess as to whether each article is relevant to your search term(s), you have to use a proxy algorithm, e.g. TF-IDF.
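
To illustrate the TF-IDF idea in the question's own language: each query term contributes its frequency in the document, weighted by how rare the term is across all documents. The class and method names below are hypothetical, and the whitespace/punctuation tokenization is deliberately naive; a real library does far more (stemming, stop words, normalization):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not a real library API: a bare-bones TF-IDF scorer.
static class TfIdfSketch
{
    static string[] Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', '.', ',', '!', '?' },
                   StringSplitOptions.RemoveEmptyEntries);

    // Score one document against the query: sum over query terms of tf * idf.
    public static double Score(string query, string doc, IReadOnlyList<string> allDocs)
    {
        var docTokens = Tokenize(doc);
        double score = 0.0;
        foreach (string term in Tokenize(query).Distinct())
        {
            int tf = docTokens.Count(t => t == term);                 // term frequency in this doc
            int df = allDocs.Count(d => Tokenize(d).Contains(term));  // docs containing the term
            if (tf == 0 || df == 0) continue;
            double idf = Math.Log((double)allDocs.Count / df);        // rarer terms weigh more
            score += tf * idf;
        }
        return score;
    }
}
```

Ranking the posts by that score, descending, gives a rough relevance ordering; the point of a library like Lucene is that it replaces each of these naive steps with a tuned one.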

Instead of implementing that yourself, I'd advise using an existing information retrieval library. There are a few really popular ones out there. Based on my own experience, I'd suggest taking a closer look at Apache Lucene. A look at their reference list shows how widely it is used.

If you have never had anything to do with information retrieval, I promise you a very steep learning curve. To ease into the whole area, I suggest you use Solr first. It runs pretty much "out of the box" and gives you a nice idea of what is possible. I had a breakthrough when I started to really look at the available filters and each step of the algorithm. After that I had a better understanding of what to change to get better results. Depending on the format of your content, the system might require serious tweaking.

I have spent a LOT of time with Lucene, Solr and some alternatives in my job. The results I got in the end were acceptable, but it was a difficult process. It took a lot of theory, testing and prototyping to get there.

Upvotes: 4

user959729

Reputation: 1177

Note: I have no experience doing this, but this is the way I would approach the problem.

Firstly, I would do your search in parallel. This will increase the performance of your searching function.

Secondly, because the user can input multiple words, I would create a scoring system that scores each comment against the query. For example, a comment that contains 2 of the query words would get a higher score than a comment that contains only 1 of them. And if a comment contains an exact match of the whole query, it might receive a really high score.

Anyway, once all of the comments have been scored against the query in a parallel loop, show the user the top-scoring comments as their results. Note additionally that this works because of the small data set size (50-100 posts).
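
A minimal sketch of that scoring-plus-parallelism idea, using PLINQ. The class name, the one-point-per-word rule, and the exact-match bonus of 10 are all illustrative assumptions, not a prescribed weighting:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical scorer: one point per query word present, plus a bonus for an exact phrase match.
static class CommentScorer
{
    public static int Score(string query, string body)
    {
        string lowerQuery = query.ToLowerInvariant();
        string lowerBody = body.ToLowerInvariant();
        string[] queryWords = lowerQuery.Split(
            new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // one point per query word that appears in the comment
        int score = queryWords.Count(w => lowerBody.Contains(w));

        // big bonus when the comment contains the whole query verbatim
        if (lowerBody.Contains(lowerQuery))
            score += 10;

        return score;
    }

    // Score all comments in parallel and return the best ones first.
    public static List<(string Body, int Score)> TopComments(
        string query, IEnumerable<string> comments, int take)
    {
        return comments
            .AsParallel()
            .Select(c => (Body: c, Score: Score(query, c)))
            .Where(x => x.Score > 0)
            .OrderByDescending(x => x.Score)
            .Take(take)
            .ToList();
    }
}
```

For 50-100 posts the parallelism is optional (a plain loop is already fast at that size), but the scoring-and-ranking shape is the useful part.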

Upvotes: -1
