Reputation: 5621
I have a social feature with a structure similar to a blog: posts and comments.
Posts have a field called body, and so do comments. The posts and comments are stored in a SharePoint list, so direct SQL Full Text Queries are out.
If someone types in "outage restoration efficiency in November", I really have no clue how I would properly return a list of posts based on the content of the posts and their attached comments.
The good news is, there will never be more than 50-100 posts I need to search through at a time. Knowing that, the easiest way to solve this would be for me to load the posts and comments into memory and search through them via a loop.
Ideally something like this would be the fastest solution:
class Post
{
    public int Id;
    public string Body;
    public List<Comment> Comments;
}

class Comment
{
    public int Id;
    public int ParentCommentId;
    public int PostId;
    public string Body;
}

public List<Post> allPosts;
public List<Comment> allComments;

public List<Post> postsToInclude(string searchText)
{
    var returnList = new List<Post>();
    foreach (Post post in allPosts)
    {
        // If post.Body is a good fit, include the post.
        if (isThisAMatch(searchText, post.Body))
            returnList.Add(post);
    }
    foreach (Comment comment in allComments)
    {
        // If comment.Body is a good fit, include the post it belongs to (avoiding duplicates).
        if (isThisAMatch(searchText, comment.Body))
        {
            Post parent = allPosts.Find(p => p.Id == comment.PostId);
            if (parent != null && !returnList.Contains(parent))
                returnList.Add(parent);
        }
    }
    return returnList;
}

public bool isThisAMatch(string searchText, string textToSearch)
{
    // Return whether textToSearch is a good match for searchText.
    // This is the part I don't know how to write.
    throw new NotImplementedException();
}
Upvotes: 1
Views: 1536
Reputation: 2168
That is not a trivial subject. Since a machine has no concept of "content", it is inherently difficult to retrieve articles that are about a certain topic. To make an educated guess as to whether each article is relevant to your search term(s), you have to use a proxy algorithm, e.g. TF-IDF.
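To illustrate the idea (my own minimal sketch, not a production implementation): TF-IDF weights a term higher the more often it occurs in a document and the rarer it is across all documents. A naive C# version, with whitespace tokenization and the post bodies keyed by post id, could look like this:

using System;
using System.Collections.Generic;
using System.Linq;

static class TfIdf
{
    // Naive whitespace/punctuation tokenizer.
    static string[] Tokenize(string text) =>
        text.ToLowerInvariant()
            .Split(new[] { ' ', '.', ',', '!', '?', ';', ':' }, StringSplitOptions.RemoveEmptyEntries);

    // Score each document (keyed by post id) against the query terms.
    public static Dictionary<int, double> Score(Dictionary<int, string> documents, string query)
    {
        var tokenized = documents.ToDictionary(d => d.Key, d => Tokenize(d.Value));
        int documentCount = tokenized.Count;

        // Document frequency: in how many documents does each term occur?
        var df = new Dictionary<string, int>();
        foreach (var tokens in tokenized.Values)
            foreach (var term in tokens.Distinct())
                df[term] = df.TryGetValue(term, out var count) ? count + 1 : 1;

        var scores = new Dictionary<int, double>();
        foreach (var doc in tokenized)
        {
            if (doc.Value.Length == 0) { scores[doc.Key] = 0; continue; }
            double score = 0;
            foreach (var term in Tokenize(query))
            {
                if (!df.TryGetValue(term, out var docFreq)) continue;
                double tf = doc.Value.Count(t => t == term) / (double)doc.Value.Length;
                double idf = Math.Log((double)documentCount / docFreq);
                score += tf * idf;   // high when the term is frequent here but rare elsewhere
            }
            scores[doc.Key] = score;
        }
        return scores;
    }
}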
Instead of implementing that yourself, I'd advise using an existing information retrieval library. There are a few really popular ones out there. Based on my own experience, I'd suggest taking a closer look at Apache Lucene. A look at their reference list shows their significance.
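To give a rough feel for what working with Lucene looks like from C#, here is a sketch assuming the Lucene.NET 4.8 port and an in-memory index; the field names and example text are made up:

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

class LuceneSketch
{
    static void Main()
    {
        var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
        var directory = new RAMDirectory();   // in-memory index, fine for 50-100 posts

        // Index one document per post (concatenating the post body with its comments would also work).
        using (var writer = new IndexWriter(directory, new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)))
        {
            var doc = new Document();
            doc.Add(new StringField("id", "42", Field.Store.YES));
            doc.Add(new TextField("body", "Outage restoration went smoothly in November.", Field.Store.YES));
            writer.AddDocument(doc);
            writer.Commit();
        }

        // Search the "body" field and print the ranked hits.
        using (var reader = DirectoryReader.Open(directory))
        {
            var searcher = new IndexSearcher(reader);
            var parser = new QueryParser(LuceneVersion.LUCENE_48, "body", analyzer);
            TopDocs hits = searcher.Search(parser.Parse("outage restoration efficiency in November"), 10);
            foreach (ScoreDoc hit in hits.ScoreDocs)
                Console.WriteLine($"post {searcher.Doc(hit.Doc).Get("id")}  score {hit.Score}");
        }
    }
}

The ranking, tokenization and stop-word handling all come from the library, which is exactly the part that is hard to get right by hand.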
If you have never had anything to do with information retrieval, I promise a very steep learning curve. To ease into the whole area, I suggest you use Solr first. It runs pretty much "out of the box" and gives you a nice idea of what is possible. I had a breakthrough when I started to really look at the available filters and each step of the algorithm. After that I had a better understanding of what to change to get better results. Depending on the format of your content, the system might require serious tweaking.
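For a first feel of Solr, you mostly talk to it over HTTP and inspect the ranked results it returns. Something along these lines (purely illustrative: the core name "posts", the field name "body" and the local URL are assumptions about your setup) queries a core and prints the raw JSON response:

using System;
using System.Net.Http;
using System.Threading.Tasks;

class SolrQuerySketch
{
    static async Task Main()
    {
        using (var http = new HttpClient())
        {
            string url = "http://localhost:8983/solr/posts/select"
                       + "?q=" + Uri.EscapeDataString("body:(outage restoration efficiency November)")
                       + "&rows=10&wt=json";
            string json = await http.GetStringAsync(url);
            Console.WriteLine(json);   // look at numFound and the ranked documents
        }
    }
}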
I have spent a LOT of time with Lucene, Solr and some alternatives in my job. The results I got in the end were acceptable, but it was a difficult process. It took a lot of theory, testing and prototyping to get there.
Upvotes: 4
Reputation: 1177
Note: I have no experience doing this, but this is the way I would approach the problem.
Firstly, I would run your search in parallel. Doing so should noticeably improve the performance of your search function.
Secondly, because you are able to input multiple words, I would create a scoring system that scores each comment against the query. For example, a comment that contains 2 of the query words gets a higher score than a comment that only contains 1 of them, and a comment that contains an exact match of the whole query might receive a really high score.
Anyway, once all of the comments have been scored against the query in a parallel loop, show the user the top-scoring results (see the sketch below). Note that this only works well because of the small data set size, 50-100 posts.
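A rough sketch of that idea (my own illustration, reusing the Comment class from the question; the weights and the TopPostIds helper name are made up, and the scores are rolled up to posts since that is what the question ultimately needs):

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class CommentScoring
{
    // Score every comment in parallel, then return the ids of the best-scoring posts.
    public static List<int> TopPostIds(IEnumerable<Comment> comments, string searchText, int take = 10)
    {
        string[] words = searchText.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        var scores = new ConcurrentBag<(int PostId, int Score)>();

        Parallel.ForEach(comments, comment =>
        {
            string body = comment.Body.ToLowerInvariant();
            int score = words.Count(w => body.Contains(w));     // one point per query word found
            if (body.Contains(searchText.ToLowerInvariant()))   // bonus for an exact phrase match
                score += 10;
            if (score > 0)
                scores.Add((comment.PostId, score));
        });

        // Sum the comment scores per post and return the highest-scoring posts first.
        return scores.GroupBy(s => s.PostId)
                     .OrderByDescending(g => g.Sum(s => s.Score))
                     .Select(g => g.Key)
                     .Take(take)
                     .ToList();
    }
}

The same loop could also score the post bodies themselves; with only 50-100 posts the whole thing stays well within interactive response times.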
Upvotes: -1