Jerry
Jerry

Reputation: 2567

(Full-Text) Search And Database Design

This is a system architecture question on designing full-text search with (relational) database. The specific software I'm using are Solr and PostgreSQL, just FYI.

Suppose we are building a forum with two users Andy and Betty --

Post ID | User  | Title             | Content
--------|-------|-------------------|---------------------------
1       | Andy  | Dark Knight rocks | Dark Knight rocks blah
2       | Betty | I love Twilight   | Twilight blah blah
3       | Andy  | Twilight sucks    | Twilight sucks blah
4       | Betty | Andy sucks        | Twilight rocks, Andy sucks

When the posts table is indexed in Solr, we can easily return the posts sorted by their relevancy to "?q=twilight" or "?q=dark+night".

Now we want to add a new feature to search for users instead of posts. A naive implementation would simply index user name and return "Andy" to "?q=a" and "Betty" to "?q=b", but what if we want to make our system smarter to also take into account of the user posts and return "Betty" before "Andy" to "?q=twilight" because Betty mentions Twilight more than Andy does.

How would you design the system to efficiently handle the user-search function for hundreds of thousands of users and millions of posts?

Upvotes: 2

Views: 422

Answers (2)

Jesvin Jose
Jesvin Jose

Reputation: 23098

Faceting on User would return number of results per user. If Andy wrote 15 posts that match Twilight while Betty wrote 10, the faceting will return them as such.

But it wont help if both wrote 15 posts about Twilight, but Andy's was supposed to be more relevant; you will see all facet counts (15, 15 in this case) even if you are paginating to see only (say,) top 5 results and Andy made 4 of them.


If above solution is not good enough, consider a background job that writes documents of

type: suggest_user_type (so you can distinguish them by a `fq`)
user: Andy (the user)
concatted_posts: "I think Twilight.." (concatenate the users latest 50 posts)

once a week. And if you

fq=type:suggest_user_type&
q=concatted_posts:twilight&
fl=user

you get a sorted list of users based on relevance of concatted_posts with respect to twilight.

Upvotes: 1

Paul Coghill
Paul Coghill

Reputation: 677

I believe term frequency is included in full text search ranking. It's part of a research area called information retrieval. There's also another value called the inverse document frequency, which filters out common terms.

There are other steps common to ranking text, you may want to have a look at the OpenNLP project if you're interested.

In terms of database design, there's too much to cover in a post and I'm not the one to write it. The general consensus seems to be for very large systems they key is building an efficient index, then distributing this over a number machines to scale performance. I would recommend reading up on Page Rank and how Google developed its systems as a starting point.

Upvotes: 0

Related Questions