Reputation: 20415
I just started using postgreSQL for fuzzy text matching. I have two columns: job_title
and company_name
.
A typical full text search would concatenate job_title
and company_name
, then return search text result according to a single ranking.
However, treating text match in two columns equally can be problematic in my case. For example, Search Engineer
at Google Co.
should not be ranked equally with Google Search
at Engineer Co.
I know that I can assign different weights for each column. However, I don't have a reason to weight one more important than other.
How do I match my keywords against each of the columns separately, and return some "match score" on each keywords?
Something like:
Jobs.where("(to_tsvector('english', position) @@ plainto_tsquery(:q)) AND
(to_tsvector('english',company) @@ plainto_tsquery(:q))", q: "Search Engineer", q: "Google")
Upvotes: 2
Views: 4349
Reputation: 78413
As you've pointed out, you can concatenate tsvectors:
# select to_tsvector('job description') ||
to_tsvector('company as keyword') ||
to_tsvector('job description as body') as vector;
vector
-----------------------------------------------------------
'bodi':9 'compani':3 'descript':2,7 'job':1,6 'keyword':5
(1 row)
And you can also assign them weights:
# select (setweight(to_tsvector('job description'), 'A') ||
setweight(to_tsvector('company as keyword'), 'B') ||
setweight(to_tsvector('job description as body'), 'D')) as vector;
vector
---------------------------------------------------------------
'bodi':9 'compani':3B 'descript':2A,7 'job':1A,6 'keyword':5B
(1 row)
You can also play around with ts_rank_cd()
. In particular, you could change the way the scores are normalized.
http://www.postgresql.org/docs/current/static/textsearch-controls.html
In your case, it seems that you want to combine two separate queries instead. An ugly but possibly adequate solution might look like this:
select sum(rank) as rank, ...
from (
select ...
union all
select ...
) as sub
group by ...
order by sum(rank) desc
limit 10
As you can see, it's not very pretty. It's also a boulevard for aggregating potentially large sets of matching rows. Imho, you'll be better off sticking with the built-in tsvector arithmetic and tweaking weights if needed.
Upvotes: 2