David Wolever

Reputation: 154594

Get list of matching lexemes from PostgreSQL full text search?

The full text search ranking documentation suggests that

You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.

But I haven't been able to find any examples of how custom ranking functions can be built.

Specifically, I haven't been able to figure out how to extract the list of lexemes in a tsvector which match a given tsquery… something like this:

> SELECT ts_matching_lexemes('cat in the hat'::tsvector, 'cat'::tsquery);
ts_matching_lexemes
-------------------
'cat':1

So, how can I figure out which lexemes in a tsvector match a given tsquery?

Upvotes: 5

Views: 1431

Answers (1)

Nisan.H

Reputation: 6352

It looks like the ts_headline function already does this internally, but the logic is deep in the C source and the function only outputs a string. You can, however, run ts_headline with custom delimiters and parse the matching lexemes out of its result (this string parsing is relatively slow compared with the C functions):

Code:

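CREATE OR REPLACE FUNCTION ts_matching_lexemes(tsv tsvector, tsq tsquery)
RETURNS TSVECTOR AS
$$

    WITH 
      -- Wrap every matching lexeme in the tsvector's text form with <;> markers
      proc AS (
        SELECT
            ts_headline(tsv::TEXT, tsq, 'StartSel = <;>, StopSel = <;>') tsh
    )
      -- Split on the markers: every even-numbered part is a matched lexeme
    , parts AS (
        SELECT unnest(regexp_split_to_array(tsh, '<;>')) p FROM proc
    )
      -- Pair each part with the next one, which starts with the closing quote and :positions
    , parts_enum AS (
        SELECT p, lead(p, 1) OVER (), row_number() OVER () FROM parts
    )
    -- Rebuild lexeme:positions pairs and cast the joined string back to tsvector
    SELECT (string_agg(p || SUBSTRING(split_part(lead, ' ', 1) FROM 2), ' '))::tsvector
    FROM parts_enum
    WHERE row_number % 2 = 0

$$
LANGUAGE SQL;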

e.g.:

select ts_matching_lexemes(to_tsvector('cat in the hat'), to_tsquery('cat'))
union
select ts_matching_lexemes(to_tsvector('cats and bikes in the hat'), to_tsquery('cat & bike'));


ts_matching_lexemes
tsvector
-------------------
'cat':1
'bike':3 'cat':1
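
To see what the function has to parse, it can help to look at the intermediate ts_headline result on its own. The query below is only an illustrative sketch: it runs the same call as the proc CTE for the first example. The explicit 'english' configuration is an assumption added here for reproducibility; the function above relies on default_text_search_config instead.

```sql
-- Show the marked-up string that ts_matching_lexemes later splits on '<;>'.
-- 'english' is an assumed configuration; the function uses the session default.
SELECT ts_headline(
    'english',
    to_tsvector('english', 'cat in the hat')::text,  -- text form of the tsvector
    to_tsquery('english', 'cat'),
    'StartSel = <;>, StopSel = <;>'                   -- same delimiters as the function
) AS marked;
```

The matched lexeme should appear wrapped in the <;> markers, followed by its closing quote and position suffix, which is the shape the split_part and SUBSTRING calls rely on.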

notes:

  1. The function passes the text representation of the tsvector to ts_headline, rather than the original document, to reduce redundant work.
  2. It is approximately 10x slower than ts_headline(text, to_tsquery(...)); removing the CTEs can speed it up.
  3. Of course, a much faster solution would be to add the functionality directly in the C source, which should then be as fast as tsvector @@ tsquery.
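
As a possible alternative that avoids ts_headline and the string parsing altogether, the tsvector can be expanded into rows and compared against the lexemes named in the query. This is a different technique from the function above and only a rough sketch under stated assumptions: it needs PostgreSQL 9.6 or later for unnest(tsvector), it pulls the query operands out of the tsquery's text form with a regular expression, and it ignores negation (!), prefix matches (:*) and phrase operators. The 'english' configuration is again an assumption.

```sql
-- Sketch only: assumes PostgreSQL 9.6+ and a query made of plain lexemes.
-- Negated, prefix and phrase operands are NOT handled correctly here.
SELECT u.lexeme, u.positions
FROM unnest(to_tsvector('english', 'cats and bikes in the hat')) AS u
WHERE u.lexeme IN (
    -- Extract each quoted operand from the tsquery's text representation
    SELECT m[1]
    FROM regexp_matches(
        to_tsquery('english', 'cat & bike')::text,
        '''([^'']+)''',
        'g'
    ) AS m
)
ORDER BY u.lexeme;
```

For this input the rows should carry the same lexemes and positions as the second example above, returned as one row per lexeme rather than as a single tsvector.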

Upvotes: 5
