turnip
turnip

Reputation: 2346

Can PostgreSQL's to_tsvector function return tokens/words and not lexemes?

PostgreSQL's to_tsvector function is extremely useful but in regards to my data set it does a little more than I want it to.

For instance:

select * 
from to_tsvector('english', 'This is my favourite game. I enjoy everything about it.');

produces: 'enjoy':7 'everyth':8 'favourit':4 'game':5

I am not fussed about stop-words getting filtered out, that is fine. But some words get completely ruined, like everything and favourite.

Is there a way to modify this behaviour or is there a different function that does this?

PS: Yes, I can write my own query that does this (and I have) but I want a faster method.

Upvotes: 4

Views: 1986

Answers (1)

Mad Scientist
Mad Scientist

Reputation: 18551

The behaviour that you're seeing and that you don't want is "stemming". If you don't want that, you have to use a different dictionary with to_tsvector. The "simple" dictionary doesn't do stemming, so it should fit your use case.

select * 
from to_tsvector('simple', 'This is my favourite game. I enjoy everything about it.');

results in the following output

'about':9 'enjoy':7 'everything':8 'favourite':4 'game':5 'i':6 'is':2 'it':10 'my':3 'this':1

If you still want to remove stop words, you have to define your own dictionary as far as I can see. See the example below, though you might want to read up on the documentation to make sure this does exactly what you want.

CREATE TEXT SEARCH DICTIONARY only_stop_words (
    Template = pg_catalog.simple,
    Stopwords = english
);
CREATE TEXT SEARCH CONFIGURATION public.only_stop_words ( COPY = pg_catalog.simple );
ALTER TEXT SEARCH CONFIGURATION public.only_stop_words ALTER MAPPING FOR asciiword WITH only_stop_words;
select * 
from to_tsvector('only_stop_words', 'The This is my favourite game. I enjoy everything about it.');

'enjoy':8 'everything':9 'favourite':5 'game':6

Upvotes: 5

Related Questions