Rob
Rob

Reputation: 877

PostgreSQL - convert lexemes to full words via ts_stat

I trying to get counts of words to generate a word cloud type view from my PostgreSQL db. I was hoping to take advantage of the built in efficiencies of ts_vectors. I see that there is a ts_stats function that returns a list of lexemes and their frequencies, given a ts_vector column.

The rows returned look something like this:

word    ndoc    nentry
studi   322     43434
compani 233     2322

And the query looks like this:

SELECT * FROM ts_stat('SELECT tsv from doc')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10 

However, I need the full word, not just the lexemes. Is there a way to convert a lexeme to one of its full-word representations with PostgreSQL? You can easily convert a word to its lexeme, but I can't figure out how to do the reverse.

Thanks.

Upvotes: 4

Views: 3662

Answers (2)

cdalxndr
cdalxndr

Reputation: 1551

It's not possible. You can store the initial text into a separate column.

A lexeme is a normalized representation (using dictionary or stemmer) of a word, so it can match several forms: lexeme 'walk' matches 'walks' and also 'walking', so the original form cannot be determined.

Upvotes: -1

Craig Mitchell
Craig Mitchell

Reputation: 39

Change the configuration of your ts_vector to 'simple'. the default is 'english' which does the stemming e.g.

SELECT * FROM ts_stat('SELECT to_tsvector(''freds company company company hello ridiculous'')');

outputs counts for ridicul, compani, hello, fred

with the 'simple' configuration no stemming occurs

SELECT * FROM ts_stat('SELECT to_tsvector(''simple'',''freds company company company hello ridiculous'')');

outputs counts for : ridiculous,hello,freds,company

Upvotes: 3

Related Questions