Panama Jack
Panama Jack

Reputation: 24478

Count unique entries for ts_stat count in full text search

I'm struggling with using ts_stat to get the number of unique occurrences of tags in a table and sort them by the highest count. What I need though is to only count each entry one time so that only unique entries are counted. I tried group by and distinct but nothing is working for me.

e.g. table

user_id | tags         | post_date
===================================
2       | dog cat      | 1580049400
2       | dog          | 1580039400
3       | dog          | 1580038400
3       | dog dog cat  | 1580058400
4       | dog horse    | 1580028400

Here is the current query

SELECT word, ndoc, nentry
FROM   ts_stat($$SELECT to_tsvector('simple', tags) FROM tags WHERE post_date > 1580018400$$) 
ORDER  BY ndoc DESC
LIMIT  10;

Right now this will produce

word | ndoc | nentry
====================
dog  | 5    | 6
cat  | 2    | 2
horse| 1    | 1

The result I would be looking for is unique counts so no 1 user can count more than once even if they have > 1 entries after a certain date as noted in the post_date condition (Which might be irrelevant). Like below.

word | total_count_per_user
===========================
dog  | 3    (because there are 3 unique users with this term)
cat  | 2    (because there are 2 unique users with this term)
horse| 1    (because there are 1 unique users with this term)

UPDATE: I changed the column name to reflect output. The point is no matter how many times a user enters a word. It only needs the unique count per user. e.g. if a user in that scenario creates 100 entries with dog in the text it will only count dog 1 time for that user not 100 counts of dog.

Upvotes: 0

Views: 386

Answers (2)

Panama Jack
Panama Jack

Reputation: 24478

I guess this one was tough. Just in case someone happens to have a similar requirement I was able to get this to work. Seems odd to have to get total with ts_stat then filter it again using distinct, cross join etc so that no matter how many times it finds a word each user only counts once per word. I'm not sure how efficient it will be on a large data set but it yields the expected results.

UPDATE: This is works without using a CTE. Also cross join is the key to filtering on user id.

SELECT DISTINCT (t.word) as tag, count(DISTINCT h.user_id) as posts 
FROM ts_stat($$SELECT hashtagsearch FROM tagstable WHERE post_date > 1580018400$$) t 
CROSS JOIN tagstable h WHERE hashtagsearch @@ to_tsquery('simple',t.word)
GROUP BY t.word HAVING count(DISTINCT h.user_id) > 1 ORDER BY posts DESC LIMIT 10'

This answer helped quite a bit. https://stackoverflow.com/a/42704207/330987

Upvotes: 0

mkRabbani
mkRabbani

Reputation: 16908

You can use COUNT on DISTINCT value if I get your point correct. The sample query is as below-

SELECT tags,COUNT(DISTINCT user_id)
FROM your_table
GROUP BY tags

Upvotes: 1

Related Questions