Reputation: 24478
I'm struggling with using ts_stat to get the number of unique occurrences of tags in a table and sort them by the highest count. What I need though is to only count each entry one time so that only unique entries are counted. I tried group by and distinct but nothing is working for me.
e.g. table
user_id | tags | post_date
===================================
2 | dog cat | 1580049400
2 | dog | 1580039400
3 | dog | 1580038400
3 | dog dog cat | 1580058400
4 | dog horse | 1580028400
Here is the current query
SELECT word, ndoc, nentry
FROM ts_stat($$SELECT to_tsvector('simple', tags) FROM tags WHERE post_date > 1580018400$$)
ORDER BY ndoc DESC
LIMIT 10;
Right now this will produce
word | ndoc | nentry
====================
dog | 5 | 6
cat | 2 | 2
horse| 1 | 1
The result I would be looking for is unique counts so no 1 user can count more than once even if they have > 1 entries after a certain date as noted in the post_date condition (Which might be irrelevant). Like below.
word | total_count_per_user
===========================
dog | 3 (because there are 3 unique users with this term)
cat | 2 (because there are 2 unique users with this term)
horse| 1 (because there are 1 unique users with this term)
UPDATE: I changed the column name to reflect output. The point is no matter how many times a user enters a word. It only needs the unique count per user. e.g. if a user in that scenario creates 100 entries with dog in the text it will only count dog 1 time for that user not 100 counts of dog.
Upvotes: 0
Views: 386
Reputation: 24478
I guess this one was tough. Just in case someone happens to have a similar requirement I was able to get this to work. Seems odd to have to get total with ts_stat then filter it again using distinct, cross join etc so that no matter how many times it finds a word each user only counts once per word. I'm not sure how efficient it will be on a large data set but it yields the expected results.
UPDATE: This is works without using a CTE. Also cross join is the key to filtering on user id.
SELECT DISTINCT (t.word) as tag, count(DISTINCT h.user_id) as posts
FROM ts_stat($$SELECT hashtagsearch FROM tagstable WHERE post_date > 1580018400$$) t
CROSS JOIN tagstable h WHERE hashtagsearch @@ to_tsquery('simple',t.word)
GROUP BY t.word HAVING count(DISTINCT h.user_id) > 1 ORDER BY posts DESC LIMIT 10'
This answer helped quite a bit. https://stackoverflow.com/a/42704207/330987
Upvotes: 0
Reputation: 16908
You can use COUNT on DISTINCT value if I get your point correct. The sample query is as below-
SELECT tags,COUNT(DISTINCT user_id)
FROM your_table
GROUP BY tags
Upvotes: 1