Marc
Marc

Reputation: 1043

ensure if hashtag matches in search, that it matches whole hashtag

I have an app that utilizes hashtags to help tag posts. I am trying to have a more detailed search.

Lets say one of the records I'm searching is:

The #bird flew very far.

When I search for "flew", "fle", or "#bird", it should return the record.

However, when I search "#bir", it should NOT return the sentence because the whole the tag being searched for doesn't match.

I'm also not sure if "bird" should even return the sentence. I'd be interested how to do that though as well.

Right now, I have a very basic search:

SELECT "posts".* FROM "posts" WHERE (body LIKE '%search%')

Any ideas?

Upvotes: 3

Views: 1413

Answers (2)

Craig Ringer
Craig Ringer

Reputation: 324521

It might help to parse the hash tags out of the text and store them in an array in a separate column called say hashtags when the articles are inserted/updated. Remove them from the article body before feeding it into to_tsvector and store the tsvector in a column of the table. Then use:

WHERE body_tsvector @@ to_tsquery('search') OR 'search' IN hashtags

You could use a trigger on the table to maintain the hashtags column and the body_tsvector stripped of hash tags, so that the application doesn't have to do the work. Parse them out of the text when entries are INSERTed or UPDATEd.

Upvotes: 3

mu is too short
mu is too short

Reputation: 434685

You could do this with LIKE but it would be rather hideous, regexes will serve you better here. If you want to ignore the hashes then a simple search like this will do the trick:

WHERE body ~ E'\\mbird\M''

That would find 'The bird flew very far.' and 'The #bird flew very far.'. You'd want to strip off any #s before search though as this:

WHERE body ~ E'\\m#bird\M''

wouldn't find either of those results due to the nature of \m and \M.

If you don't want to ignore #s in body then you'd have to expand and modify the \m and \M shortcuts yourself with something like this:

WHERE body ~ E'(^|[^\\w#])#bird($|[^\\w#])'
--   search term goes here^^^^^

Using E'(^|[^\\w#])#bird($|[^\\w#])' would find 'The #bird flew very far.' but not 'The bird flew very far.' whereas E'(^|[^\\w#])bird($|[^\\w#])' would find 'The bird flew very far.' but not 'The #bird flew very far.'. You might also want to look at \A instead of ^ and \Z instead of $ as there are subtle differences but I think $ and ^ would be what you want.

You should keep in mind that none of these regex searches (or your LIKE search for that matter) will uses indexes so you're setting yourself up for lots of table scans and performance problems unless you can restrict the searches using something that will use an index. You might want to look at a full-text search solution instead.

Upvotes: 3

Related Questions