Babu James
Babu James

Reputation: 2843

Postgres full text search ignore url

I am trying to use PostgreSQL to implement a full-text search system.

I encounter this strange or may be intended feature with that.

While trying to index or search for a column which contains names of files with extension (e.g. myimage.jpg), the system treats it as a url and does not properly tokenize.

I referred to the documentation and see that via ts_debug that the file name is taken as a host of a url.

Could some one tell how to take all inputs as normal word in the FTS of PostgreSQL.

Also, on a second request, how can one do a contains, startswith, and endswith searches with it?

Update

I have now tried the statement create text search configuration..., copied from pg_catalog.english and removed host,url, and url_path and then specified the configuration for the ts_debug method. But still no go., myimage.jpg is still identified as host.

Version I use version 9.4

Upvotes: 4

Views: 1063

Answers (1)

Nathan
Nathan

Reputation: 51

tl;dr Look at pre-parsing your input and removing punctuation if you really only want words (and not emails, urls, hosts, etc).

So after trying to figure this out myself the issue is that you don't seem to be able to easily customise the parser. From my understanding the parser runs first, which generates tokens. Those tokens are then matched to dictionaries.

By removing host, url, url_path from the configuration all you are doing is making it so that these tokens don't get looked up in a dictionary, resulting in no lexeme from these tokens. Which essentially means that they don't exist in terms of search. Which is not want you want...

Ideally what you need to do is customise the parser to not generate those tokens in the first place, or to also generate overlapping tokens (similar to how hyphenated words generate a token for the entire word as well as individual components) . This doesn't seem to be possible at the moment without writing a custom parser.

The only solution to this would be to pre-parse the text to remove the full stop. Note that if you rely on other types of tokens like version (e.g. 8.3.0) or email (e.g. [email protected]) this will break those. So you may need to be a bit clever on how you remove characters.

select ts_debug('english', replace('this-is-a-file.jpg', '.', ' '));

"(asciihword,"Hyphenated word, all ASCII",this-is-a-file,{english_stem},english_stem,{this-is-a-fil})"
"(hword_asciipart,"Hyphenated word part, all ASCII",this,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",is,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",a,{english_stem},english_stem,{})"
"(blank,"Space symbols",-,{},,)"
"(hword_asciipart,"Hyphenated word part, all ASCII",file,{english_stem},english_stem,{file})"
"(blank,"Space symbols"," ",{},,)"
"(asciiword,"Word, all ASCII",jpg,{english_stem},english_stem,{jpg})"

In terms of your second question. Are you talking about partial word matches? You get this a little bit with the stemming when using a config like english, so running becomes run which will match if you search for run or running. If you're talking about fuzzy matching it gets a little more complicated. I suggest reading this article http://rachbelaid.com/postgres-full-text-search-is-good-enough/

Upvotes: 2

Related Questions