Steffen Roller

Reputation: 3494

Postgres fulltext ignores xml tags

I'm working on a web app that lets the user search within a source repository. The program parses the diffs, but I can't find a way to get all parts of a diff into the Postgres full-text vector.

Example:

select alias, description, token from ts_debug('Link to <a href="//www.yahoo.com">Yahoo!</a> web site');
+-----------+-----------------+----------------------------+
|   alias   |   description   |           token            |
+-----------+-----------------+----------------------------+
| asciiword | Word, all ASCII | Link                       |
| blank     | Space symbols   |                            |
| asciiword | Word, all ASCII | to                         |
| blank     | Space symbols   |                            |
| tag       | XML tag         | <a href="//www.yahoo.com"> |
| asciiword | Word, all ASCII | Yahoo                      |
| blank     | Space symbols   | !                          |
| tag       | XML tag         | </a>                       |
| blank     | Space symbols   |                            |
| asciiword | Word, all ASCII | web                        |
| blank     | Space symbols   |                            |
| asciiword | Word, all ASCII | site                       |
+-----------+-----------------+----------------------------+

The text seems to be parsed correctly, but when I turn it into a document vector the XML tags are not included.

select to_tsvector('simple', 'Link to <a href="//www.yahoo.com">Yahoo!</a> web site') to_tsvector;
+--------------------------------------------+
|                to_tsvector                 |
+--------------------------------------------+
| 'link':1 'site':5 'to':2 'web':4 'yahoo':3 |
+--------------------------------------------+

I guess it has something to do with the configuration?

Any ideas?

Upvotes: 0

Views: 202

Answers (1)

jjanes

Reputation: 44237

The parser does recognize tags, but the built-in 'simple' configuration ignores them. You can see this in psql by running \dF+ simple: any token type not listed there is dropped.

You can tell it not to ignore them:

alter text search configuration simple add mapping for tag with simple;

But you would probably be better off copying the configuration and then modifying the copy.
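The copy-then-modify approach could look roughly like the sketch below. The configuration name simple_with_tags is a hypothetical name chosen for illustration; it assumes the built-in 'simple' configuration is still unmodified and that your role is allowed to create text search configurations in the current schema.

```sql
-- Create a new configuration that starts as an exact copy of 'simple'.
-- "simple_with_tags" is a hypothetical name, not something Postgres ships with.
create text search configuration simple_with_tags (copy = simple);

-- Map the 'tag' token type to the 'simple' dictionary in the copy only,
-- leaving the built-in 'simple' configuration untouched.
alter text search configuration simple_with_tags
    add mapping for tag with simple;

-- Use the copy explicitly when building the vector.
select to_tsvector('simple_with_tags',
                   'Link to <a href="//www.yahoo.com">Yahoo!</a> web site');
```

Keeping the change in a separate configuration means other queries that rely on 'simple' keep their current behavior, and the copy can be dropped with drop text search configuration if it turns out not to be useful.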

You might also need a custom dictionary to process the tags, since the 'simple' dictionary is unlikely to do what you want.

With the tag mapping added, the same query now includes the tags:

select to_tsvector('simple', 'Link to <a href="//www.yahoo.com">Yahoo!</a> web site') to_tsvector;
                                    to_tsvector                                     
------------------------------------------------------------------------------------
 '</a>':5 '<a href="//www.yahoo.com">':3 'link':1 'site':7 'to':2 'web':6 'yahoo':4

Upvotes: 2
