user5268288
user5268288

Reputation:

Postgresql encoding and tsvector (Ubuntu)

EDIT2: I will keep the rest of the text for contextualization, but I've found that the problem has nothing to do with my backups nor my data model. I've managed to replicate the problem doing the following:

createdb -w -T template0 localeTestDb -E UTF8 --locale=pt_PT.utf8

create table test(keywords tsvector);

INSERT INTO test(keywords) VALUES((to_tsvector('portuguese','contemporânea,alho francês')));

select * from test where keywords @@ plainto_tsquery('contemporânea, alho francês');
 keywords 
----------
(0 rows)

I've set encoding to UTF8, created the database with specific locale, on ts_vector used 'portuguese', and I still get no results. Creating the database under the same circunstances ( datcollate and datctype pt_PT.utf8, but without template0) on my PC it works. can the problem be template0? if i try to do

createdb localeTestDb -E UTF8 --locale=pt_PT.utf8

I get.

database creation failed: ERROR: new collation (pt_PT.utf8) is incompatible with the collation of the template database (en_US.UTF-8) HINT: Use the same collation as in the template database, or use template0 as template.

I think it has something to do with encoding and tsvector... I just don't know what.


First of all, I've already searched and I have some ideas, I just need some help since I'm not that familiar with postgresql and Ubuntu. For setup I have my PC (with both windows and Linux15.10) and a remote PC which serves a a server (with ubuntu 14.04.4 LTS) So, I have a database and I have done backups (with encoding UTF8) and I can restore without any issues. That database has a table which has a tsvector type (keywords) and a function that receives an varchar as parameter (along other parameters). On my PC (on both OS) I call this function and get the expected results, e.g:

select * From ufn_get_recipes_by_restrictions(1, '{"contemporânea"}', '{-1}')

returns

id, name          author
1;"Pato Grelhado";"Neuza"

Notice the special character

â

BUT, with the very same scripts to create tables/functions and the same restores, when I call the same function with the same parameters I get 0 rows. For me, the weirdest thing is that I have this row on the server:

                                 keywords                                  
----------------------------------------------------------------------------
 'contemporân':1 'folh':5 'frut':6 'lim':2 'mil':4 'mil-folh':3 'vermelh':7

I mean... there's an â on the keywords table!

doing the

select encoding, datcollate, datctype from pg_database where datname = 'plaza_db';

ON WINDOWS:

encoding  |       datcollate       |       datctype   
----------+-------------+----------------------------------
        6 | Portuguese Brazil.1252 | Portuguese Brazil.1252

ON my UBUNTU:

encoding | datcollate  |  datctype   
----------+-------------+-------------
        6 | pt_PT.utf8 | pt_PT.utf8

ON SERVER:

 encoding | datcollate  |  datctype   
----------+-------------+-------------
        6 | en_US.UTF-8 | en_US.UTF-8

so.. yeah, three different encodings. Portugues Brazil 1252 is <=> to WIN1252, which, as far as I remember reading, is "like" a "subtype" ( I don't really know the sintax here) of UTF8. Since it works with this encoding and pt_PT.utf8, can the problem be about being at en_US.UTF-8? I tried to create with encoding 'LATIN1' and TEMPLATE=template0 but on restore I got an error saying that some special character with UTF8 encoding had no "translation" to LATIN1. What is confusing me is that the rows return with the correct characters, like à, ê, etc etc.. but it fails on search.

What else can I do? I've seen some people saying to change locale and re-install postgresql, but since the server is not mine, that's not really an option.

Really appreciated!

EDIT: I've created a new database just for testing and got this:

INSERT INTO test(keywords) VALUES((to_tsvector('portuguese','contemporânea,alho francês')));
    testencodingdb=# select * from test;      
keywords               
    --------------------------------------
     'alho':2 'contemporân':1 'francês':3
    (1 row)

when doing

select * from test where keywords @@ to_tsquery('(contemporânea)&(alho|francês)');

I get

keywords 
----------
(0 rows)

So, I guess it's a problem with this very specific server. And,

 client_encoding 
-----------------
 UTF8
(1 row) 

Finally, server's locale:

LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=pt_PT.UTF-8
LC_TIME=pt_PT.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=pt_PT.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=pt_PT.UTF-8
LC_NAME=pt_PT.UTF-8
LC_ADDRESS=pt_PT.UTF-8
LC_TELEPHONE=pt_PT.UTF-8
LC_MEASUREMENT=pt_PT.UTF-8
LC_IDENTIFICATION=pt_PT.UTF-8
LC_ALL=

Upvotes: 2

Views: 582

Answers (2)

user5268288
user5268288

Reputation:

Found out the problem. For some reason, on the server I have to give the config parameter on ts_query, in this case, to_tsquery('portuguese', ...)

Upvotes: 1

Chris Travers
Chris Travers

Reputation: 26464

My immediate guess would be that the problem is client_encoding. I would bet it works from Ubuntu as a client but not from Windows? In that case, what is happening is that the special character is getting encoded using the 1252 codepage, but is being interpreted on the other side as UTF8, leading to it effectively querying the wrong string.

Try setting the client encoding specifically to address this problem.

Upvotes: 0

Related Questions