Mike Girard
Mike Girard

Reputation: 464

Searching on json encoded string in Postgres with Python

I have a db query like so which I am executing in Python on a Postgres database:

"Select * from my_tbl where big_string like '%Almodóvar%'"

However, in the column I am searching on Almodóvar is represented as 'Almod\u00f3var' and so the query returns nothing.

What can I do to make the two strings match up? Would prefer to work with Almodóvar on the Python side rather than the column in the database but I am flexible.

Additional info prompted by comments:

The database uses UTF-8. The field I am querying on is acquired from an external API. The data was retrieved RESTfully as json and then inserted into a text field of the database after a json.dump.

Because the data includes a lot of foreign names and characters, working with it has been a series of encoding-related headaches. If there is a silver bullet for making this data play nice with Python, I would be very grateful to know what that is.

UPDATE 2:

It looks like it's json encoding that created my quandary.

print json.dumps("Almodóvar")

yields

"Almod\u00f3var"

which is what I see when I look at the raw data. However, when I use json.dumps to construct this:

"Select * from my_tbl where big_string like '%Almod\u00f3var%'"

the query still yields nothing. I'm stumped.

Upvotes: 0

Views: 1127

Answers (3)

Kalaji
Kalaji

Reputation: 756

Your issue seems to be from a step before your query. From the time you retrieved the data from the Web service. It could be:

  • The encoding is not set to UTF-8 during your communication with the Web service.
  • The encoding from tmdb.org side is not UTF-8 (I'm not sure).

I would look into these 2 points starting with the second possibility first.

Upvotes: 1

roman
roman

Reputation: 117475

from help(json.dumps):

If ``ensure_ascii`` is false, all non-ASCII characters are not escaped, and
the return value may be a ``unicode`` instance. See ``dump`` for details.

from help(json.loads):

If ``s`` is a ``str`` instance and is encoded with an ASCII based encoding
other than utf-8 (e.g. latin-1) then an appropriate ``encoding`` name
must be specified. Encodings that are not ASCII based (such as UCS-2)
are not allowed and should be decoded to ``unicode`` first.

so try something like

>>> js = json.dumps("Almodóvar", ensure_ascii=False)  
>>> res = json.loads(js, encoding="utf-8")
>>> print res
Almodóvar

Upvotes: 2

Hyperboreus
Hyperboreus

Reputation: 32439

Set the character encoding of your postgres tables to utf-8, then it will integrate smoothly with python. Without any need for converting to and fro. Your problem looks like you are using two different encodings for your python code and your DB.

Edit: Almod\u00f3var looks to me like windows code page 1252.

Upvotes: 0

Related Questions