MrE

Reputation: 20798

Issue with encoding when importing JSON into Postgres

I'm using pandas and exporting data as JSON like this:

import pandas as pd
df = pd.DataFrame({'a': ['Têst']})

df.to_json(orient='records', lines=True)
> u'{"a":"T\\u00east"}'

This makes sense: ê is code point U+00EA, so to_json writes it as the JSON escape \u00ea (force_ascii defaults to True), and the extra backslash shown above is just Python's repr escaping.
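A quick check with the standard json module shows the escape round-trips losslessly, so the export side is fine:

import json

s = df.to_json(orient='records', lines=True)
print(s)                   # {"a":"T\u00east"}
print(json.loads(s)['a'])  # Têst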

But then I import the JSON strings into Postgres with COPY:

import cStringIO

buffer = cStringIO.StringIO()
buffer.write(df.to_json(orient='records', lines=True))
buffer.seek(0)

with connection.cursor() as cursor:
    cursor.copy_expert(sql="""
    COPY tmp (json_data) FROM STDIN WITH NULL AS '' ENCODING 'UTF8';
    """, file=buffer)

The problem is that the result in the database ends up being

{"a": "Tu00east"}

and as you can see the backslash is gone, taking the \u escape with it.

I tried using CSV as the COPY format, but it mangles things since some of the data contains commas, and my attempts to set the ESCAPE character and DELIMITER to something else always seemed to cause failures.

The table column is of type jsonb. I read in the docs that PG doesn't allow Unicode escapes for non-ASCII characters (those above U+007F) unless the database encoding is UTF8, which it is in my case, so that shouldn't be the issue.
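For example, casting the escaped string directly in psql (against a UTF8 database) works fine:

select '{"a":"T\u00east"}'::jsonb;
     jsonb
---------------
 {"a": "Têst"}
(1 row)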

I'm trying to figure out why the escaping backslashes are removed here, and how to import into Postgres while preserving the data.

Upvotes: 0

Views: 2673

Answers (1)

AdamKG

Reputation: 14091

Use the csv option for COPY, with DELIMITER e'\x01' QUOTE e'\x02'. The default text format treats a backslash in the data as the start of a COPY escape sequence, which is exactly what eats the \ in \u00ea; the csv format does no backslash processing, and with the delimiter and quote set to control characters that won't occur in serialized JSON, each line passes through untouched. I'm not sure whether this works for all possible valid JSON, but I've never had it fail.

$ psql -X testdb -c 'create table t(d jsonb)'
CREATE TABLE
$ cat foo.json
{"a":"Têst"}
$ cat foo.json | psql -X testdb -c "COPY t from stdin csv delimiter e'\x01' quote e'\x02'" 
COPY 1
$ psql -X testdb -c 'select * from t';                                                    
       d       
---------------
 {"a": "Têst"}
(1 row)
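In the asker's psycopg2 setup, the same options can be passed through copy_expert. A sketch reusing the question's connection, tmp table, and json_data column; note the doubled backslashes so that Python hands the e'\x01' literals to Postgres intact:

import cStringIO

buffer = cStringIO.StringIO()
buffer.write(df.to_json(orient='records', lines=True))
buffer.seek(0)

with connection.cursor() as cursor:
    cursor.copy_expert(sql="""
    COPY tmp (json_data) FROM STDIN
    WITH (FORMAT csv, DELIMITER e'\\x01', QUOTE e'\\x02', ENCODING 'UTF8');
    """, file=buffer)

The two control characters can't appear unescaped inside valid JSON, so nothing in the data collides with the delimiter or quote.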

Upvotes: 3
