Swier

Reputation: 4186

Writing pandas DataFrame to JSON in unicode

I'm trying to write a pandas DataFrame containing Unicode to JSON, but the built-in .to_json method escapes the non-ASCII characters. How do I fix this?

Example:

import pandas as pd

df = pd.DataFrame([["τ", "a", 1], ["π", "b", 2]])
df.to_json("df.json")

This gives:

{"0":{"0":"\u03c4","1":"\u03c0"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}

Which differs from the desired result:

{"0":{"0":"τ","1":"π"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}

I have tried adding the force_ascii=False argument:

import pandas as pd

df = pd.DataFrame([["τ", "a", 1], ["π", "b", 2]])
df.to_json("df.json", force_ascii=False)

But this gives the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u03c4' in position 11: character maps to <undefined>

This occurs on pandas versions 0.18 to 2.2+, with Python 3.4 to 3.12+.

Upvotes: 51

Views: 50549

Answers (4)

user4043890

This works fine on macOS:

df.to_json('df.json', force_ascii=False)

Upvotes: 2

stackoverflowed

Reputation: 917

I had the same issue, although I wasn't writing to a file. The solution was to encode the string as 'utf-8': df.to_json(force_ascii=False).encode('utf-8')
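
To spell out what that one-liner produces, here is a minimal sketch reusing the DataFrame from the question. The variable names (text, payload, restored) are only illustrative. Note that the result is a bytes object rather than a str, so it suits binary files or network responses.

```python
import json

import pandas as pd

df = pd.DataFrame([["τ", "a", 1], ["π", "b", 2]])

# With no path argument, to_json returns the JSON document as a str.
text = df.to_json(force_ascii=False)

# Encoding turns the str into UTF-8 bytes, e.g. for a binary file or a socket.
payload = text.encode("utf-8")

# Decoding and parsing the bytes gives the original values back.
restored = json.loads(payload.decode("utf-8"))
print(restored["0"])
```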

Upvotes: -2

Lukas

Reputation: 2312

There is another way to do this. JSON consists of keys (strings in double quotes) and values (strings, numbers, nested objects, or arrays), which closely resembles a Python dictionary, so you can get JSON from a pandas DataFrame with a simple conversion and a string operation:

import pandas as pd
df = pd.DataFrame([['τ', 'a', 1], ['π', 'b', 2]])

# convert index values to string (when they're something else - JSON requires strings for keys)
df.index = df.index.map(str)
# convert column names to string (when they're something else - JSON requires strings for keys)
df.columns = df.columns.map(str)

# convert DataFrame to dict, dict to string and simply jsonify quotes from single to double quotes  
js = str(df.to_dict()).replace("'", '"')
print(js) # print or write to file or return as REST...anything you want

Output:

{"0": {"0": "τ", "1": "π"}, "1": {"0": "a", "1": "b"}, "2": {"0": 1, "1": 2}}

UPDATE: As @Swier pointed out (thank you), strings in the original DataFrame that contain double quotes are a problem. df.to_json() escapes them (i.e. '"a"' becomes "\"a\"" in JSON), and a small addition to the string approach handles this too. Complete example:

import pandas as pd

def run_jsonifier(df):
    # convert index values to string (when they're something else)
    df.index = df.index.map(str)
    # convert column names to string (when they're something else)
    df.columns = df.columns.map(str)

    # convert DataFrame to dict and dict to string
    js = str(df.to_dict())
    # record the positions of the double quotes already present in the data
    idx = [i for i, ch in enumerate(js) if ch == '"']
    # turn Python's single-quote delimiters into JSON's double quotes
    js = js.replace("'", '"')
    # prefix each original double quote with a backslash; every earlier
    # insertion shifts the remaining positions right by one, hence `+ add`
    for add, i in enumerate(idx):
        js = js[:i+add] + '\\' + js[i+add:]
    return js

# define double-quotes-rich dataframe
df = pd.DataFrame([['τ', '"a"', 1], ['π', 'this" breaks >>"<""< ', 2]])

# run our function to convert dataframe to json
print(run_jsonifier(df))
# run original `to_json()` to see difference
print(df.to_json())

Output:

{"0": {"0": "τ", "1": "π"}, "1": {"0": "\"a\"", "1": "this\" breaks >>\"<\"\"< "}, "2": {"0": 1, "1": 2}}
{"0":{"0":"\u03c4","1":"\u03c0"},"1":{"0":"\"a\"","1":"this\" breaks >>\"<\"\"< "},"2":{"0":1,"1":2}}
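
One caveat, offered as a hedged aside rather than as part of the answer above: swapping quotes assumes that no value contains an apostrophe. The sketch below uses a plain dict (the hypothetical data, standing in for the result of df.to_dict()) to show that case failing, next to the standard library's json.dumps with ensure_ascii=False as one possible alternative that keeps the non-ASCII characters unescaped.

```python
import json

# Hypothetical stand-in for df.to_dict(); one value contains an apostrophe.
data = {"0": {"0": "it's τ", "1": "π"}}

# Quote swapping also rewrites the apostrophe, which ends the string early.
swapped = str(data).replace("'", '"')
try:
    json.loads(swapped)
    swapped_is_valid = True
except json.JSONDecodeError:
    swapped_is_valid = False

# json.dumps handles quoting and escaping; ensure_ascii=False keeps τ and π as-is.
dumped = json.dumps(data, ensure_ascii=False)

print(swapped_is_valid)
print(dumped)
```

The same json.dumps call could be applied to df.to_dict() once the index and column labels have been converted to strings as shown earlier.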

Upvotes: 2

Swier

Reputation: 4186

Opening a file with the encoding set to utf-8, and then passing that file to the .to_json function fixes the problem:

with open('df.json', 'w', encoding='utf-8') as file:
    df.to_json(file, force_ascii=False)

This gives the correct output:

{"0":{"0":"τ","1":"π"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}

Note: it does still require the force_ascii=False argument.
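
For anyone who wants to check this on their own machine, here is a small round-trip sketch. It writes to a temporary directory (my choice, to avoid leaving files behind) and reads the file back with the same encoding; the names path, text, and loaded are only illustrative.

```python
import json
import os
import tempfile

import pandas as pd

df = pd.DataFrame([["τ", "a", 1], ["π", "b", 2]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "df.json")

    # Open the file as UTF-8 explicitly, then hand the file object to to_json.
    with open(path, "w", encoding="utf-8") as file:
        df.to_json(file, force_ascii=False)

    # Read the file back with the same encoding.
    with open(path, "r", encoding="utf-8") as file:
        text = file.read()

# The raw text holds the characters themselves, and it parses as JSON.
loaded = json.loads(text)
print(text)
```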

Upvotes: 81
