Tadej Magajna

Reputation: 2963

Is json.dumps and json.loads safe to run on a list of any string?

Is there any danger in losing information when JSON serialising/deserialising lists of text in Python?

Given a list of strings lst:

lst = ['str1', 'str2', 'str3', ...]

If I run

lst2 = json.loads(json.dumps(lst))

Will lst always be exactly the same as lst2 (i.e. will lst == lst2 always evaluate to True)? Or are there some special, unusual characters that would break either of these methods?

I'm curious because I'll be dealing with a lot of different and unusual characters from various Unicode ranges and I would like to be absolutely certain that this process is 100% robust.

Upvotes: 2

Views: 2740

Answers (2)

tripleee

Reputation: 189679

Depends on what you mean by "exactly the same". We can identify three separate issues:

  • Semantic identity. What you read in is equivalent in meaning to what you write back out, as long as it's well-defined in the first place. Python (depending on version) might reorder dictionary keys, and will commonly prefer Unicode escapes over literal characters for some code points, and vice versa.

    >>> json.loads(json.dumps("\u0050\U0001fea5\U0001f4a9"))
    'P\U0001fea5💩'
    
  • Lexical identity. Nope. As shown above, the JSON representation of Unicode code points can get normalized in different ways, so that \u0050 gets turned into a literal P, and printable emoji may or may not similarly be turned into Unicode escapes, or vice versa.

    (This is distinct from proper Unicode normalization, which makes sure that canonically equivalent code point sequences get turned into the same precise code points.)

  • Garbage in, same garbage out. Nope. If you have invalid input, Python will often crash rather than pass it through, though you can modify some of this behavior by catching errors and/or passing in flags to request less strict handling.

    >>> json.loads(r'"\u123"')
    Traceback (most recent call last):
      ...
    json.decoder.JSONDecodeError: Invalid \uXXXX escape: line 1 column 2 (char 1)
    
    >>> json.loads(r'"\udcff"')
    '\udcff'
    >>> #!? a lone surrogate slips through; it should probably crash or moan instead!
    

You seem to be asking about the first case, but the third can bite your behind badly if you haven't decided what to do with invalid Unicode data.
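For the first case, here is a quick sketch (the sample strings are my own assumptions, chosen from assorted Unicode ranges) showing that lists of valid Unicode strings compare equal after a round trip:

```python
import json

# Illustrative sample strings from several Unicode ranges (not exhaustive):
# plain ASCII, Latin accents, CJK, an emoji, control characters, and an
# unassigned-but-valid code point.
samples = ['plain ascii', 'åéü', '日本語', '💩', '\u0000\u001f', 'P\U0001fea5']

roundtripped = json.loads(json.dumps(samples))

# Any list of valid Unicode strings round-trips to an equal list.
assert roundtripped == samples
```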

The second case would make a difference if you care about the JSON on disk being equivalent between versions; it doesn't seem to matter to you, but future visitors of this question might well care.
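To make the second case concrete, a minimal sketch (the sample character is just an assumption) of two serialisations that differ textually but parse back to equal values:

```python
import json

# The same string serialised two ways: escaped (the default) and literal.
escaped = json.dumps('å')                      # the text '"\u00e5"'
literal = json.dumps('å', ensure_ascii=False)  # the text '"å"'

assert escaped != literal                 # lexically different JSON documents
assert json.loads(escaped) == json.loads(literal) == 'å'  # semantically equal
```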

Upvotes: 3

Lie Ryan

Reputation: 64913

To some degree, yes, it should be safe. Note however that JSON is not defined in terms of byte strings, but rather in terms of Unicode text. That means that before you parse it with json.loads, you need to decode the byte string from whatever text encoding you're using. This encoding/decoding step is where inconsistencies may creep in.
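A minimal sketch of that decode-then-parse step, assuming the bytes are UTF-8 encoded:

```python
import json

# Bytes as they might arrive from disk or the network (assumed UTF-8 here).
raw = '["str1", "ångström"]'.encode('utf-8')

# Decode explicitly before parsing: JSON is defined over text, not bytes.
text = raw.decode('utf-8')
parsed = json.loads(text)

assert parsed == ['str1', 'ångström']
```

(Since Python 3.6, json.loads also accepts bytes directly and auto-detects UTF-8/16/32, but the decode still happens; making it explicit keeps the encoding choice visible.)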

The other implicit question is whether this process will round-trip. It usually will, but that depends on the encoding/decoding steps around it. Depending on the processing, you may end up normalising characters that are considered equivalent in Unicode but composed from different code points. For example, an accented character like å may be encoded as the letter a followed by a combining ring above, or as the single precomposed code point for that character.
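A sketch of that equivalence using the standard unicodedata module: the two spellings of å are distinct code point sequences, and JSON round-trips each one unchanged rather than normalising it for you:

```python
import json
import unicodedata

composed = '\u00e5'                                  # å as one precomposed code point
decomposed = unicodedata.normalize('NFD', composed)  # 'a' + combining ring above

assert composed != decomposed        # equivalent to a reader, unequal under ==

# JSON preserves each code point sequence exactly; it never normalises.
assert json.loads(json.dumps(composed)) == composed
assert json.loads(json.dumps(decomposed)) == decomposed
```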

There's also the issue of JSON escape sequences, which look like "\u1234". Once decoded, Python doesn't preserve whether a character was originally written as a JSON escape or as a literal Unicode character, so you lose that information and the JSON text may not round-trip byte-for-byte.
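For instance (a minimal sketch), the escaped and literal spellings parse to the same Python string, and re-serialising picks one spelling regardless of which one the input used:

```python
import json

# Both spellings of å in JSON decode to the same one-character string.
assert json.loads(r'"\u00e5"') == json.loads('"å"') == 'å'

# Re-serialising chooses the escaped form (with the default ensure_ascii=True),
# so the original spelling is gone.
assert json.dumps(json.loads('"å"')) == r'"\u00e5"'
```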

Apart from those issues in the deep corners of Unicode nerdery regarding equivalent characters and normalisation, encoding and decoding from/to JSON itself is pretty safe.

Upvotes: 1
