Reputation: 379
I launched this code two hours ago:
python -c "import sys, yaml, json; json.dump(yaml.load(sys.stdin), sys.stdout, indent=4)" < input.yaml > output.json
I have a 450 MB YAML file and the conversion is taking an incredible amount of time (it is actually still running). Before that I tested the code (which I found on the internet) with a 2 MB YAML test file, and got a 9 MB JSON file out. That is already strange to me (the JSON needs more space), but the stranger thing is that Python takes more time to open the 2 MB YAML file than the 9 MB JSON file: the JSON file opens instantly, while the YAML file needs 2 minutes. Can someone explain this? Where is the magic behind Python, the JSON parser, and the YAML parser?
Upvotes: 1
Views: 2440
Reputation: 76599
YAML is much more complex than JSON, so during parsing a YAML parser has to do many more checks, e.g. whether the next node on input switches from block style to flow style. More effort simply takes more time.
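A quick microbenchmark (synthetic payload and sizes chosen arbitrarily for illustration) makes the gap visible. Note that yaml.safe_load here uses the pure-Python SafeLoader, while json.loads is C-accelerated by default:

```python
import json
import timeit

import yaml  # PyYAML

# One small synthetic payload, serialized both ways.
obj = {"users": [{"id": i, "name": f"user{i}"} for i in range(200)]}
as_json = json.dumps(obj)
as_yaml = yaml.safe_dump(obj)

# Parse each representation repeatedly and compare wall time.
t_json = timeit.timeit(lambda: json.loads(as_json), number=50)
t_yaml = timeit.timeit(lambda: yaml.safe_load(as_yaml), number=50)
print(f"json: {t_json:.4f}s  yaml: {t_yaml:.4f}s")
```

On a typical install the YAML side is slower by one to two orders of magnitude, even for identical data.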
So even if you load with the CLoader (which in PyYAML you have to request explicitly, whereas the C-accelerated parser is the default for the json standard library), you will be slower.
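In PyYAML that explicit opt-in looks roughly like this (a sketch; the name FastestLoader is my own, and the fallback covers installs where PyYAML was built without libyaml):

```python
import yaml

# CSafeLoader only exists if PyYAML was compiled against libyaml,
# so fall back to the pure-Python SafeLoader otherwise.
try:
    from yaml import CSafeLoader as FastestLoader
except ImportError:
    from yaml import SafeLoader as FastestLoader

def load_yaml(stream):
    # Equivalent to yaml.safe_load(), but C-accelerated when possible.
    return yaml.load(stream, Loader=FastestLoader)

print(load_yaml("a: 1\nitems: [2, 3]"))
```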
That the YAML file can be smaller should be obvious: it does not need the (double) quote overhead that JSON requires. But reading the file is not where the time is spent; it is spent in parsing, and that is simply harder for YAML.
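A small sketch of that size difference, using synthetic data and mirroring the indent=4 that the question's conversion command passed to json.dump:

```python
import json

import yaml

# Synthetic payload, purely for illustration.
obj = {"servers": [{"host": f"h{i}", "port": 8000 + i} for i in range(100)]}

as_yaml = yaml.safe_dump(obj)        # block style, no quoting needed
as_json = json.dumps(obj, indent=4)  # quoted keys/strings plus indentation

print(f"YAML: {len(as_yaml)} bytes, JSON: {len(as_json)} bytes")
```

For data like this the YAML text comes out noticeably smaller, which matches the 2 MB → 9 MB growth the question observed.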
The YAML parser (both pure Python and with the help of the CLoader) has a lot more memory overhead than the JSON parser, and allocating that memory comes at a price as well. I would not be surprised if loading your 2 MB YAML file uses 100 MB+ of memory.
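You can get a rough feel for that with the stdlib tracemalloc module. A caveat: tracemalloc only sees Python-level allocations, so the CLoader's C-side memory would be undercounted; this sketch therefore uses the pure-Python SafeLoader:

```python
import tracemalloc

import yaml

# A small synthetic document, a few tens of KB of text.
doc = "items:\n" + "".join(f"- name: item{i}\n  id: {i}\n" for i in range(500))

tracemalloc.start()
data = yaml.safe_load(doc)  # pure-Python SafeLoader
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"text: {len(doc)} bytes, peak parse memory: {peak} bytes")
```

The peak during parsing is typically many times the size of the input text, since the parser builds token, event, and node objects before the final Python objects.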
It is of course also easier to optimize a parser for a smaller grammar. The CLoader for YAML can certainly be improved upon, but doing so takes a lot of time.
There is also the difference that with json.load() the whole file gets read into memory and then parsed (load() just calls loads() under the hood). YAML parses a stream, and can in principle handle files larger than memory/swap space, although there is some overhead in that as well. On the other hand, parsing YAML has quite a bit of overhead, so large files are going to be a problem there too.
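The streaming side is easiest to see with yaml.safe_load_all(), which yields documents lazily from a multi-document stream rather than materializing everything at once (a sketch; the stdlib json module has no comparable streaming loader):

```python
import io
import json

import yaml

# json.load() reads the whole stream, then parses it in one go.
blob = io.StringIO('{"a": 1}')
print(json.load(blob))

# yaml.safe_load_all() is a generator: each document is parsed
# on demand, so a huge multi-document file need not fit in memory.
stream = io.StringIO("a: 1\n---\nb: 2\n---\nc: 3\n")
for doc in yaml.safe_load_all(stream):
    print(doc)
```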
Upvotes: 8