Load specific PyYAML documents from file

Question

I have a .yml file, and I'm trying to load certain documents from it. I know that:

print yaml.load(open('doc_to_open.yml', 'r+'))

will open the first (or only) document in a .yml file, and that:

for x in yaml.load_all(open('doc_to_open.yml', 'r+')):
    print x

will print all the YAML documents in the file. But say I just want to open the first three documents in the file, or only the 8th document. How would I do that?

Anthon · Accepted Answer

If you don't want to parse the first seven YAML documents at all, e.g. for efficiency reasons, you will have to search for the 8th document yourself.

It is possible to hook into the first stage of the parser, count the DocumentStartToken instances in the stream, and only pass on the tokens after the 8th one, stopping at the 9th, but doing that is far from trivial. And that would still tokenize, at least, all of the preceding documents.

The completely inefficient way, whose behavior any efficient replacement would, IMO, need to match, is to use .load_all() and select the appropriate document after completely tokenizing/parsing/composing/resolving all of the documents ¹:

import sys
import ruamel.yaml

yaml = ruamel.yaml.YAML()
with open('input.yaml') as fp:
    for idx, data in enumerate(yaml.load_all(fp)):
        if idx == 7:  # zero-based index, so this is the 8th document
            yaml.dump(data, sys.stdout)

If you run the above on a file input.yaml:

---
document: 0
---
document: 1
---
document: 2
---
document: 3
---
document: 4
---
document: 5
---
document: 6
---
document: 7   # < the 8th document
---
document: 8
---
document: 9
...

you get the output:

document: 7   # < the 8th document
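
Since .load_all() hands back the documents one at a time, the "first three documents" from the question, or just the 8th, can be selected with itertools.islice. A minimal sketch: first_n, nth and fake_load_all are names made up for illustration, and fake_load_all only stands in for yaml.load_all(fp) so that the example runs without a YAML file. The documents before the selected one are still fully parsed; only the ones after it are skipped.

```python
from itertools import islice


def first_n(docs, n):
    """Return the first n items of a (possibly lazy) iterable of documents."""
    return list(islice(docs, n))


def nth(docs, n, default=None):
    """Return the n-th item (1-based) of docs, or default if there are fewer."""
    return next(islice(docs, n - 1, n), default)


def fake_load_all():
    # Hypothetical stand-in for yaml.load_all(fp): one mapping per document.
    for i in range(10):
        yield {'document': i}


print(first_n(fake_load_all(), 3))  # the first three documents
print(nth(fake_load_all(), 8))      # the 8th document
```

With a real file you would pass yaml.load_all(fp) instead of fake_load_all(), inside the with block so the file is still open while the documents are consumed.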

You unfortunately cannot just naively count the document markers (---), as the first document doesn't have to start with one:

document: 0
---
document: 1
.
.

nor does it have to have the marker on the first line if the file starts with a directive ²:

%YAML 1.2
---
document: 0
---
document: 1
.
.

or starts with a "document" consisting of comments only:

# the 8th document is the interesting one
---
document: 0
---
document: 1
.
.

To account for all that you can use:

def get_nth_yaml_doc(stream, doc_nr):
    doc_idx = 0  # 1-based number of the document currently being read
    data = []
    for line in stream:
        if line == '---\n' or line.startswith('--- '):
            doc_idx += 1
            continue
        if line == '...\n':
            break
        if doc_nr < doc_idx:  # already past the requested document
            break
        if line.startswith('%'):  # skip directives
            continue
        if doc_idx == 0:  # YAML files don't have to start with an initial '---'
            if line.lstrip().startswith('#'):
                continue
            doc_idx = 1
        if doc_idx == doc_nr:
            data.append(line)
    return yaml.load(''.join(data))

with open("input.yaml") as fp:
    data = get_nth_yaml_doc(fp, 8)
yaml.dump(data, sys.stdout)

and get:

document: 7   # < the 8th document

in all of the above cases, efficiently, without even tokenizing the preceding YAML documents (nor the following).
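
The same line rules can be turned into a generator that yields the raw text of each document, which also covers the "first three documents" part of the question without parsing anything that isn't needed. This is only a sketch under the same assumptions as the routine above (markers at the start of a line, no byte order marks); iter_raw_docs is a name made up for illustration.

```python
import io
from itertools import islice


def iter_raw_docs(stream):
    """Yield the raw text of each document in a YAML stream, unparsed."""
    lines, started = [], False
    for line in stream:
        if line == '---\n' or line.startswith('--- '):
            if started:
                yield ''.join(lines)  # previous document is complete
            lines, started = [], True
            continue
        if line == '...\n':
            break
        if line.startswith('%'):  # skip directives
            continue
        if not started:
            if line.lstrip().startswith('#'):
                continue  # leading comments are not a document
            started = True  # first document without an initial '---'
        lines.append(line)
    if started:
        yield ''.join(lines)


text = 'document: 0\n---\ndocument: 1\n---\ndocument: 2\n---\ndocument: 3\n'
for raw in islice(iter_raw_docs(io.StringIO(text)), 3):
    print(repr(raw))
```

Each yielded string can then be handed to yaml.load() individually, so only the selected documents are ever tokenized.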

There is an additional caveat: the YAML file could start with a byte order mark (BOM), and the individual documents within a stream can each start with one as well. The above routine doesn't handle that.
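
One way to deal with that, sketched here and not part of the routine above, is to strip a leading U+FEFF from each decoded line before testing it for a marker. The helper names strip_bom and is_doc_start are hypothetical. Opening the file with encoding='utf-8-sig' removes only a mark at the very start of the file, so marks in front of later documents still need this per-line handling.

```python
import io

BOM = '\ufeff'  # the byte order mark as it appears in decoded text


def strip_bom(line):
    """Drop a leading byte order mark from a decoded line, if present."""
    return line[1:] if line.startswith(BOM) else line


def is_doc_start(line):
    """True if the line is a '---' document marker, ignoring a leading BOM."""
    line = strip_bom(line)
    return line == '---\n' or line.startswith('--- ')


stream = io.StringIO('\ufeff---\ndocument: 0\n\ufeff---\ndocument: 1\n')
print(sum(is_doc_start(line) for line in stream))  # number of markers found
```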

¹ This was done using ruamel.yaml, of which I am the author, and which is an enhanced version of PyYAML. AFAIK PyYAML would work the same (but would e.g. drop the comment on the round-trip).
² Technically the directive is in its own directives document, so you should count that as a document, but .load_all() doesn't give you that document back, so I don't count it as such.
