SU3

Reputation: 5409

Parse yaml list elements one at a time in python

Is there a YAML library in Python that can read an input file one entry at a time, as needed, rather than parsing the whole file? I have a long file with a list as the root node. If I'm trying to find the first element satisfying a certain property, I may not need to read and parse the whole file, so I could get the result faster.

Upvotes: 1

Views: 1506

Answers (1)

flyx

Reputation: 39788

You can use PyYAML's low-level parse() API:

import yaml

# `input` is a YAML string or an open file object
for event in yaml.parse(input):
    print(event)  # process the event here

The event types are described in the PyYAML documentation.
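To get a feel for the event stream, the short sketch below feeds a small sample document (made up for illustration) to yaml.parse() and collects the event class names. Nothing is constructed into Python values at this level; each event only describes one structural step.

```python
import yaml

# Sample document, chosen only for illustration.
text = "- a: 1\n- foo\n"

# yaml.parse() returns a generator, so events are produced on demand.
names = [type(event).__name__ for event in yaml.parse(text)]

for name in names:
    print(name)
```

The stream is bracketed by StreamStartEvent and StreamEndEvent, and the root list shows up as a SequenceStartEvent / SequenceEndEvent pair around the events of its items.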

If you want to construct each item of a root-level sequence into a native Python value, you need to use the Composer and Constructor classes. Composer reads events and transforms them into nodes; Constructor builds Python values from nodes. This corresponds to the loading process defined in the YAML spec:


[Figure: overview of the YAML loading process, from the YAML spec (source: yaml.org)]
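The two stages can be run separately on a whole document, which may make the division of labour easier to see. This is a minimal sketch with a made-up sample document; it uses SafeLoader and calls get_single_node() (Composer) followed by construct_document() (Constructor), which is roughly what a normal load does in one go.

```python
import yaml

# Sample document, chosen only for illustration.
text = "- a: 1\n- foo\n"

loader = yaml.SafeLoader(text)
try:
    # Composer stage: events -> node graph
    node = loader.get_single_node()
    # Constructor stage: node graph -> native Python values
    value = loader.construct_document(node)
finally:
    loader.dispose()

print(type(node).__name__, node.tag)
print(value)
```

Note that this variant still processes the whole document; it only illustrates the stages, not the item-by-item loading.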

Now PyYAML's Composer expects the methods get_event, check_event and peek_event to exist on self, but doesn't implement them. They are implemented by Parser. Therefore, to get a working YAML loading chain, PyYAML combines all of these classes into a single Loader class:

class Loader(Reader, Scanner, Parser, Composer, Constructor, Resolver):
    def __init__(self, stream):
        Reader.__init__(self, stream)
        Scanner.__init__(self)
        Parser.__init__(self)
        Composer.__init__(self)
        Constructor.__init__(self)
        Resolver.__init__(self)

For you, this means that you need a Loader object. You use its Parser API to step through the top-level events, and its Composer and Constructor APIs to load each item of the top-level sequence.
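A quick way to convince yourself that one loader object exposes all three APIs is to inspect the classes. The sketch below uses SafeLoader, which is assembled from the same mixins (with SafeConstructor as its constructor class).

```python
import yaml
from yaml.composer import Composer
from yaml.parser import Parser

# SafeLoader inherits from both Parser and Composer.
is_parser = issubclass(yaml.SafeLoader, Parser)
is_composer = issubclass(yaml.SafeLoader, Composer)

# Composer itself has no check_event; the loader gets it from Parser.
composer_has_check_event = hasattr(Composer, "check_event")
check_event_from_parser = yaml.SafeLoader.check_event is Parser.check_event

# Event-level, node-level and value-level methods all live on one object.
loader = yaml.SafeLoader("- 1\n")
api_present = all(
    hasattr(loader, name)
    for name in ("check_event", "get_event", "compose_node", "construct_object")
)
loader.dispose()

print(is_parser, is_composer, composer_has_check_event,
      check_event_from_parser, api_present)
```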

Here's some code that gets you started:

import yaml

# named yaml_text to avoid shadowing the built-in input()
yaml_text = """
- "A": 1
- "B": 2
- foo
- 1
"""

loader = yaml.SafeLoader(yaml_text)

# check proper stream start (should never fail)
assert loader.check_event(yaml.StreamStartEvent)
loader.get_event()
assert loader.check_event(yaml.DocumentStartEvent)
loader.get_event()

# assume the root element is a sequence
assert loader.check_event(yaml.SequenceStartEvent)
loader.get_event()

# now while the next event does not end the sequence, process each item
while not loader.check_event(yaml.SequenceEndEvent):
    # compose current item to a node as if it was the root node
    node = loader.compose_node(None, None)
    # construct a native Python value with the node.
    # we set deep=True for complete processing of all the node's children
    value = loader.construct_object(node, True)
    print(value)

# consume the SequenceEndEvent that ended the loop
loader.get_event()
# assume the document ends and no further documents are in the stream
assert loader.check_event(yaml.DocumentEndEvent)
loader.get_event()
assert loader.check_event(yaml.StreamEndEvent)

Be aware that you might run into problems if you have anchors & aliases in the YAML document.
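To tie this back to the question, the loop can be wrapped in a generator so that the caller stops as soon as a matching element turns up. This is only a sketch under a few assumptions: iter_root_sequence is a hypothetical helper name (not part of PyYAML), the stream holds a single document whose root is a sequence, and the sample data and the "has a B key" predicate are made up for illustration. The caveat about anchors and aliases applies here as well.

```python
import io
import yaml

SAMPLE = """
- "A": 1
- "B": 2
- foo
- 1
"""


def iter_root_sequence(stream, loader_cls=yaml.SafeLoader):
    """Yield the items of a root-level sequence one at a time.

    Hypothetical helper, not part of PyYAML. Assumes a single document
    whose root node is a sequence.
    """
    loader = loader_cls(stream)
    try:
        # Consume the events that precede the first item.
        for expected in (yaml.StreamStartEvent,
                         yaml.DocumentStartEvent,
                         yaml.SequenceStartEvent):
            if not loader.check_event(expected):
                raise ValueError("expected " + expected.__name__)
            loader.get_event()
        # Compose and construct one item per iteration.
        while not loader.check_event(yaml.SequenceEndEvent):
            node = loader.compose_node(None, None)
            yield loader.construct_object(node, deep=True)
    finally:
        loader.dispose()


# Stop at the first mapping with a "B" key; later items are never constructed.
items = iter_root_sequence(io.StringIO(SAMPLE))
first_match = next((x for x in items if isinstance(x, dict) and "B" in x), None)
items.close()  # run the generator's cleanup now rather than at garbage collection
print(first_match)
```

With a real file you would pass the open file object instead of the io.StringIO wrapper, so the file is read incrementally as the parser asks for more input.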

Upvotes: 2
