Reputation: 736

Getting duplicate keys in YAML using Python

We need to parse YAML files that contain duplicate keys, and every occurrence must be kept; skipping duplicates is not enough. I know this is against the YAML spec and I would rather not do it, but a third-party tool we use permits such files and we have to deal with them.

File example:

build:
  step: 'step1'

build:
  step: 'step2'

After parsing we should have a similar data structure to this:

yaml.load('file.yml')
# [('build', [('step', 'step1')]), ('build', [('step', 'step2')])]

dict can no longer be used to represent the parsed contents.

I am looking for a solution in Python and could not find a library that supports this. Have I missed anything?

Alternatively, I am happy to write my own, but would like to keep it as simple as possible. ruamel.yaml looks like the most advanced YAML parser in Python and seems moderately extensible. Can it be extended to support duplicate keys?

Upvotes: 19

Answers (4)

Falko

Reputation: 17907

Here is an alternative implementation based on Anthon's answer and ruamel.yaml. It is rather generic and uses lists for duplicates, while other entries are left unchanged.

from collections import Counter
from ruamel.yaml import YAML
from ruamel.yaml.constructor import SafeConstructor

yaml_str = '''
a: 1
b: 2
b: 2
'''

def construct_yaml_map(self, node):
    data = {}
    yield data
    keys = [self.construct_object(key_node, deep=True) for key_node, _ in node.value]
    vals = [self.construct_object(value_node, deep=True) for _, value_node in node.value]
    key_count = Counter(keys)
    for key, val in zip(keys, vals):
        if key_count[key] > 1:
            if key not in data:
                data[key] = []
            data[key].append(val)
        else:
            data[key] = val

SafeConstructor.add_constructor(u'tag:yaml.org,2002:map', construct_yaml_map)
yaml = YAML(typ='safe')
data = yaml.load(yaml_str)
print(data)

Output:

{'a': 1, 'b': [2, 2]}

The same is possible with the pyyaml package (inspired by Wilfred Hughes' answer):

from collections import Counter
import yaml

yaml_str = '''
a: 1
b: 2
b: 2
'''

def parse_preserving_duplicates(src):
    class PreserveDuplicatesLoader(yaml.loader.Loader):
        pass

    def map_constructor(loader, node, deep=False):
        keys = [loader.construct_object(key_node, deep=deep) for key_node, _ in node.value]
        vals = [loader.construct_object(value_node, deep=deep) for _, value_node in node.value]
        key_count = Counter(keys)
        data = {}
        for key, val in zip(keys, vals):
            if key_count[key] > 1:
                if key not in data:
                    data[key] = []
                data[key].append(val)
            else:
                data[key] = val
        return data

    PreserveDuplicatesLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, map_constructor)
    return yaml.load(src, PreserveDuplicatesLoader)

print(parse_preserving_duplicates(yaml_str))

Output:

{'a': 1, 'b': [2, 2]}
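
The grouping rule that both variants share can be tried in isolation, without a YAML library. The following is only a sketch: `group_duplicates` is a hypothetical helper name, not part of either package, and it takes already-constructed key/value pairs as input.

```python
from collections import Counter


def group_duplicates(pairs):
    """Collect values of repeated keys into lists; leave unique keys unchanged."""
    pairs = list(pairs)
    key_count = Counter(key for key, _ in pairs)
    data = {}
    for key, val in pairs:
        if key_count[key] > 1:
            # Repeated key: accumulate all of its values in document order.
            data.setdefault(key, []).append(val)
        else:
            data[key] = val
    return data


print(group_duplicates([('a', 1), ('b', 2), ('b', 2)]))
# {'a': 1, 'b': [2, 2]}
```

One consequence of this design: a unique key whose value is already a list, such as `b: [2, 2]`, produces the same result as a duplicated key, so the two cases cannot be told apart afterwards.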

Upvotes: 1

Wilfred Hughes

Reputation: 31171

You can override how pyyaml loads keys. For example, you could use a defaultdict with a list of values for each key:

from collections import defaultdict
import yaml


def parse_preserving_duplicates(src):
    # We deliberately define a fresh class inside the function,
    # because add_constructor is a class method and we don't want to
    # mutate pyyaml classes.
    class PreserveDuplicatesLoader(yaml.loader.Loader):
        pass

    def map_constructor(loader, node, deep=False):
        """Walk the mapping, recording any duplicate keys.

        """
        mapping = defaultdict(list)
        for key_node, value_node in node.value:
            key = loader.construct_object(key_node, deep=deep)
            value = loader.construct_object(value_node, deep=deep)

            mapping[key].append(value)

        return mapping

    PreserveDuplicatesLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, map_constructor)
    return yaml.load(src, PreserveDuplicatesLoader)
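
Note that the result is a defaultdict, so looking up a missing key inserts an empty list instead of raising KeyError. If that is unwanted, the result can be converted to plain dicts afterwards. A minimal sketch, assuming nothing beyond the standard library: `to_plain` is a hypothetical helper, not part of PyYAML, and the sample input is built by hand to mirror what `map_constructor` should return for the file in the question.

```python
from collections import defaultdict


def to_plain(obj):
    """Recursively replace defaultdicts with ordinary dicts."""
    if isinstance(obj, dict):
        return {key: to_plain(val) for key, val in obj.items()}
    if isinstance(obj, list):
        return [to_plain(item) for item in obj]
    return obj


# Hand-built stand-in for the parser output (an assumption for illustration).
inner1 = defaultdict(list, {'step': ['step1']})
inner2 = defaultdict(list, {'step': ['step2']})
parsed = defaultdict(list, {'build': [inner1, inner2]})

plain = to_plain(parsed)
print(plain)
# {'build': [{'step': ['step1']}, {'step': ['step2']}]}
```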

Upvotes: 6

Anthon

Reputation: 76802

PyYAML will just silently overwrite the first entry. ruamel.yaml¹ will give a DuplicateKeyFutureWarning if used with the legacy API, and raise a DuplicateKeyError with the new API.

If you don't want to create a full Constructor for all types, overwriting the mapping constructor in SafeConstructor should do the job:

import sys
from ruamel.yaml import YAML
from ruamel.yaml.constructor import SafeConstructor

yaml_str = """\
build:
  step: 'step1'

build:
  step: 'step2'
"""


def construct_yaml_map(self, node):
    # collect every (key, value) pair in order, duplicates included
    data = []
    yield data
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        val = self.construct_object(value_node, deep=True)
        data.append((key, val))


SafeConstructor.add_constructor(u'tag:yaml.org,2002:map', construct_yaml_map)
yaml = YAML(typ='safe')
data = yaml.load(yaml_str)
print(data)

which gives:

[('build', [('step', 'step1')]), ('build', [('step', 'step2')])]
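
Since the result is a list of (key, value) pairs rather than a dict, lookups need a small helper. A sketch, with `values_for` as a hypothetical name and the structure above typed in by hand:

```python
def values_for(pairs, key):
    """Return every value stored under `key`, in document order."""
    return [val for k, val in pairs if k == key]


data = [('build', [('step', 'step1')]), ('build', [('step', 'step2')])]

# Each 'build' value is itself a list of pairs, so the helper applies at every level.
steps = [values_for(build, 'step') for build in values_for(data, 'build')]
print(steps)
# [['step1'], ['step2']]
```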

However it doesn't seem necessary to make step: 'step1' into a list. The following will only create the list if there are duplicate items (could be optimised if necessary, by caching the result of the self.construct_object(key_node, deep=True)):

def construct_yaml_map(self, node):
    # test if there are duplicate node keys
    keys = set()
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        if key in keys:
            break
        keys.add(key)
    else:
        data = {}  # type: Dict[Any, Any]
        yield data
        value = self.construct_mapping(node)
        data.update(value)
        return
    data = []
    yield data
    for key_node, value_node in node.value:
        key = self.construct_object(key_node, deep=True)
        val = self.construct_object(value_node, deep=True)
        data.append((key, val))

which gives:

[('build', {'step': 'step1'}), ('build', {'step': 'step2'})]

Some points:

Probably needless to say, this will not work with YAML merge keys (<<: *xyz).
If you need ruamel.yaml's round-trip capabilities (yaml = YAML()), that will require a more complex construct_yaml_map.
If you want to dump the output, you should instantiate a new YAML() instance for that, instead of re-using the "patched" one used for loading (it might work, this is just to be sure):
```
yaml_out = YAML(typ='safe')
yaml_out.dump(data, sys.stdout)
```
which gives (with the first construct_yaml_map):
```
- - build
  - - [step, step1]
- - build
  - - [step, step2]
```
What works in neither PyYAML nor ruamel.yaml is yaml.load('file.yml'). If you don't want to open() the file yourself, you can do:
```
from pathlib import Path  # or: from ruamel.std.pathlib import Path
from ruamel.yaml import YAML

yaml = YAML(typ='safe')
data = yaml.load(Path('file.yml'))
```

¹ Disclaimer: I am the author of that package.

Upvotes: 15

Simon Fraser

Reputation: 2818

If you can modify the input data very slightly, you should be able to do this by converting the single YAML-like file into multiple YAML documents. Several YAML documents can live in the same file if they are separated by --- on a line by itself, and your entries handily appear to be separated by a blank line (two consecutive newlines):

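import yaml

with open('file.yml', 'r') as f:
    data = f.read()

# Turn each blank line into a YAML document separator.
data = data.replace('\n\n', '\n---\n')

# safe_load_all supplies SafeLoader; PyYAML 6 requires an explicit Loader for load_all.
for document in yaml.safe_load_all(data):
    print(document)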

Output:

{'build': {'step': 'step1'}}
{'build': {'step': 'step2'}}
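
The plain replace assumes exactly one blank line between entries. A slightly more tolerant variant, offered only as a sketch, treats any run of blank lines as a separator and ignores leading and trailing newlines. `split_documents` is a hypothetical helper name, and it still assumes that blank lines never occur inside an entry (for example in a block scalar).

```python
import re


def split_documents(text):
    """Join blank-line-separated chunks with YAML document separators."""
    chunks = [c for c in re.split(r'\n\s*\n', text.strip('\n')) if c.strip()]
    return '\n---\n'.join(chunks) + '\n'


# Two entries separated by two blank lines instead of one.
src = "build:\n  step: 'step1'\n\n\nbuild:\n  step: 'step2'\n"
converted = split_documents(src)
print(converted)
```

The returned string can be fed to the same loading loop as above.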

Upvotes: 2
