Dibyendu Dey

Reputation: 369

PyYAML loader with duplicate keys

Using PyYAML to load a (large) YAML file that has duplicate keys. I would like to preserve all keys and would modify the duplicate keys according to project needs. But it seems PyYAML silently overwrites earlier values with the last key, so I never get a chance to modify them (loss of information), resulting in this dict: {'blocks': {'a': 'b2:11 c2:22'}}

simple example YAML:

import yaml
given_str = '''
   blocks:
      a:
        b1:1
        c1:2
    
      a:
        b2:11
        c2:22'''
p = yaml.load(given_str, Loader=yaml.SafeLoader)

How can I load the YAML with duplicate keys so that I get a chance to recursively traverse it and modify keys to my needs? I need to load the YAML and then transfer it into a database.
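A minimal reproduction of the silent overwriting (PyYAML's SafeLoader keeps only the last duplicate):

```python
import yaml

doc = """
a: 1
a: 2
"""
print(yaml.safe_load(doc))  # {'a': 2} - the first 'a' is silently lost
```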

Upvotes: 1

Views: 1502

Answers (2)

Anthon

Reputation: 76812

Assuming your input YAML has no merge keys ('<<'), no tags and no comments you want to preserve, you can use the following:

import sys
import ruamel.yaml
from ruamel.yaml.constructor import ConstructorError
from pathlib import Path
from collections.abc import Hashable

file_in = Path('input.yaml')

class MyConstructor(ruamel.yaml.constructor.SafeConstructor):
    def construct_mapping(self, node, deep=False):
        """deep is True when creating an object/mapping recursively,
        in that case want the underlying elements available during construction
        """
        if not isinstance(node, ruamel.yaml.nodes.MappingNode):
            raise ConstructorError(
                None, None, f'expected a mapping node, but found {node.id!s}', node.start_mark,
            )
        total_mapping = self.yaml_base_dict_type()
        if getattr(node, 'merge', None) is not None:
            todo = [(node.merge, False), (node.value, False)]
        else:
            todo = [(node.value, True)]
        for values, check in todo:
            mapping = self.yaml_base_dict_type()
            for key_node, value_node in values:
                # keys can be list -> deep
                key = self.construct_object(key_node, deep=True)
                # lists are not hashable, but tuples are
                if not isinstance(key, Hashable):
                    if isinstance(key, list):
                        key = tuple(key)
                if not isinstance(key, Hashable):
                    raise ConstructorError(
                        'while constructing a mapping',
                        node.start_mark,
                        'found unhashable key',
                        key_node.start_mark,
                    )

                value = self.construct_object(value_node, deep=deep)
                if key in mapping:
                    pat = key + '_undup_{}'
                    index = 0
                    while True:
                        nkey = pat.format(index)
                        if nkey not in mapping:
                            key = nkey
                            break
                        index += 1
                mapping[key] = value
            total_mapping.update(mapping)
        return total_mapping

 
yaml = ruamel.yaml.YAML(typ='safe')
yaml.default_flow_style = False
yaml.Constructor = MyConstructor
data = yaml.load(file_in)
yaml.dump(data, sys.stdout)

which gives:

blocks:
  a: b1:1 c1:2
  a_undup_0: b2:11 c2:22

Please note that the values for both a keys are multiline plain scalars. For b1 and c1 to be keys, the mapping value indicator (:, the colon) needs to be followed by a whitespace character:

a:
  b1: 1
  c1: 2
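To illustrate the difference (a minimal check, using PyYAML's safe_load here for brevity): without the space after the colon the whole block parses as one folded plain scalar, with it you get a nested mapping.

```python
import yaml

# "b1:1" has no space after the colon -> one plain scalar, folded onto one line
print(yaml.safe_load("a:\n  b1:1\n  c1:2"))    # {'a': 'b1:1 c1:2'}

# "b1: 1" has the space -> b1 and c1 become keys of a nested mapping
print(yaml.safe_load("a:\n  b1: 1\n  c1: 2"))  # {'a': {'b1': 1, 'c1': 2}}
```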

Upvotes: 1

Dibyendu Dey

Reputation: 369

After reading many forums, I think the best solution is to create a wrapper for the YAML loader that renames duplicate keys. @Anthon - any comment?

import yaml
from collections import defaultdict, Counter

####### Preserving duplicates ###################
def parse_preserving_duplicates(input_file):
    # yaml.CLoader requires libyaml; fall back to the pure-Python loader
    class PreserveDuplicatesLoader(getattr(yaml, 'CLoader', yaml.Loader)):
        pass

    def map_constructor(loader, node, deep=False):
        """Walk the tree, renaming any duplicate keys"""
        keys = [loader.construct_object(key, deep=deep) for key, _ in node.value]
        vals = [loader.construct_object(val, deep=deep) for _, val in node.value]
        key_count = Counter(keys)
        data = defaultdict(dict)  # map all data, renaming duplicates
        c = Counter()             # per-key running index for renaming
        for key, value in zip(keys, vals):
            if key_count[key] > 1:
                data[f'{key}{c[key]}'] = value
                c[key] += 1
            else:
                data[key] = value
        return data  # return after the loop, not inside it

    PreserveDuplicatesLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
                                             map_constructor)
    return yaml.load(input_file, PreserveDuplicatesLoader)
##########################################################
with open('input.yaml') as file:
    data = parse_preserving_duplicates(file)
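The renaming step can also be sketched on its own, independent of the YAML loader (a stdlib-only sketch with a per-key counter; `dedup_keys` is a hypothetical name, and string keys are assumed):

```python
from collections import Counter

def dedup_keys(pairs):
    """Rename duplicate keys in (key, value) pairs by appending a running index."""
    counts = Counter(k for k, _ in pairs)  # how often each key occurs
    seen = Counter()                       # per-key index assigned so far
    out = {}
    for k, v in pairs:
        if counts[k] > 1:
            out[f'{k}{seen[k]}'] = v
            seen[k] += 1
        else:
            out[k] = v
    return out

print(dedup_keys([('a', 1), ('a', 2), ('b', 3)]))
# {'a0': 1, 'a1': 2, 'b': 3}
```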

Upvotes: 0
