Reputation: 369
Using PyYAML for loading a YAML (large) file which has duplicate keys. I would like to preserve all keys and would modify duplicate key according to project need. But it seems PyYAML is silently overwrites results with the last key and not getting a chance to modify it as my need (loss of information), resulting in this dict: {'blocks':{'a':'b2:11 c2:22'}}
simple example YAML:
import yaml
given_str = '''
blocks:
a:
b1:1
c1:2
a:
b2:11
c2:22'''
p = yaml.load(given_str)
How can I load the YAML with duplicate keys so that I get a chance to recursively traverse it and modify keys as my need. I need to load YAML and then transfer it into a database.
Upvotes: 1
Views: 1502
Reputation: 76812
Assuming your input YAML has no merge keys ('<<'), no tags and no comments you want to preserve, you can use the following:
import sys
import ruamel.yaml
from pathlib import Path
from collections.abc import Hashable
file_in = Path('input.yaml')
class MyConstructor(ruamel.yaml.constructor.SafeConstructor):
def construct_mapping(self, node, deep=False):
"""deep is True when creating an object/mapping recursively,
in that case want the underlying elements available during construction
"""
if not isinstance(node, ruamel.yaml.nodes.MappingNode):
raise ConstructorError(
None, None, f'expected a mapping node, but found {node.id!s}', node.start_mark,
)
total_mapping = self.yaml_base_dict_type()
if getattr(node, 'merge', None) is not None:
todo = [(node.merge, False), (node.value, False)]
else:
todo = [(node.value, True)]
for values, check in todo:
mapping: Dict[Any, Any] = self.yaml_base_dict_type()
for key_node, value_node in values:
# keys can be list -> deep
key = self.construct_object(key_node, deep=True)
# lists are not hashable, but tuples are
if not isinstance(key, Hashable):
if isinstance(key, list):
key = tuple(key)
if not isinstance(key, Hashable):
raise ConstructorError(
'while constructing a mapping',
node.start_mark,
'found unhashable key',
key_node.start_mark,
)
value = self.construct_object(value_node, deep=deep)
if key in mapping:
pat = key + '_undup_{}'
index = 0
while True:
nkey = pat.format(index)
if nkey not in mapping:
key = nkey
break
index += 1
mapping[key] = value
total_mapping.update(mapping)
return total_mapping
yaml = ruamel.yaml.YAML(typ='safe')
yaml.default_flow_style = False
yaml.Constructor = MyConstructor
data = yaml.load(file_in)
yaml.dump(data, sys.stdout)
which gives:
blocks:
a: b1:1 c1:2
a_undup_0: b2:11 c2:22
Please note that the values for both a
keys are multiline plain scalars. For b1
and c1
to be a key
the mapping value indicator (:
, the colon) needs to be followed by a whitespace character:
a:
b1: 1
c1: 2
Upvotes: 1
Reputation: 369
After reading many forums, I think best solution is create a wrapper for yml loader (removing duplicates) is the solution. @Anthon - any comment?
import yaml
from collections import defaultdict, Counter
####### Preserving Duplicate ###################
def parse_preserving_duplicates(input_file):
class PreserveDuplicatesLoader(yaml.CLoader):
pass
def map_constructor(loader, node, deep=False):
"""Walk tree, removing degeneracy in any duplicate keys"""
keys = [loader.construct_object(node, deep=deep) for node, _ in node.value]
vals = [loader.construct_object(node, deep=deep) for _, node in node.value]
key_count = Counter(keys)
data = defaultdict(dict) # map all data removing duplicates
c = 0
for key, value in zip(keys, vals):
if key_count[key] > 1:
data[f'{key}{c}'] = value
c += 1
else:
data[key] = value
return data
PreserveDuplicatesLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
map_constructor)
return yaml.load(input_file, PreserveDuplicatesLoader)
##########################################################
with open(inputf, 'r') as file:
fp = parse_preserving_duplicates(input_file)
Upvotes: 0